Can memmap pandas series. What about a dataframe?

python pandas numpy multidimensional-array numpy-memmap

OK ... after a lot of digging, here's what's going on. Pandas' DataFrame uses the BlockManager class to organize its data internally. Contrary to the docs, a DataFrame is NOT a collection of Series but a collection of similarly dtyped matrices. BlockManager groups all the float columns together, all the int columns together, and so on, and their memory (from what I can tell) is kept together.
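To see this grouping for yourself, here is a rough sketch that lists the internal blocks of a small frame. It assumes the private `_mgr` attribute (`_data` in older releases), which is an implementation detail and may change or disappear; `describe_blocks` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def describe_blocks(df):
    """Return (dtype, shape) for each internal block of df.

    Assumption: relies on the private `_mgr` attribute (`_data` in
    older pandas). This is not public API.
    """
    mgr = getattr(df, "_mgr", None)
    if mgr is None:
        mgr = df._data  # fallback for older pandas versions
    return [(str(blk.dtype), blk.shape) for blk in mgr.blocks]


df = pd.DataFrame({
    "a": np.arange(3, dtype=np.float64),
    "b": np.arange(3, dtype=np.int64),
    "c": np.ones(3, dtype=np.float64),
})

block_info = describe_blocks(df)
for dtype, shape in block_info:
    # Block shape is (columns in this block, rows)
    print(dtype, shape)
```

How many blocks you see per dtype depends on whether pandas consolidated the frame, which varies with the version and the constructor arguments.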

BlockManager can keep that layout without copying memory ONLY if a single ndarray matrix (a single dtype) is provided. Note that BlockManager (in theory) also supports not copying mixed-type data at construction, since it may not be necessary to copy the input into same-typed chunks. However, the DataFrame constructor skips the copy ONLY when a single matrix is passed as the data parameter.

In short, if you pass mixed types or multiple arrays to the constructor, or provide a dict with a single array, you are out of luck in Pandas: DataFrame's default BlockManager will copy your data.
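A quick way to check what your own pandas version does is to ask NumPy whether the frame's columns still point at the input buffers. This is only a sketch: `shares_buffer` is a hypothetical helper, and the dict case is printed rather than assumed, since its result can differ between pandas releases.

```python
import numpy as np
import pandas as pd


def shares_buffer(df, arr):
    """True if any column of df is backed by the same memory as arr."""
    return any(np.shares_memory(df[col].to_numpy(), arr) for col in df.columns)


floats = np.arange(6, dtype=np.float64).reshape(3, 2)
ints = np.arange(3, dtype=np.int64)

# A single 2-D array of one dtype: the case that avoids the copy.
single = pd.DataFrame(floats, columns=["a", "b"], copy=False)
# Same input, but with an explicit copy requested.
copied = pd.DataFrame(floats, columns=["a", "b"], copy=True)
# Mixed dtypes passed as a dict: behavior is version dependent.
mixed = pd.DataFrame({"a": floats[:, 0], "i": ints}, copy=False)

print("single 2-D array, copy=False:", shares_buffer(single, floats))
print("single 2-D array, copy=True: ", shares_buffer(copied, floats))
print("dict of mixed arrays, copy=False:",
      shares_buffer(mixed, floats) or shares_buffer(mixed, ints))
```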

In any case, one way to work around this is to force BlockManager to not consolidate-by-type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

    import pandas as pd
    from pandas.core.internals import BlockManager

    class BlockManagerUnconsolidated(BlockManager):
        """BlockManager that never merges same-dtype blocks."""

        def __init__(self, *args, **kwargs):
            BlockManager.__init__(self, *args, **kwargs)
            self._is_consolidated = False
            self._known_consolidated = False

        # Turn both consolidation hooks into no-ops
        def _consolidate_inplace(self): pass
        def _consolidate(self): return self.blocks

    def df_from_arrays(arrays, columns, index):
        from pandas.core.internals import make_block

        def gen():
            _len = None
            p = 0
            for a in arrays:
                if _len is None:
                    _len = len(a)
                    assert len(index) == _len
                assert _len == len(a)
                # One block per column: a (1, n) view of the array, no copy
                yield make_block(values=a.reshape((1, _len)), placement=(p,))
                p += 1

        blocks = tuple(gen())
        mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
        return pd.DataFrame(mgr, copy=False)

It would be better if DataFrame or BlockManager had a consolidate=False option, or simply assumed that behavior whenever copy=False was specified.

To test:

    import numpy as np

    def assert_read_only(iloc):
        try:
            iloc[0] = 999  # Should be non-editable
            raise Exception("MUST BE READ ONLY (1)")
        except ValueError as e:
            assert "read-only" in str(e)

    # Placeholder path; the original does not say where the file lives
    filename = "test.mmap"

    # Original ndarray
    n = 1000
    _arr = np.arange(0, n, dtype=float)

    # Convert it to a memmap
    mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
    mm[:] = _arr[:]
    del _arr
    mm.flush()
    mm.flags['WRITEABLE'] = False  # Make immutable!

    df = df_from_arrays(
        [mm, mm, mm],
        columns=['a', 'b', 'c'],
        index=range(len(mm)))

    assert_read_only(df["a"].iloc)
    assert_read_only(df["b"].iloc)
    assert_read_only(df["c"].iloc)

It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly typed data to be kept together. Most operations in Pandas are label-row-wise or per column, which follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Feasibly they keep one index per 'block' and gain something if the index stores offsets into the block (but if that were the case, they should group by sizeof(dtype), which I don't think they do). Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.


If you add the parameter copy=False to your DataFrame constructor call, you will get the behavior you want. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Edit: Also, you want to use the underlying ndarray (rather than the pandas series).
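As a rough illustration of this suggestion, the sketch below wraps a single 2-D float memmap with copy=False and checks that the frame still points at the mapped memory. The temporary file path, the shape, and the column names are made up for the example, and this only covers the single-dtype case.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical scratch file for the memmap
path = os.path.join(tempfile.mkdtemp(), "frame.mmap")

# One 2-D array of a single dtype, stored on disk
mm = np.memmap(path, mode="w+", dtype=np.float64, shape=(1000, 3))
mm[:] = np.arange(3000, dtype=np.float64).reshape(1000, 3)
mm.flush()

# Pass the ndarray itself (not a Series) and ask pandas not to copy
df = pd.DataFrame(mm, columns=["a", "b", "c"], copy=False)

# Check whether the frame's data overlaps the memmap's buffer
backed_by_memmap = np.shares_memory(df.to_numpy(), mm)
print("backed by memmap:", backed_by_memmap)
```

If the check prints False on your setup, pandas made a copy and the frame is no longer tied to the file.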
