How to create a pivot table on extremely large dataframes in Pandas

You could do the appending with HDF5/PyTables. This keeps the data out of RAM.

Use the table format:

import pandas as pd

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
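
For concreteness, here is a minimal sketch of that loop, assuming the source data is a large CSV file (the filename 'huge.csv' and the chunksize are illustrative assumptions, not from the original):

import pandas as pd

store = pd.HDFStore('store.h5')
# stream the file in pieces so the full DataFrame never sits in memory
for chunk in pd.read_csv('huge.csv', chunksize=500000):
    # format='table' makes the node appendable and queryable
    store.append('df', chunk, format='table')
store.close()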

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query the store, to read only subsections of the DataFrame.
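
For example, a sketch of such a query, assuming a column named 'A' that was registered as a data column when appending (the column name is illustrative):

# 'A' must have been declared as a data column on append, e.g.
# store.append('df', chunk, data_columns=['A'])
subset = store.select('df', where='A > 0')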

Aside: you should also buy more RAM; it's cheap.


Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))

# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); besides, with plain +, groups that are missing from some chunks end up as NaN. Instead you can use reduce with add and fill_value=0:

from functools import reduce  # needed on Python 3; a builtin on Python 2

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you must import reduce from functools (as shown above).

Perhaps it's more Pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

If performance is poor, or if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by first getting the unique group keys, e.g. by looping through the chunks), and then add in place.
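
A rough sketch of that preallocation idea, assuming a single grouping column 'key' and a value column 'val' (both names are illustrative, not from the original):

import pandas as pd

# first pass: collect the unique group keys across all chunks
keys = set()
for df in store.select('df', chunksize=50000):
    keys.update(df['key'].unique())

# preallocate the result as zeros, then accumulate in place
res = pd.Series(0.0, index=sorted(keys))
for df in store.select('df', chunksize=50000):
    partial = df.groupby('key')['val'].sum()
    res.loc[partial.index] += partial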