How to create a pivot table on extremely large dataframes in Pandas

You could do the appending with HDF5/PyTables. This keeps the data out of RAM.

Use the table format:

import pandas as pd

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
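
For concreteness, here is a minimal sketch of that loop, assuming the source data is a large CSV file (the filename 'huge.csv' and the chunksize are illustrative assumptions, not from the original):

import pandas as pd

store = pd.HDFStore('store.h5')
# stream the file in pieces so the full DataFrame never sits in memory
for chunk in pd.read_csv('huge.csv', chunksize=500000):
    # format='table' makes the node appendable and queryable
    store.append('df', chunk, format='table')
store.close()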

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query the store, to read only subsections of the DataFrame.
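
For example, a sketch of such a query, assuming a column named 'A' that was registered as a data column when appending (the column name is illustrative):

# 'A' must have been declared as a data column on append, e.g.
# store.append('df', chunk, data_columns=['A'])
subset = store.select('df', where='A > 0')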

Aside: you should also buy more RAM; it's cheap.


Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))

# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); besides, with plain +, groups that are missing from some chunks end up as NaN. Instead you can use reduce with add and fill_value=0:

from functools import reduce  # needed on Python 3; a builtin on Python 2

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))

In Python 3 you must import reduce from functools (as shown above).

Perhaps it's more Pythonic/readable to write this as:

chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

If performance is poor, or if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by first getting the unique group keys, e.g. by looping through the chunks), and then add in place.
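
A rough sketch of that preallocation idea, assuming a single grouping column 'key' and a value column 'val' (both names are illustrative, not from the original):

import pandas as pd

# first pass: collect the unique group keys across all chunks
keys = set()
for df in store.select('df', chunksize=50000):
    keys.update(df['key'].unique())

# preallocate the result as zeros, then accumulate in place
res = pd.Series(0.0, index=sorted(keys))
for df in store.select('df', chunksize=50000):
    partial = df.groupby('key')['val'].sum()
    res.loc[partial.index] += partial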