How to iterate over consecutive chunks of Pandas dataframe efficiently

python pandas parallel-processing ipython

Use numpy's array_split():

    import numpy as np
    import pandas as pd

    data = pd.DataFrame(np.random.rand(10, 3))
    for chunk in np.array_split(data, 5):
        # len(data) / 5 is only a whole number when the length divides evenly
        assert len(chunk) == len(data) / 5, "This assert fails if the data length isn't divisible by 5"
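
If you want to see how array_split() sizes the pieces when the length doesn't divide evenly, here is a small sketch. It splits the row positions rather than the frame itself and then selects with .iloc; that detour is my own assumption to keep the example independent of how a given pandas version handles a DataFrame passed to NumPy. The names df, positions and chunks are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(11, 3))

# Split the row positions 0..10 into 4 nearly equal groups
positions = np.array_split(np.arange(len(df)), 4)

# Select each group of rows by position
chunks = [df.iloc[idx] for idx in positions]

sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [3, 3, 3, 2]
```

The larger pieces come first, so with array_split() it is not only the last chunk that can differ in size.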


I'm not sure if this is exactly what you want, but I found these grouper functions on another SO thread fairly useful for working with a multiprocessing pool.

Here's a short example from that thread, which might do something like what you want:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

    def chunker(seq, size):
        # Generate consecutive slices of at most `size` rows
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))

    for i in chunker(df, 5):
        print(i)

Which gives you something like this:

              a         b         c         d
    0  0.860574  0.059326  0.339192  0.786399
    1  0.029196  0.395613  0.524240  0.380265
    2  0.235759  0.164282  0.350042  0.877004
    3  0.545394  0.881960  0.994079  0.721279
    4  0.584504  0.648308  0.655147  0.511390
              a         b         c         d
    5  0.276160  0.982803  0.451825  0.845363
    6  0.728453  0.246870  0.515770  0.343479
    7  0.971947  0.278430  0.006910  0.888512
    8  0.044888  0.875791  0.842361  0.890675
    9  0.200563  0.246080  0.333202  0.574488
               a         b         c         d
    10  0.971125  0.106790  0.274001  0.960579
    11  0.722224  0.575325  0.465267  0.258976
    12  0.574039  0.258625  0.469209  0.886768
    13  0.915423  0.713076  0.073338  0.622967

I hope that helps.

EDIT

In this case, I used this function with a pool of worker processes in (approximately) this manner:

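    from multiprocessing import Pool

    nprocs = 4
    pool = Pool(nprocs)
    for chunk in chunker(df, nprocs):
        # myfunction and domorestuff are placeholders for your own code.
        # Iterating over a DataFrame yields its column labels, so map over
        # the rows explicitly (here as plain tuples).
        data = pool.map(myfunction, chunk.itertuples(index=False, name=None))
        data.domorestuff()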

I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
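
For something you can paste and run, here is a rough sketch of the same idea that hands each whole chunk to a worker. I've swapped in multiprocessing.pool.ThreadPool, which has the same map() interface as Pool but uses threads, so the snippet runs without an `if __name__ == "__main__":` guard. The chunk_total function and the sample data are made up for illustration; they are not from the original thread.

```python
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

def chunker(seq, size):
    # Generate consecutive slices of at most `size` rows
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def chunk_total(chunk):
    # Hypothetical per-chunk work: add up every value in the chunk
    return float(chunk.to_numpy().sum())

# 14 rows x 4 columns holding the values 0..55
df = pd.DataFrame(np.arange(56, dtype=float).reshape(14, 4), columns=list("abcd"))

with ThreadPool(4) as pool:
    # One task per chunk; results come back in chunk order
    totals = pool.map(chunk_total, chunker(df, 5))

print(len(totals))  # 3 chunks: 5 + 5 + 4 rows
```

With a process-based Pool the call looks the same, but the worker function has to be importable by the child processes and the pool should be created under the usual main guard.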


In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case the only equal-sized chunks possible are of size 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

    >>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
    >>> df[0] = range(15)
    >>> df
        0         1         2         3         4
    0   0  0.746300  0.346277  0.220362  0.172680
    0   1  0.657324  0.687169  0.384196  0.214118
    0   2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  13  0.510273  0.051608  0.230402  0.756921
    0  14  0.950544  0.576539  0.642602  0.907850

    [15 rows x 5 columns]

where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:

    >>> df.groupby(np.arange(len(df))//10)
    <pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
    >>> for k,g in df.groupby(np.arange(len(df))//10):
    ...     print(k,g)
    ...
    0    0         1         2         3         4
    0  0  0.746300  0.346277  0.220362  0.172680
    0  1  0.657324  0.687169  0.384196  0.214118
    0  2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  8  0.241049  0.246149  0.241935  0.563428
    0  9  0.493819  0.918858  0.193236  0.266257

    [10 rows x 5 columns]
    1     0         1         2         3         4
    0  10  0.037693  0.370789  0.369117  0.401041
    0  11  0.721843  0.862295  0.671733  0.605006
    [...]
    0  14  0.950544  0.576539  0.642602  0.907850

    [5 rows x 5 columns]
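
If it isn't obvious why the integer division works, this short sketch prints nothing fancy, it just builds the key array and checks the group sizes. The variable names size, keys and group_sizes are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
size = 10

# Row position // size gives 0 for rows 0-9 and 1 for rows 10-14
keys = np.arange(len(df)) // size

# groupby accepts an array of keys with one entry per row
group_sizes = df.groupby(keys).size().tolist()
print(group_sizes)  # [10, 5]
```

Because the keys come from row positions, the uninformative index doesn't matter here.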

Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
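
A minimal sketch of the .iloc variant, assuming you just want consecutive positional slices of a fixed size. The helper name iloc_chunker is made up for this example.

```python
import numpy as np
import pandas as pd

def iloc_chunker(frame, size):
    """Yield consecutive positional slices of at most `size` rows."""
    for start in range(0, len(frame), size):
        yield frame.iloc[start:start + size]

# Same deliberately uninformative index as above
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)

chunk_lengths = [len(chunk) for chunk in iloc_chunker(df, 10)]
print(chunk_lengths)  # [10, 5]
```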
