How to iterate over consecutive chunks of Pandas dataframe efficiently



Use numpy's array_split():

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 3))

for chunk in np.array_split(data, 5):
    assert len(chunk) == len(data) / 5, "This assert may fail for the last chunk if data length isn't divisible by 5"
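
Note that array_split accepts either a number of chunks or a list of split positions. If you want chunks of a fixed size rather than a fixed number of chunks, you can pass the split indices instead; this is just a sketch with an arbitrary chunk size of 4:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 3))
chunk_size = 4
# split positions 4 and 8 give chunks of 4, 4 and 2 rows
for chunk in np.array_split(data, range(chunk_size, len(data), chunk_size)):
    print(len(chunk))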


I'm not sure if this is exactly what you want, but I found these grouper functions in another SO thread fairly useful when working with a multiprocessing pool.

Here's a short example from that thread, which might do something like what you want:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    # yield successive slices of length `size`; the last one may be shorter
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    print(i)

Which gives you something like this:

          a         b         c         d
0  0.860574  0.059326  0.339192  0.786399
1  0.029196  0.395613  0.524240  0.380265
2  0.235759  0.164282  0.350042  0.877004
3  0.545394  0.881960  0.994079  0.721279
4  0.584504  0.648308  0.655147  0.511390
          a         b         c         d
5  0.276160  0.982803  0.451825  0.845363
6  0.728453  0.246870  0.515770  0.343479
7  0.971947  0.278430  0.006910  0.888512
8  0.044888  0.875791  0.842361  0.890675
9  0.200563  0.246080  0.333202  0.574488
           a         b         c         d
10  0.971125  0.106790  0.274001  0.960579
11  0.722224  0.575325  0.465267  0.258976
12  0.574039  0.258625  0.469209  0.886768
13  0.915423  0.713076  0.073338  0.622967

I hope that helps.

EDIT

In this case, I used this function with a pool of processors in (approximately) this manner:

from multiprocessing import Pool

nprocs = 4
pool = Pool(nprocs)

for chunk in chunker(df, nprocs):
    data = pool.map(myfunction, chunk)
    data.domorestuff()

I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
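
For reference, here's a minimal, self-contained sketch of the same idea; process_chunk is a hypothetical stand-in for whatever per-chunk work you actually need:

from multiprocessing import Pool

import numpy as np
import pandas as pd

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def process_chunk(chunk):
    # hypothetical work: sum each column of the chunk
    return chunk.sum()

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])
    with Pool(4) as pool:
        results = pool.map(process_chunk, chunker(df, 5))
    print(pd.concat(results, axis=1).T)  # one row of column sums per chunk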


In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case the only equal-sized chunks would have size 1 or N. Because of this, real-world chunking typically uses a fixed chunk size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

>>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
>>> df[0] = range(15)
>>> df
    0         1         2         3         4
0   0  0.746300  0.346277  0.220362  0.172680
0   1  0.657324  0.687169  0.384196  0.214118
0   2  0.016062  0.858784  0.236364  0.963389
[...]
0  13  0.510273  0.051608  0.230402  0.756921
0  14  0.950544  0.576539  0.642602  0.907850

[15 rows x 5 columns]

where I've deliberately made the index uninformative by setting it to all zeros, we simply decide on a chunk size (here 10) and integer-divide an index array by it:

>>> df.groupby(np.arange(len(df)) // 10)
<pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
>>> for k, g in df.groupby(np.arange(len(df)) // 10):
...     print(k, g)
...
0     0         1         2         3         4
0  0  0.746300  0.346277  0.220362  0.172680
0  1  0.657324  0.687169  0.384196  0.214118
0  2  0.016062  0.858784  0.236364  0.963389
[...]
0  8  0.241049  0.246149  0.241935  0.563428
0  9  0.493819  0.918858  0.193236  0.266257

[10 rows x 5 columns]
1      0         1         2         3         4
0  10  0.037693  0.370789  0.369117  0.401041
0  11  0.721843  0.862295  0.671733  0.605006
[...]
0  14  0.950544  0.576539  0.642602  0.907850

[5 rows x 5 columns]
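
If you'd rather collect the chunks into a list of DataFrames than iterate over the groups, the same expression works; this is just a sketch reusing the chunk size of 10 from above:

chunks = [g for _, g in df.groupby(np.arange(len(df)) // 10)]
# every chunk has at most 10 rows; only the last one can be shorter
assert all(len(c) <= 10 for c in chunks)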

Methods based on slicing the DataFrame can fail when the index isn't compatible with that kind of slicing, although you can always use .iloc[a:b] to ignore the index values and access data by position.
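
For completeness, here's a minimal sketch of a position-based chunker built on .iloc; it sidesteps the index entirely (iloc_chunker and the chunk size of 10 are just illustrative choices, not part of the answers above):

def iloc_chunker(frame, chunk_size):
    # .iloc slices by position, so duplicate or unsorted index values don't matter
    for start in range(0, len(frame), chunk_size):
        yield frame.iloc[start:start + chunk_size]

for chunk in iloc_chunker(df, 10):
    print(len(chunk))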