How to iterate over consecutive chunks of Pandas dataframe efficiently
Use numpy's array_split():
    import numpy as np
    import pandas as pd

    data = pd.DataFrame(np.random.rand(10, 3))
    for chunk in np.array_split(data, 5):
        assert len(chunk) == len(data) / 5, "This assert fails if the data length isn't divisible by 5"
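To see what happens when the row count is not a multiple of the number of chunks, here is a minimal sketch. It assumes a 13-row frame (my choice, not from the example above) and splits the row positions rather than the frame itself, so it does not depend on how NumPy handles a DataFrame argument; the names position_blocks and sizes are only illustrative.

```python
import numpy as np
import pandas as pd

# 13 rows cannot be divided evenly into 5 chunks.
data = pd.DataFrame(np.random.rand(13, 3))

# Split the row positions into 5 blocks, then select each block by position.
position_blocks = np.array_split(np.arange(len(data)), 5)
chunks = [data.iloc[block] for block in position_blocks]

# array_split gives the extra rows to the leading chunks.
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [3, 3, 3, 2, 2]
```

So the chunks differ in length by at most one row, and it is the first chunks that are longer, not the last one.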
I'm not sure if this is exactly what you want, but I found these grouper functions on another SO thread fairly useful for feeding a multiprocessing pool.
Here's a short example from that thread, which might do something like what you want:
    import numpy as np
    import pandas as pds

    df = pds.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

    def chunker(seq, size):
        # Yield consecutive slices of at most `size` rows (range/print updated for Python 3)
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))

    for i in chunker(df, 5):
        print(i)
Which gives you something like this:
              a         b         c         d
    0  0.860574  0.059326  0.339192  0.786399
    1  0.029196  0.395613  0.524240  0.380265
    2  0.235759  0.164282  0.350042  0.877004
    3  0.545394  0.881960  0.994079  0.721279
    4  0.584504  0.648308  0.655147  0.511390
              a         b         c         d
    5  0.276160  0.982803  0.451825  0.845363
    6  0.728453  0.246870  0.515770  0.343479
    7  0.971947  0.278430  0.006910  0.888512
    8  0.044888  0.875791  0.842361  0.890675
    9  0.200563  0.246080  0.333202  0.574488
               a         b         c         d
    10  0.971125  0.106790  0.274001  0.960579
    11  0.722224  0.575325  0.465267  0.258976
    12  0.574039  0.258625  0.469209  0.886768
    13  0.915423  0.713076  0.073338  0.622967
I hope that helps.
EDIT
In this case, I used this function with a pool of processes in (approximately) this manner:
    from multiprocessing import Pool

    nprocs = 4
    pool = Pool(nprocs)
    for chunk in chunker(df, nprocs):
        data = pool.map(myfunction, chunk)
        data.domorestuff()
I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
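For anyone who wants something they can paste and run, here is a rough sketch of the same idea that hands whole chunks to the workers. It is not the code I actually used: column_sums is a made-up stand-in for your own per-chunk function, and I'm assuming ThreadPool from multiprocessing.pool (same map interface as Pool) so the snippet runs without a main guard.

```python
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def chunker(seq, size):
    """Yield consecutive positional slices of seq with at most size rows each."""
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))


def column_sums(chunk):
    """Hypothetical per-chunk work: sum every column of the chunk."""
    return chunk.sum()


df = pd.DataFrame(np.random.rand(14, 4), columns=["a", "b", "c", "d"])

# Each worker receives one whole chunk (a DataFrame), not individual rows.
with ThreadPool(4) as pool:
    partial_sums = pool.map(column_sums, chunker(df, 5))

# Combine the per-chunk results into one total per column.
total = pd.concat(partial_sums, axis=1).sum(axis=1)
print(total)
```

For CPU-bound work you would probably swap ThreadPool for multiprocessing.Pool, in which case the pool code should sit under an `if __name__ == "__main__":` guard.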
In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case you could only get equal-sized chunks with a chunk size of 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:
    >>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
    >>> df[0] = range(15)
    >>> df
        0         1         2         3         4
    0   0  0.746300  0.346277  0.220362  0.172680
    0   1  0.657324  0.687169  0.384196  0.214118
    0   2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  13  0.510273  0.051608  0.230402  0.756921
    0  14  0.950544  0.576539  0.642602  0.907850

    [15 rows x 5 columns]
where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:
    >>> df.groupby(np.arange(len(df))//10)
    <pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
    >>> for k,g in df.groupby(np.arange(len(df))//10):
    ...     print(k,g)
    ...
    0    0         1         2         3         4
    0  0  0.746300  0.346277  0.220362  0.172680
    0  1  0.657324  0.687169  0.384196  0.214118
    0  2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  8  0.241049  0.246149  0.241935  0.563428
    0  9  0.493819  0.918858  0.193236  0.266257

    [10 rows x 5 columns]
    1     0         1         2         3         4
    0  10  0.037693  0.370789  0.369117  0.401041
    0  11  0.721843  0.862295  0.671733  0.605006
    [...]
    0  14  0.950544  0.576539  0.642602  0.907850

    [5 rows x 5 columns]
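If you want to check the grouping without reading through the printed frames, a small self-contained sketch like this one may help. The variable names (chunk_size, group_ids, sizes) are mine, and the data is random, so only the chunk lengths are meaningful.

```python
import numpy as np
import pandas as pd

# Same shape as above: 15 rows, with an index that carries no information.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)

chunk_size = 10
# Row positions 0..14 integer-divided by the chunk size: ten 0s, then five 1s.
group_ids = np.arange(len(df)) // chunk_size

# Collect the number of rows in each group, keyed by group id.
sizes = {key: len(group) for key, group in df.groupby(group_ids)}
print(sizes)  # {0: 10, 1: 5}
```

The last group is simply whatever is left over, which is the "smaller chunk at the end" behaviour described earlier.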
Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
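As a quick illustration of label-based versus positional access, here is a sketch on a frame whose index is all zeros. The names by_label, by_position and chunks are just for this example.

```python
import numpy as np
import pandas as pd

# Every row carries the same index label, 0.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)

by_label = df.loc[0]         # label lookup: matches all 15 rows
by_position = df.iloc[0:10]  # positional slice: the first 10 rows only

# Fixed-size chunks by position, regardless of what the index holds.
size = 10
chunks = [df.iloc[start:start + size] for start in range(0, len(df), size)]

print(len(by_label), len(by_position), [len(c) for c in chunks])  # 15 10 [10, 5]
```

The positional version gives the same chunks no matter what the index looks like, which is why I would reach for .iloc whenever the index is not a plain 0..N-1 range.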