Memory error with large data sets for pandas.concat and numpy.append

This is essentially what you are doing. Note that it doesn't make much difference from a memory perspective whether you do the conversion to DataFrames before or after concatenating.

But you can specify dtype='float32' to effectively halve your memory usage.

    In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
    Out[45]: 400000000

    In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
    Out[46]: 800000000

    In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
    Out[47]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 100000 entries, 0 to 99999
    Columns: 1000 entries, 0 to 999
    dtypes: float64(1000)
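The same idea in pandas terms, as a minimal sketch (the random chunks here simply stand in for whatever pieces you are actually loading): downcast each chunk to float32 as soon as it is created, then do a single concatenate at the end.

    import numpy as np
    import pandas as pd

    # Downcast each chunk to float32 right away, then concatenate once at the end;
    # np.random.uniform returns float64, so the astype call halves each chunk.
    chunks = [
        np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000, 1000)
        for i in range(50)
    ]

    df = pd.DataFrame(np.concatenate(chunks))
    print(df.values.nbytes)  # 400000000 bytes, versus 800000000 with float64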


A straightforward way (though one that uses the hard drive) would be to simply use shelve, an on-disk dict: http://docs.python.org/2/library/shelve.html
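A minimal sketch of that approach; the 'chunks.db' filename, the chunk shape, and the per-chunk processing below are illustrative, not taken from the question. Each piece is written to disk under a key instead of being held in RAM, and later read back one chunk at a time.

    import shelve
    import numpy as np

    # Write each chunk to an on-disk shelf instead of holding everything in RAM.
    db = shelve.open('chunks.db')
    for i in range(50):
        db['chunk_%d' % i] = np.random.uniform(size=(2000, 1000))
    db.close()

    # Later, re-open the shelf and process one chunk at a time, so only a single
    # chunk is ever resident in memory.
    db = shelve.open('chunks.db')
    running_total = 0.0
    for key in db:
        running_total += db[key].sum()
    db.close()
    print(running_total)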


As suggested by usethedeathstar, Boud and Jeff in the comments, switching to 64-bit Python does the trick.
If losing precision is not an issue, using the float32 data type, as suggested by Jeff, also increases the amount of data that can be processed in a 32-bit environment.
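For reference, a quick check (general Python, not pandas-specific) of whether the running interpreter is a 64-bit build:

    import struct
    import sys

    # Prints 64 on a 64-bit interpreter, 32 on a 32-bit one.
    print(struct.calcsize('P') * 8)

    # Equivalent check: True only on a 64-bit build.
    print(sys.maxsize > 2**32)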