
Memory error when using pandas read_csv


Windows memory limitation

Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to work with by default.
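If you are not sure which build you are running, the standard library can tell you (this check is not pandas-specific):

import platform
import sys

# A 64-bit interpreter has a pointer size larger than 32 bits.
print(platform.architecture()[0])                    # '32bit' or '64bit'
print("64-bit" if sys.maxsize > 2**32 else "32-bit")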

Tricks for lowering memory usage

If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading CSV files, there is a trick.

The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your CSV data.

How this works

By default, pandas will try to guess what dtypes your CSV file has. This is a very heavy operation, because while it is determining the dtype, it has to keep all the raw data as objects (strings) in memory.

Example

Let's say your CSV looks like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01

This example is of course no problem to read into memory, but it's just an example.

If pandas were to read the above CSV file without any dtype option, the ages would be stored as strings in memory until pandas has read enough lines of the file to make a qualified guess.

I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
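A quick way to see what pandas actually guessed, and what that guess costs, is to inspect the dataframe after reading (the file name here is made up; df.dtypes and df.memory_usage are standard pandas):

import pandas as pd

df = pd.read_csv('people.csv', skipinitialspace=True)   # hypothetical file with the columns above
print(df.dtypes)                                         # the dtype pandas inferred per column
print(df.memory_usage(deep=True))                        # bytes per column, counting string contents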

Solution

Specifying dtype={'age': int} as an option to .read_csv() will let pandas know that age should be interpreted as a number. This saves you lots of memory.
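A minimal sketch of that call, assuming the example above is saved as people.csv (the name is made up; skipinitialspace only deals with the spaces after the commas in the example header):

import pandas as pd

# Tell pandas up front that age is an integer column, so it never has to
# keep the raw strings around while guessing.
df = pd.read_csv('people.csv', dtype={'age': int}, skipinitialspace=True)
print(df.dtypes)   # age is now int64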

Problem with corrupt data

However, if your CSV file were corrupted, like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz

Then specifying dtype={'age': int} will break the .read_csv() call, because it cannot cast "40+" to an integer. So sanitize your data carefully!
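If you cannot clean the file beforehand, one workaround (not from the original answer, just a common pattern) is to read the problem column as strings and convert it afterwards with pd.to_numeric, which turns unparseable values into NaN instead of raising:

import pandas as pd

df = pd.read_csv('people.csv', dtype={'age': str}, skipinitialspace=True)   # same hypothetical file
df['age'] = pd.to_numeric(df['age'], errors='coerce')                       # '40+' becomes NaN
# the column ends up as float64, because NaN needs a float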

Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:

Try it yourself

import resource

import numpy as np   # pd.np was removed in newer pandas; use numpy directly
import pandas as pd

# Floats kept as strings: every cell is a Python string object
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # peak RSS of the process, in KB on Linux
# 224544 (~224 MB)

# The same values as native 64-bit floats (measured in a fresh interpreter)
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
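ru_maxrss measures the whole process, so as an extra check (not part of the original answer) pandas can report the dataframes' own footprints with memory_usage(deep=True):

import numpy as np
import pandas as pd

strings = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
floats = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))

# deep=True counts the Python string objects themselves, not just the pointers to them
print(strings.memory_usage(deep=True).sum())   # tens of MB
print(floats.memory_usage(deep=True).sum())    # ~8 MB: 100000 rows * 10 columns * 8 bytes, plus a tiny index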


I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved it:

df = pd.read_csv(myfile, sep='\t')                     # didn't work, memory error
df = pd.read_csv(myfile, sep='\t', low_memory=False)   # worked fine, in less than 30 seconds

Spyder 3.2.3
Python 2.7.13 (64-bit)


I tried chunksize while reading a big CSV file:

reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)

read_csv now returns an iterator of dataframe chunks instead of a single dataframe. We can iterate over the reader and write/append each chunk to a new CSV, or perform any other operation on it:

for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
        print("Chunk appended to the file")
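If the goal is a single filtered dataframe rather than a new file, the same loop works with an in-memory reduction per chunk and a final concat (a sketch; the path and the 'age' filter are placeholders, pd.concat is standard pandas):

import pandas as pd

filePath = 'big_file.csv'   # placeholder path
pieces = []
for chunk in pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0):
    # keep only the rows needed from each chunk, so the full file never sits in memory at once
    pieces.append(chunk[chunk['age'] > 30])
df = pd.concat(pieces, ignore_index=True)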