
Memory error when using pandas read_csv


Windows memory limitation

Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to work with by default.
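If you are not sure which build you are running, the standard library can tell you (this check is not pandas-specific):

import platform
import sys

# A 64-bit interpreter has a pointer size larger than 32 bits.
print(platform.architecture()[0])                    # '32bit' or '64bit'
print("64-bit" if sys.maxsize > 2**32 else "32-bit")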

Tricks for lowering memory usage

If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading CSV files, there is a trick.

The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your CSV data.

How this works

By default, pandas will try to guess what dtypes your CSV file has. This is a very heavy operation, because while it is determining the dtype, it has to keep all the raw data as objects (strings) in memory.

Example

Let's say your CSV looks like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01

This example is of course no problem to read into memory, but it's just an example.

If pandas were to read the above CSV file without any dtype option, the ages would be stored as strings in memory until pandas has read enough lines of the file to make a qualified guess.

I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
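A quick way to see what pandas actually guessed, and what that guess costs, is to inspect the dataframe after reading (the file name here is made up; df.dtypes and df.memory_usage are standard pandas):

import pandas as pd

df = pd.read_csv('people.csv', skipinitialspace=True)   # hypothetical file with the columns above
print(df.dtypes)                                         # the dtype pandas inferred per column
print(df.memory_usage(deep=True))                        # bytes per column, counting string contents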

Solution

Specifying dtype={'age': int} as an option to .read_csv() will let pandas know that age should be interpreted as a number. This saves you lots of memory.
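A minimal sketch of that call, assuming the example above is saved as people.csv (the name is made up; skipinitialspace only deals with the spaces after the commas in the example header):

import pandas as pd

# Tell pandas up front that age is an integer column, so it never has to
# keep the raw strings around while guessing.
df = pd.read_csv('people.csv', dtype={'age': int}, skipinitialspace=True)
print(df.dtypes)   # age is now int64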

Problem with corrupt data

However, if your CSV file were corrupted, like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz

Then specifying dtype={'age': int} will break the .read_csv() call, because it cannot cast "40+" to an integer. So sanitize your data carefully!
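If you cannot clean the file beforehand, one workaround (not from the original answer, just a common pattern) is to read the problem column as strings and convert it afterwards with pd.to_numeric, which turns unparseable values into NaN instead of raising:

import pandas as pd

df = pd.read_csv('people.csv', dtype={'age': str}, skipinitialspace=True)   # same hypothetical file
df['age'] = pd.to_numeric(df['age'], errors='coerce')                       # '40+' becomes NaN
# the column ends up as float64, because NaN needs a float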

Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:

Try it yourself

import resource

import numpy as np   # pd.np was removed in newer pandas; use numpy directly
import pandas as pd

# Floats kept as strings: every cell is a Python string object
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # peak RSS of the process, in KB on Linux
# 224544 (~224 MB)

# The same values as native 64-bit floats (measured in a fresh interpreter)
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
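ru_maxrss measures the whole process, so as an extra check (not part of the original answer) pandas can report the dataframes' own footprints with memory_usage(deep=True):

import numpy as np
import pandas as pd

strings = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
floats = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))

# deep=True counts the Python string objects themselves, not just the pointers to them
print(strings.memory_usage(deep=True).sum())   # tens of MB
print(floats.memory_usage(deep=True).sum())    # ~8 MB: 100000 rows * 10 columns * 8 bytes, plus a tiny index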


I had the same memory problem with a simple read of a tab-delimited text file around 1 GB in size (over 5.5 million records), and this solved it:

df = pd.read_csv(myfile, sep='\t')                     # didn't work, memory error
df = pd.read_csv(myfile, sep='\t', low_memory=False)   # worked fine, in less than 30 seconds

Spyder 3.2.3
Python 2.7.13 (64-bit)


I tried chunksize while reading a big CSV file:

reader = pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0)

read_csv now returns an iterator of dataframe chunks instead of a single dataframe. We can iterate over the reader and write/append each chunk to a new CSV, or perform any other operation on it:

for chunk in reader:
    print(chunk.columns)
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
        print("Chunk appended to the file")
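If the goal is a single filtered dataframe rather than a new file, the same loop works with an in-memory reduction per chunk and a final concat (a sketch; the path and the 'age' filter are placeholders, pd.concat is standard pandas):

import pandas as pd

filePath = 'big_file.csv'   # placeholder path
pieces = []
for chunk in pd.read_csv(filePath, chunksize=1000000, low_memory=False, header=0):
    # keep only the rows needed from each chunk, so the full file never sits in memory at once
    pieces.append(chunk[chunk['age'] > 30])
df = pd.concat(pieces, ignore_index=True)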