Python out of memory on large CSV file (numpy)

As other folks have mentioned, for a really large file, you're better off iterating.

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if the file has a header.)
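As a minimal sketch of that call (the file name example_small.csv and its contents are made up for illustration), this writes a tiny file with a header line and loads it with delimiter and skiprows. The second call assumes single precision is acceptable for your data; passing dtype=np.float32 halves the size of the resulting array.

```python
import numpy as np

# Hypothetical sample file: one header line followed by two data rows.
with open("example_small.csv", "w") as f:
    f.write("a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n")

# delimiter="," splits each line on commas; skiprows=1 skips the header.
data = np.loadtxt("example_small.csv", delimiter=",", skiprows=1)

# Assumption: float32 precision is good enough for the data at hand.
data32 = np.loadtxt("example_small.csv", delimiter=",", skiprows=1,
                    dtype=np.float32)

print(data.shape, data.nbytes, data32.nbytes)
```

Note that this only shrinks the final array; the peak memory used while parsing can still be noticeably larger than the array itself.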

As a quick comparison, loading a ~500MB text file with loadtxt uses ~900MB of RAM at peak usage, while loading the same file with genfromtxt uses ~2.5GB.

[Figure: memory and CPU usage of numpy.loadtxt while loading a ~500MB ASCII file]

[Figure: memory and CPU usage of numpy.genfromtxt while loading a ~500MB ASCII file]

Alternately, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking. If your data is very simple and regular, you can improve on them greatly.)

    import numpy as np

    def generate_text_file(length=int(1e6), ncols=20):
        # length must be an int: a float in the shape tuple raises TypeError
        data = np.random.random((length, ncols))
        np.savetxt('large_text_file.csv', data, delimiter=',')

    def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
        def iter_func():
            with open(filename, 'r') as infile:
                for _ in range(skiprows):
                    next(infile)
                for line in infile:
                    line = line.rstrip().split(delimiter)
                    for item in line:
                        yield dtype(item)
            # Remember the column count so the flat array can be reshaped
            iter_loadtxt.rowlength = len(line)

        data = np.fromiter(iter_func(), dtype=dtype)
        data = data.reshape((-1, iter_loadtxt.rowlength))
        return data

    #generate_text_file()
    data = iter_loadtxt('large_text_file.csv')

[Figure: using fromiter to load the same ~500MB data file]


The problem with using genfromtxt() is that it attempts to load the whole file into memory, i.e. into a numpy array. This is great for small files but bad for 3GB inputs like yours. Since you are just calculating column medians, there's no need to hold the whole file in memory at once. A simple, though not the most efficient, way to do it would be to read the file line by line once per column, so that only one column's values are in memory at a time.
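A rough sketch of that idea follows. The function name column_medians is hypothetical, and it assumes a purely numeric, comma-delimited file with no blank lines and the same number of fields on every row. It re-reads the file once per column, so it trades speed for memory.

```python
import numpy as np

def column_medians(filename, delimiter=",", skiprows=0):
    """Median of each column, holding one column in memory at a time."""
    # Count the columns from the first data line.
    with open(filename, "r") as infile:
        for _ in range(skiprows):
            next(infile)
        ncols = len(next(infile).rstrip().split(delimiter))

    medians = []
    for col in range(ncols):
        def column_values():
            # Re-read the file, yielding only the values of this column.
            with open(filename, "r") as infile:
                for _ in range(skiprows):
                    next(infile)
                for line in infile:
                    yield float(line.rstrip().split(delimiter)[col])

        # The generator is consumed here, before col changes.
        values = np.fromiter(column_values(), dtype=float)
        medians.append(np.median(values))
    return np.array(medians)
```

With the question's file this would be called as column_medians('All.csv'), adding skiprows=1 if the file starts with a header line.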


Why are you not using the Python csv module?

    >>> import csv
    >>> reader = csv.reader(open('All.csv'))
    >>> for row in reader:
    ...     print(row)
