
How do I read a large csv file with pandas?


The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)


pandas >= 1.2

read_csv with chunksize returns a context manager, to be used like so:

chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)

See GH38225


Chunking shouldn't always be the first port of call for this problem.

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the pd.read_csv usecols parameter (see the first sketch after this list).

  2. Does your workflow require slicing, manipulating, exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

  3. If all else fails, read line by line via chunks.

    Chunk via pandas (as shown above) or via the csv library as a last resort (see the csv sketch after this list).
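
A minimal sketch of point 1, assuming a hypothetical file large.csv with columns id, state and value, where state is a repeated low-cardinality string (the filename and column names are placeholders for your own data):

import pandas as pd

# Keep only the needed columns and store the repeated string column as a category.
df = pd.read_csv(
    'large.csv',
    usecols=['id', 'state', 'value'],
    dtype={'state': 'category'},
)
print(df.memory_usage(deep=True))  # compare against a plain read_csv to see the savings

And for point 3, a rough generator that chunks with the standard-library csv module instead of pandas; the path, chunk size and the process_rows callable are again placeholders:

import csv

def csv_chunks(path, chunksize=10 ** 6):
    # Yield (header, rows) pairs, at most chunksize rows at a time.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunksize:
                yield header, chunk
                chunk = []
        if chunk:
            yield header, chunk

for header, rows in csv_chunks('large.csv'):
    process_rows(header, rows)  # process_rows stands in for your own processing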


For large data I recommend you use the library dask, e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')

You can read more in the dask documentation.
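
A short sketch of the lazy workflow, assuming local files matching 2018-*.csv with hypothetical month and value columns:

import dask.dataframe as dd

# Lazily point at many CSVs; nothing is read until compute() is called.
df = dd.read_csv('2018-*.csv')

# Slice and aggregate with the familiar pandas-style API
# ('month' and 'value' are placeholder column names).
monthly_mean = df[df['value'] > 0].groupby('month')['value'].mean()
print(monthly_mean.compute())  # runs the whole pipeline chunk by chunk

# Export iteratively, one partition per output file.
df.to_csv('out-*.csv', index=False)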

Another great alternative is to use modin, because its functionality is nearly identical to pandas yet it leverages distributed dataframe libraries such as dask under the hood.
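
A minimal sketch, assuming modin is installed together with one of its engines (e.g. Ray or Dask); the filename is a placeholder:

# Swap the import and keep the rest of the pandas code unchanged.
import modin.pandas as pd

df = pd.read_csv('large.csv')  # 'large.csv' stands in for your own file
print(df.head())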