How do I read a large csv file with pandas?
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the `chunksize` parameter):

```python
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
```

The `chunksize` parameter specifies the number of rows per chunk. (The last chunk may contain fewer than `chunksize` rows, of course.)
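As a concrete sketch of the pattern above (the data and the per-chunk aggregation are hypothetical; a small in-memory CSV stands in for a large file on disk), you can build a result incrementally, e.g. summing a column without ever holding the whole file in memory:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

chunksize = 3  # tiny chunks for illustration; use something like 10 ** 6 in practice
total = 0
for chunk in pd.read_csv(csv_data, chunksize=chunksize):
    # Each chunk is an ordinary DataFrame, so normal pandas operations apply
    total += chunk["value"].sum()

print(total)  # sum of 0..9 -> 45
```

Any reduction that combines across chunks (sums, counts, group-by partials) fits this loop; operations that need the full dataset at once do not.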
pandas >= 1.2
Passing `chunksize` to `read_csv` returns a context manager, to be used like so:

```python
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
```
Chunking shouldn't always be the first port of call for this problem.
Is the file large due to repeated non-numeric data or unwanted columns?
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations, and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
If all else fails, read line by line via chunks.
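To illustrate the first point above (column names, dtypes, and data here are hypothetical), loading only the columns you need with `usecols` and storing repeated strings as a `category` dtype can shrink memory substantially before chunking is even considered:

```python
import io
import pandas as pd

# Hypothetical CSV with a highly repeated string column and an unwanted column
csv_data = "city,temp,notes\n" + "\n".join(f"Paris,{i},blah" for i in range(1000))

# Drop unwanted columns at parse time and encode repeated strings as categories
df = pd.read_csv(
    io.StringIO(csv_data),
    usecols=["city", "temp"],      # skip the 'notes' column entirely
    dtype={"city": "category"},    # repeated strings -> small integer codes
)

# Compare against a naive full load of the same data
full = pd.read_csv(io.StringIO(csv_data))
print(df.memory_usage(deep=True).sum() < full.memory_usage(deep=True).sum())
```

`df.memory_usage(deep=True)` is a convenient way to check whether such changes actually help for your data.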
For large data I recommend you use the library "dask":
```python
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
```
You can read more from the documentation here.
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.