Large, persistent DataFrame in pandas
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pdtp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv
on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap
), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000
) then concatenate then with pd.concat
. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize
parameter (even with quite small values like 10000), but it didn't help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pdimport sqlite3from pandas.io import sqlimport subprocess# In and output file pathsin_csv = '../data/my_large.csv'out_sqlite = '../data/my.sqlite'table_name = 'my_table' # name for the SQLite database tablechunksize = 100000 # number of lines to process at each iteration# columns that should be read from the CSV filecolumns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']# Get number of lines in the CSV filenlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)nlines = int(nlines.split()[0]) # connect to databasecnx = sqlite3.connect(out_sqlite)# Iteratively read CSV and dump lines into the SQLite tablefor i in range(0, nlines, chunksize): df = pd.read_csv(in_csv, header=None, # no header, define column header manually later nrows=chunksize, # number of rows to read at each iteration skiprows=i) # skip rows that were already read # columns to read df.columns = columns sql.to_sql(df, name=table_name, con=cnx, index=False, # don't use CSV file index index_label='molecule_id', # use a unique column from DataFrame as index if_exists='append') cnx.close()