Large, persistent DataFrame in pandas

Wes is of course right! I'm just chiming in to provide slightly more complete example code. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd

tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)  # gives TextFileReader, which is iterable with chunks of 1000 rows
df = pd.concat(tp, ignore_index=True)  # df is a DataFrame; if this errors, pass `list(tp)` instead of `tp`


In principle it shouldn't run out of memory, but read_csv currently has memory problems on large files caused by some complex Python internal issues (this is vague, but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file (np.memmap), as sketched below), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
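
For illustration, here is a rough sketch of that row-by-row transcription idea; the file name, column count, and float dtype are assumptions for the example, and the standard csv module is used for the streaming pass:

import csv
import numpy as np

in_csv = 'large_dataset.csv'  # hypothetical file of purely numeric columns
n_cols = 5                    # assumed number of columns

# First pass: count the rows so the array can be pre-allocated up front
with open(in_csv) as f:
    n_rows = sum(1 for _ in f)

# Pre-allocate; swap np.empty for np.memmap(...) to keep the data on disk instead
arr = np.empty((n_rows, n_cols), dtype=np.float64)

# Second pass: transcribe the file row by row without one big in-memory slurp
with open(in_csv) as f:
    for i, row in enumerate(csv.reader(f)):
        arr[i] = [float(x) for x in row]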


This is an older thread, but I just wanted to dump my workaround here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still ran into memory problems (my CSV was ~7.5 GB).

Right now, I just read chunks of the CSV file in a for loop and append them step by step to, e.g., an SQLite database:

import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess

# Input and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'

table_name = 'my_table'  # name for the SQLite database table
chunksize = 100000       # number of lines to process at each iteration

# columns that should be read from the CSV file
columns = ['molecule_id', 'charge', 'db', 'drugsnow', 'hba', 'hbd', 'loc', 'nrb', 'smiles']

# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])

# connect to database
cnx = sqlite3.connect(out_sqlite)

# Iteratively read the CSV and dump the lines into the SQLite table
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,
            header=None,      # no header, define column headers manually later
            nrows=chunksize,  # number of rows to read at each iteration
            skiprows=i)       # skip rows that were already read

    # columns to read
    df.columns = columns

    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,                # don't use the CSV file's index
               index_label='molecule_id',  # use a unique column from the DataFrame as index
               if_exists='append')         # append rows to the existing table

cnx.close()
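
Once the data sits in SQLite, you can query just the slice you need instead of re-reading the whole CSV. A minimal sketch using the table and column names assumed above (the WHERE clause is only an example):

import pandas as pd
import sqlite3

cnx = sqlite3.connect('../data/my.sqlite')

# Pull back only the rows/columns of interest rather than the full table
df_subset = pd.read_sql_query(
    'SELECT molecule_id, smiles FROM my_table WHERE hbd > 5', cnx)

cnx.close()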