Large, persistent DataFrame in pandas

python pandas sas

Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:

import pandas as pdtp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)  # gives TextFileReader, which is iterable with chunks of 1000 rows.df = pd.concat(tp, ignore_index=True)  # df is DataFrame. If errors, do `list(tp)` instead of `tp`

python pandas sas

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate then with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.

python pandas sas

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).

Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:

import pandas as pdimport sqlite3from pandas.io import sqlimport subprocess# In and output file pathsin_csv = '../data/my_large.csv'out_sqlite = '../data/my.sqlite'table_name = 'my_table' # name for the SQLite database tablechunksize = 100000 # number of lines to process at each iteration# columns that should be read from the CSV filecolumns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']# Get number of lines in the CSV filenlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)nlines = int(nlines.split()[0]) # connect to databasecnx = sqlite3.connect(out_sqlite)# Iteratively read CSV and dump lines into the SQLite tablefor i in range(0, nlines, chunksize):    df = pd.read_csv(in_csv,              header=None,  # no header, define column header manually later            nrows=chunksize, # number of rows to read at each iteration            skiprows=i)   # skip rows that were already read    # columns to read            df.columns = columns    sql.to_sql(df,                 name=table_name,                 con=cnx,                 index=False, # don't use CSV file index                index_label='molecule_id', # use a unique column from DataFrame as index                if_exists='append') cnx.close()

CodeHunter

Large, persistent DataFrame in pandas

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last