pandas data frame - select rows and clear memory?


You are much better off doing something like this:

Pass usecols to read_csv to sub-select which columns you want in the first place, see here.

Then read the file in chunks, see here; as each chunk is read, select the rows that you want and shunt them off to a list, finally concatenating the result.

Pseudo-code ish:

reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)
df = pd.concat([chunk.iloc[rows_that_I_want_] for chunk in reader])

This will have a constant memory usage (the size of a chunk), plus the memory for the selected rows x 2, which is needed while the concat is running; after the concat the usage drops back down to just the selected rows.
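For instance, here is a minimal concrete version of the chunked filter, assuming the rows you want can be identified by a condition on a column; the column names 'score' and 'label' and the 0.5 threshold are made up for illustration:

import pandas as pd

# Read in chunks and keep only rows matching a boolean condition;
# only one chunk plus the kept rows are ever in memory at once.
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     usecols=['score', 'label'], chunksize=10000)
df = pd.concat(chunk[chunk['score'] > 0.5] for chunk in reader)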


I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also the whole file into memory, or some duplication because of the use of different types, so that is the 6 GB used.

You could make a generator that loads the contents of the file line by line. I assume the data is row-based, with one record per row and per line in big_table.txt, so:

import pandas

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):   # check if you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record

gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))

Note the list(gen): pandas 0.12 and earlier versions don't accept generators, so you have to convert to a list, which puts all the data yielded by the generator into memory. 0.13 will do the same thing internally. You also need twice the memory of the data you keep: once to load the data and once to put it into pandas' NDFrame structure.

You could also make the generator read from a compressed file; with Python 3.3, the gzip library only decompresses the needed chunks.
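A minimal sketch of that variant, reusing the hypothetical is_needed_row() and cut_columns() helpers from above; gzip.open in text mode streams the file, so decompression happens lazily as you iterate:

import gzip
import pandas

def big_table_gz_generator(filename):
    # gzip decompresses lazily as you iterate over lines, so only the
    # blocks needed for the current line are inflated into memory
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):      # hypothetical row filter
                yield cut_columns(line)  # hypothetical column selector

gen = big_table_gz_generator('big_table.txt.gz')
df = pandas.DataFrame.from_records(list(gen))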