pandas data frame - select rows and clear memory?


You are much better off doing something like this:

Pass usecols to read_csv to sub-select which columns you want in the first place, see here.

Then read the file in chunks, see here; as each chunk is read, select the rows that you want and shunt them off to a list, finally concatenating the result.

Pseudo-code ish:

reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     index_col=0, usecols=the_columns_i_want_to_use,
                     chunksize=10000)
df = pd.concat([chunk.iloc[rows_that_I_want_] for chunk in reader])

This will have a constant memory usage (the size of a chunk), plus the memory for the selected rows x 2, which is needed while the concat is running; after the concat the usage drops back down to just the selected rows.
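For instance, here is a minimal concrete version of the chunked filter, assuming the rows you want can be identified by a condition on a column; the column names 'score' and 'label' and the 0.5 threshold are made up for illustration:

import pandas as pd

# Read in chunks and keep only rows matching a boolean condition;
# only one chunk plus the kept rows are ever in memory at once.
reader = pd.read_csv('big_table.txt', sep='\t', header=0,
                     usecols=['score', 'label'], chunksize=10000)
df = pd.concat(chunk[chunk['score'] > 0.5] for chunk in reader)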


I've had a similar problem; I solved it by filtering the data before loading. When you read the file with read_table you load the whole thing into a DataFrame, and possibly also the whole file into memory, or some duplication because of the use of different types, so that is the 6 GB used.

You could make a generator that loads the contents of the file line by line. I assume the data is row-based, with one record per row and per line in big_table.txt, so:

import pandas

def big_table_generator(filename):
    with open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):   # check if you want this row
                # cut_columns() returns a list with only the selected columns
                record = cut_columns(line)
                yield record

gen = big_table_generator('big_table.txt')
df = pandas.DataFrame.from_records(list(gen))

Note the list(gen): pandas 0.12 and earlier versions don't accept generators, so you have to convert to a list, which puts all the data yielded by the generator into memory. 0.13 will do the same thing internally. You also need twice the memory of the data you keep: once to load the data and once to put it into pandas' NDFrame structure.

You could also make the generator read from a compressed file; with Python 3.3, the gzip library only decompresses the needed chunks.
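A minimal sketch of that variant, reusing the hypothetical is_needed_row() and cut_columns() helpers from above; gzip.open in text mode streams the file, so decompression happens lazily as you iterate:

import gzip
import pandas

def big_table_gz_generator(filename):
    # gzip decompresses lazily as you iterate over lines, so only the
    # blocks needed for the current line are inflated into memory
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if is_needed_row(line):      # hypothetical row filter
                yield cut_columns(line)  # hypothetical column selector

gen = big_table_gz_generator('big_table.txt.gz')
df = pandas.DataFrame.from_records(list(gen))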