Memory usage accumulated in while loop


Your DataFrame isn't too big in itself; the problem is that the memory load doubles while you're executing this line:

A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

That's because you're assigning a new value to A, so memory is allocated for a new DataFrame while the old one is still being filtered. The new one is almost the same size as the old one because you're selecting almost all of the data points. That eats up enough memory for two copies of A, and that's before counting the extra memory for the bookkeeping that the loc implementation does.

Apparently loc causes pandas to allocate enough memory for an additional copy of the data. I'm not sure why this is; I presume it's a performance optimization of some sort. This means that, at peak, you consume four times the size of the DataFrame. Once loc is done and the no-longer-referenced memory is freed—which you can force by calling gc.collect()—the memory load drops back down to double the size of the DataFrame. On the next call to loc everything is doubled again and you're back up to a quadruple load; garbage-collect and you're back down to double. This cycle repeats for as long as the loop runs.
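
To put numbers on this, you can check the DataFrame's own footprint with memory_usage (a standard pandas method); the peak during loc is roughly four times that figure:

import numpy as np
import pandas as pd

A = pd.DataFrame({'score': np.random.random(32346018),
                  'coordinate': np.arange(1, 32346019)})
# Two 8-byte columns times ~32 million rows is about 0.5 GB,
# so peak usage during loc approaches 2 GB.
print('%.2f GB' % (A.memory_usage(deep=True).sum() / 1e9))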

To verify what's going on, run this modified version of your code:

import numpy as np
import pandas as pd
import gc

A = pd.DataFrame({'score': np.random.random(32346018),
                  'coordinate': np.arange(1, 32346019)})
peak_centers = []
peak_scores = []
exclusion = 147
count = 0
while A.shape[0] > 0:
    gc.collect()  # Force garbage collection.
    count += 1    # Increment the iteration count.
    print('iteration %d, shape %s' % (count, A.shape))
    input()       # Wait for the user to press Enter (raw_input on Python 2).
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    print(len(peak_centers), len(peak_scores))
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) |
              (A.coordinate >= one_center + exclusion)]

Press the Enter key between iterations and keep an eye on memory usage with top or a similar tool.

At the start of the first iteration, you'll see memory usage of x percent. On the second iteration, after loc has been called for the first time, memory usage doubles to 2x. Subsequently, you'll see it go up to 4x during each call to loc, then drop back to 2x after garbage collection.
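
If you'd rather log the numbers from inside the script than eyeball top, a small helper along these lines works (a sketch assuming the third-party psutil package is installed):

import os
import psutil

_process = psutil.Process(os.getpid())

def rss_gb():
    # Resident set size of this process, in gigabytes.
    return _process.memory_info().rss / 1e9

# Call rss_gb() before and after the loc line (and after gc.collect())
# to watch the 2x/4x pattern described above.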


Use DataFrame.drop with inplace=True if you want to destructively mutate A without copying a large subset of A's data.

places_to_drop = ~(A.coordinate - one_center).between(-exclusion, exclusion)
A.drop(A.index[np.where(places_to_drop)], inplace=True)
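
Putting it together, the loop body would look something like this (a sketch; the names follow the original code):

import numpy as np
import pandas as pd

A = pd.DataFrame({'score': np.random.random(32346018),
                  'coordinate': np.arange(1, 32346019)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # Drop the excluded rows in place instead of rebinding A,
    # so no second full-size copy of the DataFrame is allocated.
    places_to_drop = ~(A.coordinate - one_center).between(-exclusion, exclusion)
    A.drop(A.index[np.where(places_to_drop)], inplace=True)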

The place where the original usage of loc ultimately bottoms out is in the _NDFrameIndexer method _getitem_iterable. _LocIndexer is a child class of _NDFrameIndexer and an instance of _LocIndexer gets created and populates the loc property of a DataFrame.

In particular, _getitem_iterable performs a check for a Boolean index, which occurs in your case. Then a new array of the Boolean locations is created (which is wasteful of memory when key is already in Boolean format):

inds, = key.nonzero()

and then finally the "true" locations are returned within a copy:

return self.obj.take(inds, axis=axis, convert=False)

In this code, key is your Boolean index (i.e. the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)) and self.obj is the DataFrame instance on which loc was invoked, so obj here is just A.
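
You can reproduce the indexer's two steps in miniature (a sketch; mask plays the role of key):

import numpy as np
import pandas as pd

A = pd.DataFrame({'score': [0.2, 0.9, 0.5],
                  'coordinate': [1, 2, 3]})
mask = (A.coordinate <= 1) | (A.coordinate >= 3)

# Step 1: the Boolean mask is converted to integer positions,
# allocating a fresh array even though the mask already encodes them.
inds, = mask.values.nonzero()
print(inds)  # [0 2]

# Step 2: take copies those rows into a brand-new DataFrame.
print(A.take(inds))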

The DataFrame.take documentation explains that the default behavior is to make a copy. In the current implementation of the indexers, there is no mechanism allowing you to pass a keyword argument through that will eventually be used to perform take without making a copy.
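
A quick way to convince yourself that take copies rather than returning a view (np.shares_memory reports whether two arrays share any underlying buffer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5)})
taken = df.take([0, 1, 2])
# The result's buffer is distinct from the original's,
# so take allocated new memory for the selected rows.
print(np.shares_memory(df['x'].values, taken['x'].values))  # False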

On any reasonably modern machine, the drop approach should be problem-free for the size of data you describe, so A's size itself is not to blame.