
Fast pandas filtering


Try using isin (thanks to DSM for suggesting loc over ix here):

In [78]: x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]], columns=['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]:
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541

CT Zhu suggests a faster alternative using np.in1d:

In [105]: y = pd.concat([x] * 1000)

In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
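
Note that np.in1d has since been deprecated in favour of np.isin (available from NumPy 1.13), so on a current NumPy the same trick would be written as follows (a minimal sketch, reusing y and names from above):

import numpy as np

# np.isin is the modern replacement for the deprecated np.in1d
y.loc[np.isin(y['name'], names), 'score'].sum()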


If I need to search repeatedly on a field, I have noticed that it helps immensely to set that field as the index of the DataFrame. For one of my search-and-lookup workloads I saw a performance improvement of around 500%.

So in your case the following could be used to search and filter by name.

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']

df_searchable = df.set_index('name')
df_searchable[df_searchable.index.isin(names)]
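
With 'name' as the index, a plain label-based lookup via .loc is another option (a minimal sketch; unlike .index.isin, .loc raises a KeyError if any requested name is missing from the index):

# Direct label lookup on the index; assumes every name in the list is present
df_searchable.loc[names, 'score'].sum()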


If your data repeats a lot of values, try using the 'categorical' data type for that column and then applying boolean filtering. It is much more flexible than using indices and, at least in my case, much faster.

data = pd.read_csv('data.csv', dtype={'name': 'category'})
data[(data.name == 'sam') & (data.score > 1)]

or

names = ['sam', 'ruby']
data[data.name.isin(names)]
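
If the DataFrame is already loaded, the column can be converted after the fact with astype (a minimal sketch, assuming the same data frame as above):

# Convert an existing object column to categorical in place
data['name'] = data['name'].astype('category')
data[data.name.isin(names)]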

For the ~15 million row, ~200k unique terms dataset I'm working with in pandas 1.2, %timeit results are:

  • boolean filter on object column: 608 ms
  • .loc filter on same object column as index: 281 ms
  • boolean filter on same object column as 'categorical' type: 16 ms

From there, add the .sum() or whatever aggregation function you're looking for.
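
A rough way to reproduce this kind of comparison on synthetic data (the sizes, term names, and random seed here are made up for illustration, not taken from the original benchmark):

import numpy as np
import pandas as pd

# Synthetic stand-in: ~15M rows drawn from ~200k unique terms
rng = np.random.default_rng(0)
terms = np.array(['term_%d' % i for i in range(200_000)])
data = pd.DataFrame({'name': rng.choice(terms, size=15_000_000),
                     'score': rng.integers(0, 100, size=15_000_000)})

# 1) boolean filter on the plain object column
%timeit data[data.name == 'term_0']

# 2) .loc lookup with the column as the index
indexed = data.set_index('name')
%timeit indexed.loc['term_0']

# 3) boolean filter after converting the column to categorical
data['name'] = data['name'].astype('category')
%timeit data[data.name == 'term_0']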