Efficient way to apply multiple filters to pandas DataFrame or Series Efficient way to apply multiple filters to pandas DataFrame or Series python python

Efficient way to apply multiple filters to pandas DataFrame or Series


Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']Out[11]: 1    12    2Name: col1In [12]: df[df['col1'] >= 1]Out[12]:    col1  col21     1    112     2    12In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]Out[13]:    col1  col21     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n):              return op(x[col],n)In [15]: def f(x, *b):             return x[(np.logical_and(*b))]In [16]: b1 = b(df, 'col1', ge, 1)In [17]: b2 = b(df, 'col1', le, 1)In [18]: f(df, b1, b2)Out[18]:    col1  col21     1    11

Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')Out[21]:   col1  col21     1    11


Chaining conditions creates long lines, which are discouraged by pep8.Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is

import numpy as npimport functoolsdef conjunction(*conditions):    return functools.reduce(np.logical_and, conditions)c_1 = data.col1 == Truec_2 = data.col2 < 64c_3 = data.col3 != 4data_filtered = data[conjunction(c1,c2,c3)]

np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.

Note that this still has some redundancies: a) shortcutting does not happen on a global level b) Each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications and it is very readable.

You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

import numpy as npimport functoolsdef disjunction(*conditions):    return functools.reduce(np.logical_or, conditions)c_1 = data.col1 == Truec_2 = data.col2 < 64c_3 = data.col3 != 4data_filtered = data[disjunction(c_1,c_2,c_3)]


Simplest of All Solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]