How to filter Pandas dataframe using 'in' and 'not in' like in SQL
You can use pd.Series.isin
.
For "IN" use: something.isin(somewhere)
Or for "NOT IN": ~something.isin(somewhere)
As a worked example:
import pandas as pd>>> df country0 US1 UK2 Germany3 China>>> countries_to_keep['UK', 'China']>>> df.country.isin(countries_to_keep)0 False1 True2 False3 TrueName: country, dtype: bool>>> df[df.country.isin(countries_to_keep)] country1 UK3 China>>> df[~df.country.isin(countries_to_keep)] country0 US2 Germany
Alternative solution that uses .query() method:
In [5]: df.query("countries in @countries_to_keep")Out[5]: countries1 UK3 ChinaIn [6]: df.query("countries not in @countries_to_keep")Out[6]: countries0 US2 Germany
How to implement 'in' and 'not in' for a pandas DataFrame?
Pandas offers two methods: Series.isin
and DataFrame.isin
for Series and DataFrames, respectively.
Filter DataFrame Based on ONE Column (also applies to Series)
The most common scenario is applying an isin
condition on a specific column to filter rows in a DataFrame.
df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})df countries0 US1 UK2 Germany3 Chinac1 = ['UK', 'China'] # listc2 = {'Germany'} # setc3 = pd.Series(['China', 'US']) # Seriesc4 = np.array(['US', 'UK']) # array
Series.isin
accepts various types as inputs. The following are all valid ways of getting what you want:
df['countries'].isin(c1)0 False1 True2 False3 False4 TrueName: countries, dtype: bool# `in` operationdf[df['countries'].isin(c1)] countries1 UK4 China# `not in` operationdf[~df['countries'].isin(c1)] countries0 US2 Germany3 NaN
# Filter with `set` (tuples work too)df[df['countries'].isin(c2)] countries2 Germany
# Filter with another Seriesdf[df['countries'].isin(c3)] countries0 US4 China
# Filter with arraydf[df['countries'].isin(c4)] countries0 US1 UK
Filter on MANY Columns
Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,
df2 = pd.DataFrame({ 'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})df2 A B C0 x w 01 y a 12 z NaN 23 q x 3c1 = ['x', 'w', 'p']
To apply the isin
condition to both columns "A" and "B", use DataFrame.isin
:
df2[['A', 'B']].isin(c1) A B0 True True1 False False2 False False3 False True
From this, to retain rows where at least one column is True
, we can use any
along the first axis:
df2[['A', 'B']].isin(c1).any(axis=1)0 True1 False2 False3 Truedtype: booldf2[df2[['A', 'B']].isin(c1).any(axis=1)] A B C0 x w 03 q x 3
Note that if you want to search every column, you'd just omit the column selection step and do
df2.isin(c1).any(axis=1)
Similarly, to retain rows where ALL columns are True
, use all
in the same manner as before.
df2[df2[['A', 'B']].isin(c1).all(axis=1)] A B C0 x w 0
Notable Mentions: numpy.isin
, query
, list comprehensions (string data)
In addition to the methods described above, you can also use the numpy equivalent: numpy.isin
.
# `in` operationdf[np.isin(df['countries'], c1)] countries1 UK4 China# `not in` operationdf[np.isin(df['countries'], c1, invert=True)] countries0 US2 Germany3 NaN
Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin
.
Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here..We resort to an in
check now.
c1_set = set(c1) # Using `in` with `sets` is a constant time operation... # This doesn't matter for pandas because the implementation differs.# `in` operationdf[[x in c1_set for x in df['countries']]] countries1 UK4 China# `not in` operationdf[[x not in c1_set for x in df['countries']]] countries0 US2 Germany3 NaN
It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.
Lastly, there's also DataFrame.query
which has been covered in this answer. numexpr FTW!