How to filter Pandas dataframe using 'in' and 'not in' like in SQL

You can use pd.Series.isin.

For "IN" use: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

As a worked example:

import pandas as pd>>> df  country0        US1        UK2   Germany3     China>>> countries_to_keep['UK', 'China']>>> df.country.isin(countries_to_keep)0    False1     True2    False3     TrueName: country, dtype: bool>>> df[df.country.isin(countries_to_keep)]  country1        UK3     China>>> df[~df.country.isin(countries_to_keep)]  country0        US2   Germany

python pandas dataframe sql-function

Alternative solution that uses .query() method:

In [5]: df.query("countries in @countries_to_keep")Out[5]:  countries1        UK3     ChinaIn [6]: df.query("countries not in @countries_to_keep")Out[6]:  countries0        US2   Germany

python pandas dataframe sql-function

How to implement 'in' and 'not in' for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.

Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})df  countries0        US1        UK2   Germany3     Chinac1 = ['UK', 'China']             # listc2 = {'Germany'}                 # setc3 = pd.Series(['China', 'US'])  # Seriesc4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:

df['countries'].isin(c1)0    False1     True2    False3    False4     TrueName: countries, dtype: bool# `in` operationdf[df['countries'].isin(c1)]  countries1        UK4     China# `not in` operationdf[~df['countries'].isin(c1)]  countries0        US2   Germany3       NaN

# Filter with `set` (tuples work too)df[df['countries'].isin(c2)]  countries2   Germany

# Filter with another Seriesdf[df['countries'].isin(c3)]  countries0        US4     China

# Filter with arraydf[df['countries'].isin(c4)]  countries0        US1        UK

Filter on MANY Columns

Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,

df2 = pd.DataFrame({    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})df2   A    B  C0  x    w  01  y    a  12  z  NaN  23  q    x  3c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

df2[['A', 'B']].isin(c1)      A      B0   True   True1  False  False2  False  False3  False   True

From this, to retain rows where at least one column is True, we can use any along the first axis:

df2[['A', 'B']].isin(c1).any(axis=1)0     True1    False2    False3     Truedtype: booldf2[df2[['A', 'B']].isin(c1).any(axis=1)]   A  B  C0  x  w  03  q  x  3

Note that if you want to search every column, you'd just omit the column selection step and do

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]   A  B  C0  x  w  0

Notable Mentions: `numpy.isin`, `query`, list comprehensions (string data)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

# `in` operationdf[np.isin(df['countries'], c1)]  countries1        UK4     China# `not in` operationdf[np.isin(df['countries'], c1, invert=True)]  countries0        US2   Germany3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here..We resort to an in check now.

c1_set = set(c1) # Using `in` with `sets` is a constant time operation...                  # This doesn't matter for pandas because the implementation differs.# `in` operationdf[[x in c1_set for x in df['countries']]]  countries1        UK4     China# `not in` operationdf[[x not in c1_set for x in df['countries']]]  countries0        US2   Germany3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.

Lastly, there's also DataFrame.query which has been covered in this answer. numexpr FTW!

CodeHunter

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

How to implement 'in' and 'not in' for a pandas DataFrame?

Filter DataFrame Based on ONE Column (also applies to Series)

Filter on MANY Columns

Notable Mentions: `numpy.isin`, `query`, list comprehensions (string data)

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

How to implement 'in' and 'not in' for a pandas DataFrame?

Filter DataFrame Based on ONE Column (also applies to Series)

Filter on MANY Columns

Notable Mentions: numpy.isin, query, list comprehensions (string data)

Recent Posts

Notable Mentions: `numpy.isin`, `query`, list comprehensions (string data)