How to delete rows from a pandas DataFrame based on a conditional expression [duplicate]


To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem, but could help other users who come across this question), one way is to use the drop method:

df = df.drop(some labels)
df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column 'score' is < 50:

df = df.drop(df[df.score < 50].index)

In place version (as pointed out in comments)

df.drop(df[df.score < 50].index, inplace=True)
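As a quick check, here is a minimal sketch of both forms on a tiny frame. The sample data and the variable names (scores, kept, in_place) are made up for illustration; only the 'score' column and the drop calls come from the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the 'score' column name is from the answer.
scores = pd.DataFrame({"name": ["ann", "bob", "cid", "dee"],
                       "score": [35, 50, 72, 49]})

# Reassignment form: drop returns a new DataFrame and leaves `scores` untouched.
kept = scores.drop(scores[scores.score < 50].index)

# In-place form: works on a copy here so the two results can be compared.
in_place = scores.copy()
in_place.drop(in_place[in_place.score < 50].index, inplace=True)

print(kept)      # rows with index labels 1 and 2 (scores 50 and 72)
print(in_place)  # same rows
```

Note that the surviving rows keep their original index labels (1 and 2 here); call reset_index(drop=True) afterwards if you want a fresh 0..n-1 index.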

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. Each condition must be grouped using parentheses, because these operators bind more tightly than comparisons such as < and >.

To remove all rows where column 'score' is greater than 20 and less than 50:

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
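To see how the three operators combine, here is a small sketch. The data and the names between, outside and the dropped_* results are illustrative assumptions, not part of the answer.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"score": [10, 20, 35, 49, 50, 80]})

# and (&): score strictly between 20 and 50
between = (df.score > 20) & (df.score < 50)
dropped_between = df.drop(df[between].index)

# or (|): score below 20 or above 50
outside = (df.score < 20) | (df.score > 50)
dropped_outside = df.drop(df[outside].index)

# not (~): everything that is NOT strictly between 20 and 50
dropped_not_between = df.drop(df[~between].index)

print(dropped_between.score.tolist())      # [10, 20, 50, 80]
print(dropped_outside.score.tolist())      # [20, 35, 49, 50]
print(dropped_not_between.score.tolist())  # [35, 49]
```

Storing each mask in a named variable first is just a readability choice; writing the conditions inline, as in the answer, behaves the same way.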


When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So, to keep only the rows whose value has fewer than 2 characters, try:

df[df['column name'].map(len) < 2]
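The difference between the two len calls is easy to see on a small example. The sample strings below are made up. The .str.len() line is an alternative that is not part of the answer; I believe it gives the same lengths for plain string columns, but treat it as a suggestion.

```python
import pandas as pd

# Hypothetical sample data; the column name matches the placeholder in the answer.
df = pd.DataFrame({"column name": ["a", "bc", "def", "g", ""]})

# len on the column itself: a single number, the row count.
total_rows = len(df["column name"])

# len applied element-wise: one length per row.
lengths = df["column name"].map(len)

# Keep only the rows whose string is shorter than 2 characters.
short_rows = df[lengths < 2]

# Alternative (not from the answer): the vectorized string accessor.
str_lengths = df["column name"].str.len()

print(total_rows)                         # 5
print(lengths.tolist())                   # [1, 2, 3, 1, 0]
print(short_rows["column name"].tolist()) # ['a', 'g', '']
```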


You can assign the DataFrame to a filtered version of itself, keeping only the rows that satisfy the condition:

df = df[df.score > 50]
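One thing to watch: filtering states which rows to keep, while drop states which rows to remove, so the condition has to be inverted when switching between them. A small sketch with made-up data (the names filtered, complement and via_drop are illustrative):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"score": [35, 50, 72, 49]})

# Keep rows with score > 50, as in the line above (a score of exactly 50 is removed too).
filtered = df[df.score > 50]

# Exact complement of "drop rows where score < 50": negate the mask with ~.
complement = df[~(df.score < 50)]

# The drop-based version from the earlier answer, for comparison.
via_drop = df.drop(df[df.score < 50].index)

print(filtered.score.tolist())    # [72]
print(complement.score.tolist())  # [50, 72]
print(via_drop.score.tolist())    # [50, 72]
```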

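This is faster than drop. In the timings below, on a DataFrame with one million rows, filtering took about 55 ms versus roughly 200 ms for either drop variant: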

import numpy as np
import pandas as pd

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
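%%timeit only works inside IPython/Jupyter. If you want to reproduce the comparison from a plain script, something like the sketch below should work using the standard timeit module. The function names, the smaller row count and the repeat count are my own assumptions, and the numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: fewer rows than the 1e6 used above, to keep the run short.
N_ROWS = 100_000
N_CALLS = 5


def make_frame():
    """Build a fresh random frame, as each timed cell above does."""
    return pd.DataFrame({"x": np.random.randn(N_ROWS)})


def filter_by_mask():
    test = make_frame()
    return test[test.x < 0]


def drop_by_index():
    test = make_frame()
    return test.drop(test[test.x > 0].index)


for func in (filter_by_mask, drop_by_index):
    # Average wall-clock time per call, including frame construction.
    seconds = timeit.timeit(func, number=N_CALLS) / N_CALLS
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```

As in the original cells, building the frame is included in the timed work for both variants, so the comparison stays like for like.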