Selecting with complex criteria from pandas.DataFrame Selecting with complex criteria from pandas.DataFrame python python

Selecting with complex criteria from pandas.DataFrame


Sure! Setup:

>>> import pandas as pd>>> from random import randint>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],                   'B': [randint(1, 9)*10 for x in range(10)],                   'C': [randint(1, 9)*100 for x in range(10)]})>>> df   A   B    C0  9  40  3001  9  70  7002  5  70  9003  8  80  9004  7  50  2005  9  30  9006  2  80  7007  2  80  4008  5  80  3009  7  70  800

We can apply column operations and get boolean Series objects:

>>> df["B"] > 500    False1     True2     True3     True4    False5    False6     True7     True8     True9     TrueName: B>>> (df["B"] > 50) & (df["C"] == 900)0    False1    False2     True3     True4    False5    False6    False7    False8    False9    False

[Update, to switch to new-style .loc]:

And then we can use these to index into the object. For read access, you can chain indices:

>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]2    53    8Name: A, dtype: int64

but you can get yourself into trouble because of the difference between a view and a copy doing this for write access. You can use .loc instead:

>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]2    53    8Name: A, dtype: int64>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].valuesarray([5, 8], dtype=int64)>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000>>> df      A   B    C0     9  40  3001     9  70  7002  5000  70  9003  8000  80  9004     7  50  2005     9  30  9006     2  80  7007     2  80  4008     5  80  3009     7  70  800

Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)


Another solution is to use the query method:

import pandas as pdfrom random import randintdf = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],                   'B': [randint(1, 9) * 10 for x in xrange(10)],                   'C': [randint(1, 9) * 100 for x in xrange(10)]})print df   A   B    C0  7  20  3001  7  80  7002  4  90  1003  4  30  9004  7  80  2005  7  60  8006  3  80  9007  9  40  1008  6  40  1009  3  10  600print df.query('B > 50 and C != 900')   A   B    C1  7  80  7002  4  90  1004  7  80  2005  7  60  800

Now if you want to change the returned values in column A you can save their index:

my_query_index = df.query('B > 50 & C != 900').index

....and use .iloc to change them i.e:

df.iloc[my_query_index, 0] = 5000print df      A   B    C0     7  20  3001  5000  80  7002  5000  90  1003     4  30  9004  5000  80  2005  5000  60  8006     3  80  9007     9  40  1008     6  40  1009     3  10  600


And remember to use parenthesis!

Keep in mind that & operator takes a precedence over operators such as > or < etc. That is why

4 < 5 & 6 > 4

evaluates to False. Therefore if you're using pd.loc, you need to put brackets around your logical statements, otherwise you get an error. That's why do:

df.loc[(df['A'] > 10) & (df['B'] < 15)]

instead of

df.loc[df['A'] > 10 & df['B'] < 15]

which would result in

TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]