Selecting across multiple columns with Python pandas?


I encourage you to pose these questions on the mailing list, but in any case, it's still very much a low-level affair working with the underlying NumPy arrays. For example, to select the rows where the value in any column exceeds, say, 1.5:

In [11]: df
Out[11]:
                  A        B        C        D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04  0.83935  0.15993  0.95911 -1.12959
2000-01-05  2.80215 -0.10858 -1.62114 -0.20170
2000-01-06  0.71670 -0.26707  1.36029  1.74254
2000-01-07 -0.45749  0.22750  0.46291 -0.58431
2000-01-10 -0.78702  0.44006 -0.36881 -0.13884
2000-01-11  0.79577 -0.09198  0.14119  0.02668
2000-01-12 -0.32297  0.62332  1.93595  0.78024
2000-01-13  1.74683 -1.57738 -0.02134  0.11596
2000-01-14 -0.55613  0.92145 -0.22832  1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18  0.73274  0.24387  0.88146 -0.94490
2000-01-19  0.56644 -0.49321  1.17584 -0.17585
2000-01-20  1.56441  0.62331 -0.26904  0.11952
2000-01-21  0.61834  0.17463 -1.62439  0.99103
2000-01-24  0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128  1.20401  1.08945
2000-01-26 -0.63115  0.76077 -0.92795 -2.17118
2000-01-27  1.37620 -1.10618 -0.37411  0.73780
2000-01-28 -1.40276  1.98372  1.47096 -1.38043
2000-01-31  0.54769  0.44100 -0.52775  0.84497
2000-02-01  0.12443  0.32880 -0.71361  1.31778
2000-02-02 -0.28986 -0.63931  0.88333 -2.58943
2000-02-03  0.54408  1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722  0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496  0.36352  1.11596  0.02293
2000-02-10  0.51054  0.97249  1.74501  0.20525
2000-02-11  0.10100  0.27722  0.65843  1.73591

In [12]: df[(df.values > 1.5).any(1)]
Out[12]:
                 A       B        C       D
2000-01-05  2.8021 -0.1086 -1.62114 -0.2017
2000-01-06  0.7167 -0.2671  1.36029  1.7425
2000-01-12 -0.3230  0.6233  1.93595  0.7802
2000-01-13  1.7468 -1.5774 -0.02134  0.1160
2000-01-14 -0.5561  0.9215 -0.22832  1.5663
2000-01-20  1.5644  0.6233 -0.26904  0.1195
2000-01-28 -1.4028  1.9837  1.47096 -1.3804
2000-02-10  0.5105  0.9725  1.74501  0.2052
2000-02-11  0.1010  0.2772  0.65843  1.7359
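For readers who want to try the "any column exceeds a threshold" selection on something small, here is a minimal sketch. The three-row frame and its values are made up for illustration (they are not the data shown above), and the second mask uses the DataFrame comparison directly instead of `df.values`, which should select the same rows.

```python
import pandas as pd

# Small hand-written frame (illustrative values, not the data from the answer).
df = pd.DataFrame(
    {"A": [0.2, 2.8, -0.5], "B": [1.6, -0.1, 0.3], "C": [0.0, 0.4, 1.2]},
    index=pd.date_range("2000-01-03", periods=3),
)

# Element-wise comparison gives a boolean matrix; any(axis=1) reduces each row
# to a single True/False.
mask_numpy = (df.values > 1.5).any(axis=1)  # NumPy route, as in the answer
mask_pandas = (df > 1.5).any(axis=1)        # pandas route, keeps the index

selected = df[mask_pandas]
print(selected)
```

The NumPy mask is a plain boolean array, while the pandas mask is a Series aligned on the index; either can be passed to `df[...]`.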

Multiple conditions have to be combined using & or | (and parentheses!):

In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
                  A       B        C       D
2000-01-05  2.80215 -0.1086 -1.62114 -0.2017
2000-01-13  1.74683 -1.5774 -0.02134  0.1160
2000-01-20  1.56441  0.6233 -0.26904  0.1195
2000-01-27  1.37620 -1.1062 -0.37411  0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
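The parentheses matter because Python's `&` and `|` bind more tightly than comparison operators such as `>` and `<`, so each comparison has to be wrapped before it is combined. A small sketch on a made-up four-row frame (values chosen only for illustration), also showing `~` for negation:

```python
import pandas as pd

# Illustrative data, not taken from the thread.
df = pd.DataFrame({"A": [0.5, 1.2, 2.0, -0.3], "B": [-1.5, 0.4, -2.0, 0.1]})

cond_a = df["A"] > 1
cond_b = df["B"] < -1

either = df[(df["A"] > 1) | (df["B"] < -1)]  # at least one condition holds
both = df[cond_a & cond_b]                   # both conditions hold
neither = df[~(cond_a | cond_b)]             # negation of "either"

print(either.index.tolist())   # [0, 1, 2]
print(both.index.tolist())     # [2]
print(neither.index.tolist())  # [3]
```

Naming the individual conditions first, as with `cond_a` and `cond_b`, is one way to avoid the nested parentheses.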

I'd be very interested to have some kind of query API to make these kinds of things easier.


There are at least a few approaches to shortening the syntax for this in Pandas until it gets a full query API down the road (perhaps I'll try to join the GitHub project and do this if time permits and if no one else has already started).

One method to shorten the syntax a little is given below:

inds = df.apply(lambda x: x["A"] > 10 and x["B"] < 5, axis=1)
print(df[inds].to_string())
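As a quick check of what the row-wise `apply` produces, the sketch below compares it with the column-wise mask built from `&` on a small made-up frame (the data is an assumption for illustration). The lambda runs once per row, so on large frames the column-wise form is usually the faster of the two.

```python
import pandas as pd

# Illustrative data, not taken from the thread.
df = pd.DataFrame({"A": [12, 3, 15, 8], "B": [1, 2, 9, 7]})

# Row-wise: the lambda receives one row at a time, so plain `and` works here.
row_wise = df.apply(lambda x: x["A"] > 10 and x["B"] < 5, axis=1)

# Column-wise: whole columns are compared at once and combined with &.
vectorised = (df["A"] > 10) & (df["B"] < 5)

print(row_wise.tolist())
print(vectorised.tolist())
print(df[vectorised].to_string())
```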

To fully solve this, one would need to build something like SQL's SELECT and WHERE clauses into Pandas. This is not trivial at all, but one stab that I think might work is to use Python's built-in operator module. It lets you treat things like greater-than as functions instead of symbols. So you could do the following:

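from functools import reduce  # reduce is not a built-in in Python 3

def pandas_select(dataframe, select_dict):
    # select_dict maps column -> (comparison function, value); keep rows where all hold.
    inds = dataframe.apply(
        lambda x: reduce(lambda v1, v2: v1 and v2,
                         [elem[0](x[key], elem[1])
                          for key, elem in select_dict.items()]),
        axis=1)
    return dataframe[inds]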

Then a test example like yours would be to do the following:

import operator

select_dict = {
    "A": (operator.gt, 10),
    "B": (operator.lt, 5),
}
print(pandas_select(df, select_dict).to_string())

You can shorten the syntax even further by either building in more arguments to pandas_select to handle the different common logical operators automatically, or by importing them into the namespace with shorter names.

Note that the pandas_select function above only works with logical-and chains of constraints. You'd have to modify it to get different logical behavior, or use not together with De Morgan's laws.
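One possible way to get "or" behavior is sketched below. `pandas_select_mask` and its `combine` parameter are hypothetical names introduced here, not part of Pandas or of the function above. Instead of going row by row, it builds one boolean Series per constraint and folds them together with `operator.and_` or `operator.or_`.

```python
import operator
from functools import reduce

import pandas as pd


def pandas_select_mask(dataframe, select_dict, combine=operator.and_):
    """Hypothetical variant of pandas_select.

    select_dict maps column name -> (comparison function, value).
    combine is operator.and_ (all constraints) or operator.or_ (any constraint).
    """
    masks = [op(dataframe[col], value) for col, (op, value) in select_dict.items()]
    return dataframe[reduce(combine, masks)]


# Illustrative data, not taken from the thread.
df = pd.DataFrame({"A": [12, 3, 15, 8], "B": [1, 2, 9, 7]})
constraints = {"A": (operator.gt, 10), "B": (operator.lt, 5)}

all_hold = pandas_select_mask(df, constraints)                # A > 10 and B < 5
any_holds = pandas_select_mask(df, constraints, operator.or_)  # A > 10 or B < 5

print(all_hold.to_string())
print(any_holds.to_string())
```

Because the dictionary is keyed by column name, this sketch (like the original) allows only one constraint per column; a list of `(column, function, value)` tuples would lift that limit.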


A query feature has been added to Pandas since this question was asked and answered. An example is given below.

Given this sample data frame:

import numpy as np
import pandas as pd

periods = 8
dates = pd.date_range('20170101', periods=periods)
rand_df = pd.DataFrame(np.random.randn(periods, 4), index=dates,
                       columns=list('ABCD'))

The query syntax below lets you combine multiple filters, much like the WHERE clause of a SQL SELECT statement.

rand_df.query("A < 0 or B < 0")
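To see how the query string relates to the boolean-mask syntax, here is a small sketch. The fixed seed and the `threshold` variable are assumptions added for repeatability and illustration; inside the query string, a local Python variable is referenced with an `@` prefix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so repeated runs give the same frame
dates = pd.date_range("20170101", periods=8)
rand_df = pd.DataFrame(rng.standard_normal((8, 4)), index=dates,
                       columns=list("ABCD"))

threshold = 0.0  # local variable, referenced in the query string as @threshold

by_query = rand_df.query("A < @threshold or B < @threshold")
by_mask = rand_df[(rand_df["A"] < threshold) | (rand_df["B"] < threshold)]

# Both selections should contain the same rows.
print(by_query.equals(by_mask))
```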

See the Pandas documentation for additional details.