Selecting across multiple columns with Python pandas?

python csv numpy tab-delimited pandas

I encourage you to pose these questions on the mailing list, but in any case, this is still very much a low-level affair that involves working with the underlying NumPy arrays. For example, to select the rows where the value in any column exceeds, say, 1.5:

In [11]: df
Out[11]:
                  A        B        C        D
2000-01-03 -0.59885 -0.18141 -0.68828 -0.77572
2000-01-04  0.83935  0.15993  0.95911 -1.12959
2000-01-05  2.80215 -0.10858 -1.62114 -0.20170
2000-01-06  0.71670 -0.26707  1.36029  1.74254
2000-01-07 -0.45749  0.22750  0.46291 -0.58431
2000-01-10 -0.78702  0.44006 -0.36881 -0.13884
2000-01-11  0.79577 -0.09198  0.14119  0.02668
2000-01-12 -0.32297  0.62332  1.93595  0.78024
2000-01-13  1.74683 -1.57738 -0.02134  0.11596
2000-01-14 -0.55613  0.92145 -0.22832  1.56631
2000-01-17 -0.55233 -0.28859 -1.18190 -0.80723
2000-01-18  0.73274  0.24387  0.88146 -0.94490
2000-01-19  0.56644 -0.49321  1.17584 -0.17585
2000-01-20  1.56441  0.62331 -0.26904  0.11952
2000-01-21  0.61834  0.17463 -1.62439  0.99103
2000-01-24  0.86378 -0.68111 -0.15788 -0.16670
2000-01-25 -1.12230 -0.16128  1.20401  1.08945
2000-01-26 -0.63115  0.76077 -0.92795 -2.17118
2000-01-27  1.37620 -1.10618 -0.37411  0.73780
2000-01-28 -1.40276  1.98372  1.47096 -1.38043
2000-01-31  0.54769  0.44100 -0.52775  0.84497
2000-02-01  0.12443  0.32880 -0.71361  1.31778
2000-02-02 -0.28986 -0.63931  0.88333 -2.58943
2000-02-03  0.54408  1.17928 -0.26795 -0.51681
2000-02-04 -0.07068 -1.29168 -0.59877 -1.45639
2000-02-07 -0.65483 -0.29584 -0.02722  0.31270
2000-02-08 -0.18529 -0.18701 -0.59132 -1.15239
2000-02-09 -2.28496  0.36352  1.11596  0.02293
2000-02-10  0.51054  0.97249  1.74501  0.20525
2000-02-11  0.10100  0.27722  0.65843  1.73591

In [12]: df[(df.values > 1.5).any(1)]
Out[12]:
                 A       B        C       D
2000-01-05  2.8021 -0.1086 -1.62114 -0.2017
2000-01-06  0.7167 -0.2671  1.36029  1.7425
2000-01-12 -0.3230  0.6233  1.93595  0.7802
2000-01-13  1.7468 -1.5774 -0.02134  0.1160
2000-01-14 -0.5561  0.9215 -0.22832  1.5663
2000-01-20  1.5644  0.6233 -0.26904  0.1195
2000-01-28 -1.4028  1.9837  1.47096 -1.3804
2000-02-10  0.5105  0.9725  1.74501  0.2052
2000-02-11  0.1010  0.2772  0.65843  1.7359
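The same kind of selection can be checked on a small frame with known values. This is only a minimal sketch: the frame `small` and its numbers are made up for illustration and are not the data shown above. It also shows `(small > 1.5).any(axis=1)`, which compares the DataFrame directly instead of going through `.values`; the two routes should produce the same mask.

```python
import pandas as pd

# Hypothetical data, chosen so the expected rows are easy to check by eye.
small = pd.DataFrame(
    {"A": [0.2, 2.8, -0.5], "B": [1.6, -0.1, 0.9], "C": [0.0, 0.3, 1.5]},
    index=pd.date_range("2000-01-03", periods=3),
)

# Boolean mask: True for rows where at least one column exceeds 1.5.
mask_values = (small.values > 1.5).any(axis=1)  # NumPy route, as in the answer
mask_frame = (small > 1.5).any(axis=1)          # pandas route, keeps the index

# Keep only the rows flagged by the mask.
selected = small[mask_frame]
```

Note that the comparison is strict, so the last row (whose largest value is exactly 1.5) is not selected.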

Multiple conditions have to be combined using & or | (and parentheses!):

In [13]: df[(df['A'] > 1) | (df['B'] < -1)]
Out[13]:
                  A       B        C       D
2000-01-05  2.80215 -0.1086 -1.62114 -0.2017
2000-01-13  1.74683 -1.5774 -0.02134  0.1160
2000-01-20  1.56441  0.6233 -0.26904  0.1195
2000-01-27  1.37620 -1.1062 -0.37411  0.7378
2000-02-04 -0.07068 -1.2917 -0.59877 -1.4564
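A short sketch of why the parentheses matter: in Python, `&` and `|` bind more tightly than comparisons, so `df['A'] > 1 | df['B'] < -1` is parsed as a chained comparison around `1 | df['B']` rather than as two separate conditions, and it typically raises an error instead of filtering. The frame `df_demo` below is hypothetical; it also shows `~` for negation.

```python
import pandas as pd

# Hypothetical data: one row per combination of the two conditions.
df_demo = pd.DataFrame({"A": [2.0, 0.5, 1.5, -1.0], "B": [0.0, -2.0, -1.5, 0.3]})

cond_a = df_demo["A"] > 1
cond_b = df_demo["B"] < -1

either = df_demo[cond_a | cond_b]      # OR: at least one condition holds
both = df_demo[cond_a & cond_b]        # AND: both conditions hold
neither = df_demo[~(cond_a | cond_b)]  # NOT: no condition holds
```

Naming the individual masks first, as done here, is one way to avoid the parenthesis problem entirely.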

I'd be very interested to have some kind of query API to make these kinds of things easier.


There are at least a few approaches to shortening the syntax for this in Pandas until it gets a full query API down the road (perhaps I'll try to join the GitHub project and do this if time permits and no one else has already started).

One method to shorten the syntax a little is given below:

# Row-wise boolean mask: True where A > 10 and B < 5
inds = df.apply(lambda x: x["A"] > 10 and x["B"] < 5, axis=1)
print(df[inds].to_string())

To fully solve this, one would need to build something like SQL's SELECT and WHERE clauses into Pandas. This is not trivial at all, but one stab that I think might work is to use Python's built-in operator module. This lets you treat things like greater-than as functions instead of symbols. So you could do the following:

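from functools import reduce

def pandas_select(dataframe, select_dict):
    # For each row, apply every (operator, value) pair to its column
    # and combine the results with logical and.
    inds = dataframe.apply(
        lambda x: reduce(lambda v1, v2: v1 and v2,
                         [elem[0](x[key], elem[1])
                          for key, elem in select_dict.items()]),
        axis=1)
    return dataframe[inds]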

Then a test example like yours would be to do the following:

import operator

select_dict = {
    "A": (operator.gt, 10),
    "B": (operator.lt, 5),
}
print(pandas_select(df, select_dict).to_string())

You can shorten the syntax even further by either building in more arguments to pandas_select to handle the different common logical operators automatically, or by importing them into the namespace with shorter names.

Note that the pandas_select function above only works with logical-and chains of constraints. You'd have to modify it to get different logical behavior, or use not together with De Morgan's laws.
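One possible modification, sketched under assumptions: the function name `select_rows` and its `how` parameter are hypothetical and not part of Pandas. Instead of calling the comparison once per row, it applies each operator to a whole column and combines the resulting masks with NumPy's `logical_and` or `logical_or`, so both AND and OR chains are covered. It assumes at least one condition is given.

```python
import operator

import numpy as np
import pandas as pd


def select_rows(dataframe, conditions, how="all"):
    """Filter rows by {column: (comparison_function, value)} conditions.

    how="all" keeps rows meeting every condition (logical AND);
    how="any" keeps rows meeting at least one condition (logical OR).
    """
    if how not in ("all", "any"):
        raise ValueError("how must be 'all' or 'any'")
    # Each comparison runs on a whole column at once, giving one
    # boolean array per condition.
    masks = [
        np.asarray(op(dataframe[col], value))
        for col, (op, value) in conditions.items()
    ]
    combine = np.logical_and if how == "all" else np.logical_or
    # reduce() folds the list of masks into a single row mask.
    return dataframe[combine.reduce(masks)]


# Hypothetical data for illustration.
demo = pd.DataFrame({"A": [12, 3, 15, 8], "B": [1, 2, 9, 7]})
conds = {"A": (operator.gt, 10), "B": (operator.lt, 5)}

and_rows = select_rows(demo, conds, how="all")
or_rows = select_rows(demo, conds, how="any")
```

Working column-wise avoids the per-row Python call that `apply(..., axis=1)` makes, which tends to matter on larger frames.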


A query feature has been added to Pandas since this question was asked and answered. An example is given below.

Given this sample data frame:

import numpy as np
import pandas as pd

periods = 8
dates = pd.date_range('20170101', periods=periods)
rand_df = pd.DataFrame(np.random.randn(periods, 4), index=dates, columns=list('ABCD'))

The following query syntax lets you combine multiple filters, much like a "WHERE" clause in a SQL SELECT statement.

rand_df.query("A < 0 or B < 0")
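As a hedged illustration of how a query string relates to the boolean-mask style, the sketch below builds a separate, seeded frame (the names `frame` and `threshold` are made up for this example) and checks that both forms select the same rows. It also uses the `@` prefix, which lets a query expression refer to a local Python variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
frame = pd.DataFrame(rng.standard_normal((8, 4)), columns=list("ABCD"))

threshold = 0.0  # hypothetical local variable, referenced below with "@"

# Query-string form: "or" inside the string plays the role of "|".
by_query = frame.query("A < @threshold or B < @threshold")

# Equivalent boolean-mask form.
by_mask = frame[(frame["A"] < threshold) | (frame["B"] < threshold)]

same_rows = by_query.equals(by_mask)
```

The query form avoids repeating the frame name and the extra parentheses, at the cost of putting the condition in a string.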

See the Pandas documentation for additional details.
