subsetting a Python DataFrame
I'll assume that Time
and Product
are columns in a DataFrame
, df
is an instance of DataFrame
, and that other variables are scalar values:
For now, you'll have to reference the DataFrame
instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary, because of the precedence of the &
operator vs. the comparison operators. The &
operator is actually an overloaded bitwise operator which has the same precedence as arithmetic operators which in turn have a higher precedence than comparison operators.
In pandas
0.13 a new experimental DataFrame.query()
method will be available. It's extremely similar to subset modulo the select
argument:
With query()
you'd do it like this:
df[['Time', 'Product']].query('Product == p_id and Month < mn and Year == yr')
Here's a simple example:
In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})In [10]: dfOut[10]: gender price0 m 891 f 1232 f 1003 m 1044 m 985 m 1036 f 1007 f 1098 f 959 m 87In [11]: df.query('gender == "m" and price < 100')Out[11]: gender price0 m 894 m 989 m 87
The final query that you're interested will even be able to take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == p_id and start_time <= Time < end_time')
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']
And let's say you want to include products made before 2014. You could write,
df[df['Year'] < 2014]
To return all the rows where this is the case. You can add different conditions.
df[df['Year'] < 2014][df['Color' == 'Red']
Then just choose the columns you want as directed above. For instance, the product color and key for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]