How to query MultiIndex index columns values in pandas
To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]In [537]: result_dfOut[537]: CA B 3.3 222 43 333 595.5 333 56
Hence, to get for example the 'A' index values, if still required:
In [538]: result_df.index.get_level_values('A')Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
The problem is, that in large data frames the performance of by index selection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)In [559]: len(df)Out[559]: 12857In [560]: df.sort(inplace=True)In [561]: df_without_index = df.reset_index()In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]1000 loops, best of 3: 562 µs per loopIn [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]1000 loops, best of 3: 507 µs per loop
For better readability, we can simply use the query()
Method, to avoid the lengthy df.index.get_level_values()
and reset_index
/set_index
to and fro.
Here is the target DataFrame
:
In [12]: df Out[12]: C A B 1.1 111 68 222 40 3.3 222 20 333 11 5.5 333 80 6.6 777 51
Answer for Q1 (A
in range [3.3, 6.6]
):
In [13]: df.query('3.3 <= A <= 6.6') # for closed interval Out[13]: C A B 3.3 222 20 333 11 5.5 333 80 6.6 777 51 In [14]: df.query('3.3 < A < 6.6') # for open interval Out[14]: C A B 5.5 333 80
and of course one can play around with <, <=, >, >=
for any kind of inclusion.
Similarly, answer for Q2 (A
in range [2.0, 4.0]
):
In [15]: df.query('2.0 <= A <= 4.0') Out[15]: C A B 3.3 222 20 333 11
Answer for Q3 (B
in range [111, 500]
):
In [16]: df.query('111 <= B <= 500') Out[16]: C A B 1.1 111 68 222 40 3.3 222 20 333 11 5.5 333 80
And moreover, you can COMBINE the query for col A
and B
very naturally!
In [17]: df.query('0 < A < 4 and 150 < B < 400') Out[17]: C A B 1.1 222 40 3.3 222 20 333 11
With a 'float' like index you always want to use it as a column rather than a direct indexing action. These will all work whether the endpoints exist or not.
In [11]: dfOut[11]: CA B 1.1 111 81 222 453.3 222 98 333 135.5 333 896.6 777 98In [12]: x = df.reset_index()
Q1
In [13]: x.loc[(x.A>=3.3)&(x.A<=6.6)]Out[13]: A B C2 3.3 222 983 3.3 333 134 5.5 333 895 6.6 777 98
Q2
In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]Out[14]: A B C2 3.3 222 983 3.3 333 13
Q3
In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]Out[15]: A B C0 1.1 111 811 1.1 222 452 3.3 222 983 3.3 333 134 5.5 333 89
If you want the indices back, just set them. This is a cheap operation.
In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])Out[16]: CA B 1.1 111 81 222 453.3 222 98 333 135.5 333 89
If you REALLY want the actual index values
In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).indexOut[5]: MultiIndex[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]