How to query MultiIndex index columns values in pandas How to query MultiIndex index columns values in pandas python python

How to query MultiIndex index columns values in pandas


To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):

In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]In [537]: result_dfOut[537]:           CA   B      3.3 222  43    333  595.5 333  56

Hence, to get for example the 'A' index values, if still required:

In [538]: result_df.index.get_level_values('A')Out[538]: Index([3.3, 3.3, 5.5], dtype=object)

The problem is, that in large data frames the performance of by index selection worse by 10% than the sorted regular rows selection. And in repetitive work, looping, the delay accumulated. See example:

In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)In [559]: len(df)Out[559]: 12857In [560]: df.sort(inplace=True)In [561]: df_without_index = df.reset_index()In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]1000 loops, best of 3: 562 µs per loopIn [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]1000 loops, best of 3: 507 µs per loop


For better readability, we can simply use the query() Method, to avoid the lengthy df.index.get_level_values() and reset_index/set_index to and fro.

Here is the target DataFrame:

In [12]: df                                                                    Out[12]:                                                                                 C                                                                    A   B                                                                          1.1 111  68                                                                        222  40                                                                    3.3 222  20                                                                        333  11                                                                    5.5 333  80                                                                    6.6 777  51 

Answer for Q1 (A in range [3.3, 6.6]):

In [13]: df.query('3.3 <= A <= 6.6') # for closed interval                       Out[13]:                                                                                 C                                                                    A   B                                                                          3.3 222  20                                                                        333  11                                                                    5.5 333  80                                                                    6.6 777  51                                                                    In [14]: df.query('3.3 < A < 6.6') # for open interval                         Out[14]:                                                                                 C                                                                    A   B                                                                          5.5 333  80

and of course one can play around with <, <=, >, >= for any kind of inclusion.


Similarly, answer for Q2 (A in range [2.0, 4.0]):

In [15]: df.query('2.0 <= A <= 4.0')                                        Out[15]:                                                                              C                                                                 A   B                                                                       3.3 222  20                                                                     333  11 

Answer for Q3 (B in range [111, 500]):

In [16]: df.query('111 <= B <= 500')                                        Out[16]:                                                                              C                                                                 A   B                                                                       1.1 111  68                                                                     222  40                                                                 3.3 222  20                                                                     333  11                                                                 5.5 333  80

And moreover, you can COMBINE the query for col A and B very naturally!

In [17]: df.query('0 < A < 4 and 150 < B < 400')                            Out[17]:                                                                              C                                                                 A   B                                                                       1.1 222  40                                                                 3.3 222  20                                                                     333  11


With a 'float' like index you always want to use it as a column rather than a direct indexing action. These will all work whether the endpoints exist or not.

In [11]: dfOut[11]:           CA   B      1.1 111  81    222  453.3 222  98    333  135.5 333  896.6 777  98In [12]: x = df.reset_index()

Q1

In [13]: x.loc[(x.A>=3.3)&(x.A<=6.6)]Out[13]:      A    B   C2  3.3  222  983  3.3  333  134  5.5  333  895  6.6  777  98

Q2

In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]Out[14]:      A    B   C2  3.3  222  983  3.3  333  13

Q3

In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]Out[15]:      A    B   C0  1.1  111  811  1.1  222  452  3.3  222  983  3.3  333  134  5.5  333  89

If you want the indices back, just set them. This is a cheap operation.

In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])Out[16]:           CA   B      1.1 111  81    222  453.3 222  98    333  135.5 333  89

If you REALLY want the actual index values

In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).indexOut[5]: MultiIndex[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]