optimize pandas query on multiple columns / multiindex

So there are 2 issues here.

First, pd.IndexSlice is a helper that makes the MultiIndex slicing syntax a little nicer:

In [111]: idx = pd.IndexSlice

1) Your .query does not have the correct precedence. The & operator has a higher precedence than comparison operators like <= and needs parentheses around its left and right operands.

In [102]: result3 = mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
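The precedence point can be checked in plain Python with the standard `ast` module. This is only a sketch of how Python itself parses such an expression; the names `lo_a`, `lo_b`, `A` and `B` are hypothetical, and pandas may preprocess query strings before parsing them, so explicit parentheses remain the unambiguous choice inside `.query`.

```python
import ast

# Without parentheses, `&` binds tighter than `<=`, so Python reads this as
# the chained comparison  lo_a <= (A & lo_b) <= B.
unparenthesized = ast.parse("lo_a <= A & lo_b <= B", mode="eval").body

# With parentheses, `&` combines the two finished comparisons.
parenthesized = ast.parse("(lo_a <= A) & (lo_b <= B)", mode="eval").body

top_unparenthesized = type(unparenthesized).__name__  # "Compare"
top_parenthesized = type(parenthesized).__name__      # "BinOp"

print(top_unparenthesized, top_parenthesized)
```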

This is your original query using MultiIndex slicers

In [103]: result1 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]

Here is a chained version of this query. In other words, it is a repeated selection on the result set.

In [104]: result2 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]

Always confirm correctness before working on performance.

In [109]: (result1==result2).all().all()
Out[109]: True

In [110]: (result1==result3).all().all()
Out[110]: True
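For readers without the original data, the same three-way check can be sketched on a small synthetic frame. Everything here is an assumption made for illustration: the grid of keys, the `value` column, and the bounds `lo` and `hi` are hypothetical stand-ins for `mdt`, `mdt2` and the `test_*`/`eps_*` variables, and `sort_index()` is used in place of `sortlevel()`, which newer pandas versions no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: every combination of four integer keys, one value column.
mi = pd.MultiIndex.from_product([range(6)] * 4, names=list("ABCD"))
indexed = pd.DataFrame({"value": np.arange(len(mi), dtype=float)}, index=mi)  # like mdt2
flat = indexed.reset_index()                                                  # like mdt

idx = pd.IndexSlice
lo, hi = 1, 3  # hypothetical inclusive bounds, the same for every level

# 1) Boolean query on the flat frame, then index and sort the result.
by_query = (
    flat.query("(@lo <= A <= @hi) & (@lo <= B <= @hi) & (@lo <= C <= @hi) & (@lo <= D <= @hi)")
    .set_index(list("ABCD"))
    .sort_index()
)

# 2) A single MultiIndex slicer over all four levels.
by_slicer = indexed.loc[idx[lo:hi, lo:hi, lo:hi, lo:hi], :]

# 3) Chained selection, one level at a time.
by_chain = (
    indexed.loc[idx[lo:hi], :]
    .loc[idx[:, lo:hi], :]
    .loc[idx[:, :, lo:hi], :]
    .loc[idx[:, :, :, lo:hi], :]
)

print(by_query.equals(by_slicer), by_slicer.equals(by_chain))
```

`DataFrame.equals` compares index, columns and values in one call, which is a slightly stricter check than an elementwise `==` followed by `.all().all()`.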

Performance

IMHO .query will actually scale very well, and it uses multiple cores. For a large selection set this will be the way to go.

In [107]: %timeit mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
10 loops, best of 3: 107 ms per loop

2) The original multi-index slicing. There is an issue here, see below. I am not sure exactly why this is non-performant, and will investigate.

In [106]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
1 loops, best of 3: 4.34 s per loop

Repeated selections make this quite performant. Note that I wouldn't normally recommend doing this, as you cannot assign to the result of a chained selection, but for this purpose it is ok.

In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
10 loops, best of 3: 140 ms per loop
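Outside IPython, a comparison like the one above could be reproduced with the standard `timeit` module. This is a rough sketch on hypothetical synthetic data (a 12x12x12x12 grid with made-up bounds `lo` and `hi`), not the original `mdt`/`mdt2` frames, so the absolute numbers and even the ranking may differ from the timings quoted above and across pandas versions.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data: a full grid of four integer keys.
mi = pd.MultiIndex.from_product([range(12)] * 4, names=list("ABCD"))
indexed = pd.DataFrame({"value": range(len(mi))}, index=mi)  # indexed and sorted
flat = indexed.reset_index()                                 # plain columns

idx = pd.IndexSlice
lo, hi = 3, 6  # hypothetical inclusive bounds


def by_query():
    expr = "(@lo <= A <= @hi) & (@lo <= B <= @hi) & (@lo <= C <= @hi) & (@lo <= D <= @hi)"
    return flat.query(expr).set_index(list("ABCD")).sort_index()


def by_slicer():
    return indexed.loc[idx[lo:hi, lo:hi, lo:hi, lo:hi], :]


def by_chain():
    return (
        indexed.loc[idx[lo:hi], :]
        .loc[idx[:, lo:hi], :]
        .loc[idx[:, :, lo:hi], :]
        .loc[idx[:, :, :, lo:hi], :]
    )


# Average seconds per call over a handful of runs of each approach.
runs = 5
timings = {
    name: timeit.timeit(func, number=runs) / runs
    for name, func in [("query", by_query), ("slicer", by_slicer), ("chained", by_chain)]
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```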
