optimize pandas query on multiple columns / multiindex
So there are 2 issues here.
This is an artifice that makes the syntax a little nicer
In [111]: idx = pd.IndexSlice
1) Your .query
does not have the correct precedence. The &
operator has a higher precedence than comparison operators like <=
and needs parentheses around its left and right operands.
In [102]: result3 = mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
This is your original query using MultiIndex slicers
In [103]: result1 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
Here is a chained version of this query. IOW its a repeated selection on the result set.
In [104]: result2 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
Always confirm correctness before working on performance
In [109]: (result1==result2).all().all()Out[109]: TrueIn [110]: (result1==result3).all().all()Out[110]: True
Performance
The .query
IMHO will actually scale very well and uses multi-cores. For a large selection set this will be the way to go
In [107]: %timeit mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()10 loops, best of 3: 107 ms per loop
2) The original multi-index slicing. There is an issues here, see below. I am not sure exactly why this is non-performant, and will investigate this here
In [106]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]1 loops, best of 3: 4.34 s per loop
Repeated selections make this quite performant. Note that I won't normally recommend one do this as you cannot assign to it, but for this purpose it is ok.
In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]10 loops, best of 3: 140 ms per loop