Pandas mask / where methods versus NumPy np.where
I'm using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.
But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:
    twice = df[0]*2
    mask = df[0] > 0.5

    %timeit np.where(mask, twice, df[0])
    # 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit df[0].mask(mask, twice)
    # 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy's version is about 2.3 times faster than pandas.
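The %timeit magic only works inside IPython/Jupyter. As a rough sketch, the same comparison can be run from a plain script with the standard timeit module. The helper name best_time and the reduced array size are my own assumptions, chosen so that the sketch finishes quickly; the absolute numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
n = 100_000  # assumption: far smaller than the 10**7 used in the listings
df = pd.DataFrame(np.random.random(n))

twice = df[0] * 2
mask = df[0] > 0.5


def best_time(func, repeat=5, number=10):
    """Hypothetical helper: best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


t_numpy = best_time(lambda: np.where(mask, twice, df[0]))
t_pandas = best_time(lambda: df[0].mask(mask, twice))

print(f"np.where:    {t_numpy * 1e3:.3f} ms per call")
print(f"Series.mask: {t_pandas * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement; the ratio between the two timings is what matters here, not the absolute values.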
So let's profile both functions to see the difference. Profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
I'm on Linux and use perf. For the numpy version we get (for the listing see Appendix A):
    >>> perf record python np_where.py
    >>> perf report

    Overhead  Command  Shared Object                               Symbol
      68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] PyArray_Where
       8,96%  python   [unknown]                                   [k] 0xffffffff8140290c
       1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so      [.] rk_random
As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact clear_page) - I ran without root privileges, so the symbol is not resolved.
And for pandas we get (see Appendix B for code):
    >>> perf record python pd_mask.py
    >>> perf report

    Overhead  Command  Shared Object                                Symbol
      37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      23,36%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
      19,78%  python   [unknown]                                    [k] 0xffffffff8140290c
       3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
Quite a different situation:
- pandas doesn't use PyArray_Where under the hood - the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality.
- there is some heavy memory copying going on - __memmove_ssse3_back uses about 23% of the time! Probably some of the kernel's functions are also connected to memory accesses.
Actually, pandas-0.19 used PyArray_Where under the hood; for the older version the perf report would look like:
    Overhead  Command  Shared Object      Symbol
      32,42%  python   multiarray.so      [.] PyArray_Where
      30,25%  python   libc-2.23.so       [.] __memmove_ssse3_back
      21,31%  python   [kernel.kallsyms]  [k] clear_page
       1,72%  python   [kernel.kallsyms]  [k] __schedule
So basically it would use np.where under the hood plus some overhead (above all data copying, see __memmove_ssse3_back) back then.
I see no scenario where pandas could become faster than numpy in pandas' version 0.19 - it just adds overhead to numpy's functionality. Pandas' version 0.23.3 is an entirely different story - here the numexpr module is used, so it is quite possible that there are scenarios in which pandas' version is (at least slightly) faster.
I'm not sure this memory copying is really necessary - maybe one could even call it a performance bug, but I just don't know enough to be certain.
We could help pandas not to copy by peeling away some indirections (passing np.array instead of pd.Series). For example:
    %timeit df[0].mask(mask.values > 0.5, twice.values)
    # 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now pandas is only about 25% slower. perf reports:
    Overhead  Command  Shared Object                                Symbol
      50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
      14,12%  python   [unknown]                                    [k] 0xffffffff8140290c
       9,93%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
       4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
       2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not
Much less data copying, but still more than in the numpy version, and this copying is mostly responsible for the remaining overhead.
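When swapping pd.Series arguments for plain arrays, it is worth checking that the result does not change: with Series arguments pandas aligns on the index, while arrays are matched by position. A minimal sketch of such a check, assuming the default RangeIndex as in the listings (the small array size is my own choice):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random(1000))  # assumption: small size, default index

twice = df[0] * 2
mask = df[0] > 0.5

# Reference result from numpy.
expected = np.where(mask, twice, df[0])

# pandas with Series arguments (aligned on the index).
via_series = df[0].mask(mask, twice)

# pandas with plain numpy arrays (matched by position).
via_arrays = df[0].mask(mask.values, twice.values)

assert np.array_equal(expected, via_series.values)
assert np.array_equal(expected, via_arrays.values)
```

If the Series involved had different or shuffled indexes, the two pandas variants could legitimately disagree, so the array shortcut is only safe when all operands share the same index order.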
My key take-aways from it:
- pandas has the potential to be at least slightly faster than numpy (because it uses numexpr under the hood). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.
- when the performance of where/mask is the bottleneck, I would use numba/cython to improve the performance - see my rather naive attempts to use numba and cython further below.
The idea is to take the np.where(df[0] > 0.5, df[0]*2, df[0]) version and to eliminate the need to create a temporary, i.e. df[0]*2.
As proposed by @max9111, using numba:
    import numba as nb

    @nb.njit
    def nb_where(df):
        n = len(df)
        output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i]>0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output

    assert(np.where(df[0] > 0.5, twice, df[0])==nb_where(df[0].values)).all()

    %timeit np.where(df[0] > 0.5, df[0]*2, df[0])
    # 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit nb_where(df[0].values)
    # 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which is about a factor of 5 faster than the numpy version!
And here is my far less successful attempt to improve the performance with the help of Cython:
    %%cython -a
    cimport numpy as np
    import numpy as np
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def cy_where(double[::1] df):
        cdef int i
        cdef int n = len(df)
        cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
        for i in range(n):
            if df[i]>0.5:
                output[i] = 2.0*df[i]
            else:
                output[i] = df[i]
        return output

    assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()

    %timeit cy_where(df[0].values)
    # 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This gives a speed-up of about 25%. I'm not sure why cython is so much slower than numba, though.
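For completeness, the idea of avoiding the df[0]*2 temporary can also be sketched in plain NumPy, without numba or cython, by using the out and where arguments of the np.multiply ufunc. This is not something I measured above, so I make no claim about its speed; the function name np_where_no_temp is hypothetical. Note that it still allocates the boolean mask and the output array.

```python
import numpy as np


def np_where_no_temp(values):
    """Hypothetical helper: double entries > 0.5 without building values*2."""
    # Start from a copy, so entries that are not selected keep their value.
    out = values.copy()
    # Multiply only where the condition holds, writing directly into `out`.
    np.multiply(values, 2.0, out=out, where=values > 0.5)
    return out


np.random.seed(0)
arr = np.random.random(1000)  # assumption: small array just for the check

result = np_where_no_temp(arr)
assert np.array_equal(result, np.where(arr > 0.5, arr * 2, arr))
```

With a pandas column one would pass the underlying array, e.g. np_where_no_temp(df[0].values), just like for nb_where and cy_where.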
Listings:
A: np_where.py:
    import pandas as pd
    import numpy as np

    np.random.seed(0)

    n = 10000000
    df = pd.DataFrame(np.random.random(n))

    twice = df[0]*2
    for _ in range(50):
        np.where(df[0] > 0.5, twice, df[0])
B: pd_mask.py:
    import pandas as pd
    import numpy as np

    np.random.seed(0)

    n = 10000000
    df = pd.DataFrame(np.random.random(n))

    twice = df[0]*2
    mask = df[0] > 0.5
    for _ in range(50):
        df[0].mask(mask, twice)