
Pandas mask / where methods versus NumPy np.where


I'm using pandas 0.23.3 and Python 3.6; with these versions I can see a real difference in running time only for your second example.

But let's investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:

twice = df[0]*2
mask = df[0] > 0.5

%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy's version is about 2.3 times faster than pandas.
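For the record, the two calls compute the same values (mask replaces entries where the condition is True and keeps the original ones elsewhere, exactly like np.where), so the timings really do compare like with like. With the setup above, a quick sanity check:

assert (np.where(mask, twice, df[0]) == df[0].mask(mask, twice).values).all()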

So let's profile both functions to see the difference - profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.

I'm on Linux and use perf. For the numpy version we get (for the listing see Appendix A):

>>> perf record python np_where.py
>>> perf report

Overhead  Command  Shared Object                                Symbol
  68,50%  python   multiarray.cpython-36m-x86_64-linux-gnu.so   [.] PyArray_Where
   8,96%  python   [unknown]                                    [k] 0xffffffff8140290c
   1,57%  python   mtrand.cpython-36m-x86_64-linux-gnu.so       [.] rk_random

As we can see, the lion's share of the time is spent in PyArray_Where - about 69%. The unknown symbol is a kernel function (as a matter of fact clear_page) - I ran without root privileges, so the symbol is not resolved.

And for pandas we get (see Appendix B for code):

>>> perf record python pd_mask.py
>>> perf report

Overhead  Command  Shared Object                                Symbol
  37,12%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
  23,36%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
  19,78%  python   [unknown]                                    [k] 0xffffffff8140290c
   3,32%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
   1,48%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not

Quite a different situation:

  • pandas doesn't use PyArray_Where under the hood - the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality (a quick way to double-check this follows right after this list).
  • there is some heavy memory copying going on - __memmove_ssse3_back uses about 23% of the time! Probably some of the kernel functions are also connected to memory accesses.
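If you want to verify the numexpr attribution yourself, pandas exposes an option to switch its numexpr-backed evaluation path off. This is just a quick experiment, not part of the measurements above - with the option disabled, mask should fall back to plain numpy routines, and you can compare against the 143 ms baseline:

# disable pandas' numexpr acceleration and re-run the benchmark
pd.set_option("compute.use_numexpr", False)
%timeit df[0].mask(mask, twice)

# restore the default afterwards
pd.set_option("compute.use_numexpr", True)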

Actually, pandas 0.19 used PyArray_Where under the hood; for that older version the perf report would look like:

Overhead  Command        Shared Object                     Symbol
  32,42%  python         multiarray.so                     [.] PyArray_Where
  30,25%  python         libc-2.23.so                      [.] __memmove_ssse3_back
  21,31%  python         [kernel.kallsyms]                 [k] clear_page
   1,72%  python         [kernel.kallsyms]                 [k] __schedule

So back then pandas basically used np.where under the hood, plus some overhead (above all data copying, see __memmove_ssse3_back).

I see no scenario in which pandas could become faster than numpy with pandas version 0.19 - it just adds overhead to numpy's functionality. Pandas version 0.23.3 is an entirely different story - here the numexpr module is used, and it is very possible that there are scenarios for which pandas' version is (at least slightly) faster.

I'm not sure this memory copying is really called for/necessary - maybe one could even call it a performance bug - but I just don't know enough to be certain.

We could help pandas avoid copying by peeling away some indirections (passing np.array instead of pd.Series). For example:

%timeit df[0].mask(df[0].values > 0.5, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now pandas is only about 25% slower. The perf report says:

Overhead  Command  Shared Object                                Symbol
  50,81%  python   interpreter.cpython-36m-x86_64-linux-gnu.so  [.] vm_engine_iter_task
  14,12%  python   [unknown]                                    [k] 0xffffffff8140290c
   9,93%  python   libc-2.23.so                                 [.] __memmove_ssse3_back
   4,61%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] DOUBLE_isnan
   2,01%  python   umath.cpython-36m-x86_64-linux-gnu.so        [.] BOOL_logical_not

Much less data copying, but still more than in the numpy version, and this copying is mostly responsible for the remaining overhead.

My key takeaways from this:

  • pandas has the potential to be at least slightly faster than numpy (because it no longer just wraps np.where but uses numexpr). However, pandas' somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.

  • when the performance of where/mask is the bottleneck, I would use numba/cython to improve the performance - see my rather naive attempts with numba and cython further below.


The idea is to take the

np.where(df[0] > 0.5, df[0]*2, df[0])

version and eliminate the need to create a temporary - i.e., df[0]*2.
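For comparison, the same idea can be expressed in plain NumPy by doubling only the selected entries in place. This is just a sketch with ad-hoc variable names; it avoids materializing the full-length df[0]*2, although the boolean mask and the fancy-indexed slice are still allocated:

values = df[0].values           # work on the underlying ndarray
result = values.copy()          # one copy instead of building df[0]*2
result[values > 0.5] *= 2       # double only the selected entries in place

This still touches the data more than once, which is where a fused loop in Numba or Cython can help.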

As proposed by @max9111, using numba:

import numba as nb
import numpy as np

@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()

%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is about a factor of 5 faster than the numpy version!
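If that is still not fast enough on large arrays, a parallel Numba variant is a natural next step. This is only a sketch and I haven't timed it; nb_where_parallel is a made-up name, and the gain depends on core count and memory bandwidth:

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_where_parallel(arr):
    # same loop as nb_where, but nb.prange lets numba split it across threads
    n = len(arr)
    output = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):
        if arr[i] > 0.5:
            output[i] = 2.0*arr[i]
        else:
            output[i] = arr[i]
    return output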

And here is my far less successful attempt to improve the performance with the help of Cython:

%%cython -a
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output

# in a separate cell:
assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()

%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This gives a 25% speed-up. I'm not sure why Cython is so much slower than Numba, though.


Listings:

A: np_where.py:

import pandas as pd
import numpy as np

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2

for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])

B: pd_mask.py:

import pandas as pd
import numpy as np

np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
mask = df[0] > 0.5

for _ in range(50):
    df[0].mask(mask, twice)