Understanding the difference between vectorizing in Numpy and multi-threading of vectorized expression via Numexpr


NumPy may in some cases delegate to a library that spreads the work across several cores. Whether that happens depends on the library in question and has little to do with NumPy's Python-level code. So yes, NumPy and any other library can overcome these restrictions when the heavy lifting is not written in Python. There are even some libraries offering GPU-accelerated functions.

NumExpr uses the same approach to bypass the GIL. From its homepage:

Also, numexpr implements support for multi-threading computations straight into its internal virtual machine, written in C. This allows to bypass the GIL in Python

However, there are some fundamental differences between NumPy and NumExpr. NumPy concentrates on providing a good Pythonic interface for array operations, while NumExpr has a much narrower scope and its own expression language. When NumPy evaluates c = 3*a + 4*b with array operands, two temporary arrays are created along the way (for 3*a and 4*b). NumExpr can optimize such a calculation so that the multiplications and the addition are performed element by element, without materializing any intermediate arrays.
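As a minimal sketch of that difference (the array sizes and names here are my own, not from the NumExpr docs), both lines compute the same result, but NumPy allocates the temporaries while NumExpr's virtual machine streams over the operands blockwise:

import numpy as np
import numexpr as ne

a = np.random.random(10_000_000)
b = np.random.random(10_000_000)

c_np = 3*a + 4*b                 # NumPy: temporaries for 3*a and 4*b are allocated
c_ne = ne.evaluate("3*a + 4*b")  # NumExpr: the whole expression is fused, no full-size intermediates

assert np.allclose(c_np, c_ne)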

This leads to some interesting comparisons. The following tests were carried out on a 4-core/8-thread i7 processor, with timing done via IPython's %timeit:

import numpy as np
import numexpr as ne

def addtest_np(a, b): a + b
def addtest_ne(a, b): ne.evaluate("a+b")

def addtest_np_inplace(a, b): a += b
def addtest_ne_inplace(a, b): ne.evaluate("a+b", out=a)

def addtest_np_constant(a): a + 3
def addtest_ne_constant(a): ne.evaluate("a+3")

def addtest_np_constant_inplace(a): a += 3
def addtest_ne_constant_inplace(a): ne.evaluate("a+3", out=a)

a_small = np.random.random((100,10))
b_small = np.random.random((100,10))

a_large = np.random.random((100000, 1000))
b_large = np.random.random((100000, 1000))

# results: (time given is in nanoseconds per element with small/large array)
# np: NumPy
# ne8: NumExpr with 8 threads
# ne1: NumExpr with 1 thread
#
# a+b:
#   np:   2.25 / 4.01
#   ne8: 22.6  / 3.22
#   ne1: 22.6  / 4.21
# a += b:
#   np:   1.5  / 1.26
#   ne8: 36.8  / 1.18
#   ne1: 36.8  / 1.48
# a+3:
#   np:   4.8  / 3.62
#   ne8: 10.9  / 3.09
#   ne1: 20.2  / 4.04
# a += 3:
#   np:   3.6  / 0.79
#   ne8: 34.9  / 0.81
#   ne1: 34.4  / 1.06
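For reference, the numbers above can be collected roughly like this in an IPython session (the exact invocations are not shown here, so this is an assumption about the procedure):

# with the definitions above loaded in IPython:
%timeit addtest_np(a_large, b_large)
%timeit addtest_ne(a_large, b_large)
# divide the reported time by a_large.size to get ns/element;
# rerun after ne.set_num_threads(1) for the single-threaded numbers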

Of course, the timing method used here is not very accurate, but certain general trends are visible:

  • NumPy uses fewer clock cycles (np < ne1)
  • parallelism helps a bit with very large arrays (10-20 %)
  • NumExpr is much slower with small arrays
  • NumPy is very strong with in-place operations

NumPy does not parallelize simple arithmetic operations, but as the numbers above show, it does not really matter: the speed is mostly limited by memory bandwidth, not processing power.

If we do something more complicated, things change.

np.sin(a_large)               # 19.4 ns/element
ne.evaluate("sin(a_large)")   # 5.5 ns/element

The speed is no longer limited by memory bandwidth. To check that the gain really comes from threading (and not from NumExpr sometimes using faster math libraries):

ne.set_num_threads(1)
ne.evaluate("sin(a_large)")    # 34.3 ns/element
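To sweep the thread count more systematically, a quick sketch like the following works as a plain script (a smaller array than above, to keep memory use modest):

import timeit
import numpy as np
import numexpr as ne

a_large = np.random.random((10000, 1000))

for n in (1, 2, 4, 8):
    ne.set_num_threads(n)
    t = timeit.timeit('ne.evaluate("sin(a_large)")',
                      globals=globals(), number=5) / 5
    print(f"{n} threads: {t / a_large.size * 1e9:.1f} ns/element")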

Here, parallelism helps a lot.

NumPy may use parallel processing for more complicated linear algebra operations, such as matrix inversion. NumExpr does not support these operations, so there is no meaningful comparison; the actual speed depends on the library used (BLAS/ATLAS/LAPACK). The same holds for complex operations such as FFT, where performance depends on the library. (AFAIK, NumPy/SciPy does not have FFTW support yet.)
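For example, whether the following runs on one core or on several is decided entirely by the BLAS/LAPACK build NumPy is linked against (e.g. OpenBLAS or MKL), not by NumPy itself:

import numpy as np

m = np.random.random((3000, 3000))

minv = np.linalg.inv(m)   # dispatched to the linked LAPACK
mprod = m @ m             # dispatched to the linked BLAS; a threaded build
                          # uses several cores here on its own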

In summary, there are cases where NumExpr is very fast and useful, and there are cases where NumPy is fastest. If you have large arrays and elementwise operations, NumExpr is very strong. However, it should be noted that some parallelism (or even spreading the calculation across machines) is often quite easy to incorporate into the code with multiprocessing or something equivalent, as in the sketch below.
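A rough sketch of that last point (the chunking and the worker function are illustrative, not from the original): split the array, let each worker process apply the function under its own GIL, and reassemble. Note that the chunks are pickled to the workers, so this only pays off when the per-element work is expensive:

import numpy as np
from multiprocessing import Pool

def worker(chunk):
    # runs in a separate process, unconstrained by the parent's GIL
    return np.sin(chunk)

if __name__ == "__main__":
    a = np.random.random(8_000_000)
    with Pool(4) as pool:
        result = np.concatenate(pool.map(worker, np.array_split(a, 4)))
    assert np.allclose(result, np.sin(a))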


The question about "multi-processing" versus "multi-threading" is a bit tricky, as the terminology is wobbly. In Python a "thread" is something that runs under the same GIL, but at the operating-system level threads and processes may be essentially the same thing; in Linux, for example, there is no fundamental difference between them.


NumExpr is also nice for calculations that combine array multiplication and reduction in one go, including when the input is a NumPy memmap. For an operation like (ij,jk->i), NumExpr can be an awesome one-liner, whereas in NumPy it becomes (ij,jk -> ik -> i). See numpy - python - way to do fast matrix multiplication and reduction while working in memmaps and CPU - Stack Overflow.
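A sketch of that pattern (names are illustrative; the linked answer may differ in detail): mathematically, sum over j,k of A[i,j]*B[j,k] equals the sum over j of A[i,j]*rowsum(B)[j], so the k-reduction collapses into a cheap vector, and NumExpr then fuses the remaining multiply-and-reduce into a single pass with no (i,k) intermediate:

import numpy as np
import numexpr as ne

A = np.random.random((2000, 500))
B = np.random.random((500, 800))

r_np = (A @ B).sum(axis=1)   # NumPy: materializes the full (2000, 800) intermediate

s = B.sum(axis=1)            # rowsum(B), shape (500,)
r_ne = ne.evaluate("sum(A * s, axis=1)")  # NumExpr: one fused multiply-and-reduce pass

assert np.allclose(r_np, r_ne)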