Why is numpy's einsum faster than numpy's built-in functions?



First off, there's been a lot of past discussion about this on the numpy list. For example, see:

http://numpy-discussion.10968.n7.nabble.com/poor-performance-of-sum-with-sub-machine-word-integer-types-td41.html
http://numpy-discussion.10968.n7.nabble.com/odd-performance-of-sum-td3332.html

Some of it boils down to the fact that einsum is new, and is presumably trying to be better about cache alignment and other memory access issues, while many of the older numpy functions focus on an easily portable implementation over a heavily optimized one. I'm just speculating there, though.


However, some of what you're doing isn't quite an "apples-to-apples" comparison.

In addition to what @Jamie already said, sum uses a more appropriate accumulator for arrays with narrow integer dtypes.

That is, sum is more careful about checking the type of the input and choosing an appropriate accumulator. For example, consider the following:

In [1]: import numpy as np; x = 255 * np.ones(100, dtype=np.uint8)

In [2]: x
Out[2]:
array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255], dtype=uint8)

Note that the sum is correct:

In [3]: x.sum()
Out[3]: 25500
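That works because sum upcasts narrow integer inputs to the platform default integer before accumulating. You can see the accumulator's type by checking the dtype of the result (uint64 is what I'd expect on a 64-bit build; the exact width is platform-dependent):

>>> x.sum().dtype   # sum accumulates in the platform default (unsigned) integer
dtype('uint64')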

While einsum will give the wrong result, because it accumulates in the input's uint8 dtype and overflows (25500 mod 256 = 156):

In [4]: np.einsum('i->', x)
Out[4]: 156
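You can reproduce einsum's answer by forcing sum to keep the uint8 accumulator via its dtype argument, which confirms the difference really is the accumulator rather than the summation itself (a quick sanity check):

>>> x.sum(dtype=np.uint8)   # force a uint8 accumulator: overflows just like einsum
156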

But if we use a less limited dtype, we'll still get the result you'd expect:

In [5]: y = 255 * np.ones(100)

In [6]: np.einsum('i->', y)
Out[6]: 25500.0
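If memory serves, einsum also takes a dtype argument, so you can request a wider accumulator explicitly rather than converting the input array (int64 here is my arbitrary choice; any dtype wide enough will do):

>>> np.einsum('i->', x, dtype=np.int64)   # ask einsum for a wider accumulator
25500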


Now that numpy 1.8 is released, where according to the docs all ufuncs should use SSE2, I wanted to double-check that Seberg's comment about SSE2 was valid.

To perform the test, a fresh Python 2.7 install was created; numpy 1.7 and 1.8 were compiled with icc using standard options on an AMD Opteron core running Ubuntu.

This is the test run both before and after the 1.8 upgrade:

import numpy as np
import timeit

arr_1D=np.arange(5000,dtype=np.double)
arr_2D=np.arange(500**2,dtype=np.double).reshape(500,500)
arr_3D=np.arange(500**3,dtype=np.double).reshape(500,500,500)

print 'Summation test:'
print timeit.timeit('np.sum(arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk->", arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

print 'Power test:'
print timeit.timeit('arr_3D*arr_3D*arr_3D',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ijk,ijk,ijk->ijk", arr_3D, arr_3D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

print 'Outer test:'
print timeit.timeit('np.outer(arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("i,k->ik", arr_1D, arr_1D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

print 'Einsum test:'
print timeit.timeit('np.sum(arr_2D*arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print timeit.timeit('np.einsum("ij,oij->", arr_2D, arr_3D)',
                      'import numpy as np; from __main__ import arr_1D, arr_2D, arr_3D',
                      number=5)/5
print '----------------------\n'

Numpy 1.7.1:

Summation test:
0.172988510132
0.0934836149216
----------------------
Power test:
1.93524689674
0.839519000053
----------------------
Outer test:
0.130380821228
0.121401786804
----------------------
Einsum test:
0.979052495956
0.126066613197

Numpy 1.8:

Summation test:
0.116551589966
0.0920487880707
----------------------
Power test:
1.23683619499
0.815982818604
----------------------
Outer test:
0.131808176041
0.127472200394
----------------------
Einsum test:
0.781750011444
0.129271841049

I think this is fairly conclusive that SSE plays a large role in the timing differences. It should be noted that, on repeating these tests, the timings vary by only ~0.003 s. The remaining difference should be covered in the other answers to this question.
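As an aside, if you want to see what your own numpy was built against, np.show_config() prints the detected libraries and build information; it won't report SSE flags directly on every build, but it's a quick first check:

>>> import numpy as np
>>> np.show_config()   # prints BLAS/LAPACK and other build info for this install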


I think these timings explain what's going on:

a = np.arange(1000, dtype=np.double)

%timeit np.einsum('i->', a)
100000 loops, best of 3: 3.32 us per loop

%timeit np.sum(a)
100000 loops, best of 3: 6.84 us per loop

a = np.arange(10000, dtype=np.double)

%timeit np.einsum('i->', a)
100000 loops, best of 3: 12.6 us per loop

%timeit np.sum(a)
100000 loops, best of 3: 16.5 us per loop

a = np.arange(100000, dtype=np.double)

%timeit np.einsum('i->', a)
10000 loops, best of 3: 103 us per loop

%timeit np.sum(a)
10000 loops, best of 3: 109 us per loop

So you have an almost constant 3 us of extra overhead when calling np.sum compared to np.einsum: they run equally fast once going, but np.sum takes a little longer to get started. Why could that be? My money is on the following:

a = np.arange(1000, dtype=object)

%timeit np.einsum('i->', a)
Traceback (most recent call last):
...
TypeError: invalid data type for einsum

%timeit np.sum(a)
10000 loops, best of 3: 20.3 us per loop

Not sure what is going on exactly, but it seems that np.einsum skips some checks that would extract type-specific functions for the multiplications and additions, and goes directly with * and + for standard C types only.
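One concrete place where np.sum spends extra time is its Python-level wrapper: np.sum ends up dispatching to np.add.reduce, so calling the ufunc reduction directly should recover part of the constant overhead (a rough check; absolute timings will differ per machine):

a = np.arange(1000, dtype=np.double)
%timeit np.add.reduce(a)   # the ufunc reduction that np.sum dispatches to
%timeit np.sum(a)          # same result, plus the Python-level dispatch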


The multidimensional cases are no different:

n = 10; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)

%timeit np.einsum('ijk->', a)
100000 loops, best of 3: 3.79 us per loop

%timeit np.sum(a)
100000 loops, best of 3: 7.33 us per loop

n = 100; a = np.arange(n**3, dtype=np.double).reshape(n, n, n)

%timeit np.einsum('ijk->', a)
1000 loops, best of 3: 1.2 ms per loop

%timeit np.sum(a)
1000 loops, best of 3: 1.23 ms per loop

So, a mostly constant overhead, not faster running once they get down to it.
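If you want to put a number on that overhead, you can fit the simple model t(n) ≈ overhead + n/throughput from timings at two sizes. A minimal sketch (the sizes and repetition counts are arbitrary choices, and per_call is a hypothetical helper, not a numpy function):

import timeit

def per_call(stmt, n, reps=1000):
    # average seconds per call of stmt on an array of n doubles
    setup = "import numpy as np; a = np.arange(%d, dtype=np.double)" % n
    return timeit.timeit(stmt, setup, number=reps) / reps

for stmt in ("np.einsum('i->', a)", "np.sum(a)"):
    t_small = per_call(stmt, 1000)
    t_large = per_call(stmt, 100000)
    # t(n) = c + k*n  =>  k = (t_large - t_small) / (100000 - 1000)
    k = (t_large - t_small) / 99000.0
    c = t_small - k * 1000
    print("%s: ~%.2f us overhead per call" % (stmt, c * 1e6))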