
Difference in performance between numpy and matlab


It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When using NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on smaller arrays or scalars in a Python loop.

What's wrong with a Python loop, you ask? Every iteration through the Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power -- indexing means something different for strings than for dicts, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.

But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, to get better performance, try to limit yourself to NumPy function calls on whole arrays.
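To see how much that per-call overhead matters, here is a minimal, hypothetical benchmark (my own illustration, not part of the original answer) comparing an element-by-element Python loop against a single whole-array NumPy call:

import numpy as np
import timeit

x = np.random.random(10 ** 6)

def loop_sum(arr):
    # one __getitem__ and one __iadd__ per element
    total = 0.0
    for i in range(len(arr)):
        total += arr[i]
    return total

def vector_sum(arr):
    # a single call into compiled code
    return np.sum(arr)

print(timeit.timeit(lambda: loop_sum(x), number=10))    # typically far slower
print(timeit.timeit(lambda: vector_sum(x), number=10))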

So, remove the for-loop; use "vectorized" equations when possible. For example, instead of

for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])

you can compute delta3 for each i all at once:

delta3 = -(x-a3)*a3*(1-a3)

Whereas in the for-loop delta3 is a vector (one row per iteration), with the vectorized equation delta3 is a matrix holding all the rows at once.


Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )
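A minimal sketch of that lifting (with made-up stand-ins for sparse.beta, sparse.rho, and rhoest, since they are not defined in this snippet):

import numpy as np

# hypothetical stand-ins for the values in the original code
beta, rho = 3.0, 0.05
rhoest = np.random.random((25, 1))
m = 100

# loop-invariant: compute once, before the loop
sum2 = beta * (-float(rho) / rhoest + float(1.0 - rho) / (1.0 - rhoest))

for i in range(m):
    pass  # the loop body reuses the precomputed sum2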

Here is a runnable example with an alternative implementation (alt) of your code (orig).

My timeit benchmark shows a 6.8x improvement in speed:

In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop

In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop

import numpy as np

class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25

sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)

x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2

def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise

Tip: Notice that I left the shapes of all the intermediate arrays in the comments. Knowing the shape of the arrays helped me understand what your code was doing. The shapes can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know if an operation is sensible. For example, when you compute

np.dot(A, B)

and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) will be an array of shape (n, p).
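A quick check of that rule:

import numpy as np

A = np.ones((3, 4))            # (n, m)
B = np.ones((4, 5))            # (m, p)
print(np.dot(A, B).shape)      # (3, 5), i.e. (n, p)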


It can help to build your arrays in C_CONTIGUOUS order (at least, if using np.dot). There might be as much as a 3x speed-up by doing so:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS -- and the same relationship for y and yf.

import numpy as np

m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')

y = np.random.random((m, n))
yf = np.asarray(y, order='F')

assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))
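If you are unsure what layout an array has, you can inspect its flags attribute; for example:

import numpy as np

x = np.random.random((64, 10 ** 4))   # C-ordered by default
xf = np.asarray(x, order='F')         # Fortran-ordered copy

print(x.flags['C_CONTIGUOUS'])        # True
print(xf.flags['F_CONTIGUOUS'])       # True
print(np.ascontiguousarray(xf).flags['C_CONTIGUOUS'])  # True: converts back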

%timeit benchmarks show the difference in speed:

In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop

In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop

In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop

In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop

Regarding benchmarking in Python:

It can be misleading to use the difference between pairs of time.time() calls to benchmark the speed of code in Python. You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds' worth of repetitions) to avoid errors due to poor resolution in the clock timer and to reduce the significance of time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I'm essentially using that to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.

I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
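For reference, here is a minimal sketch of using the timeit module directly in a script rather than through IPython's %timeit:

import timeit

setup = "import numpy as np; x = np.random.random((64, 10000)); y = np.random.random((10000, 64))"

# repeat several trials; the minimum is the most reliable estimate,
# and timeit disables garbage collection during timing by default
times = timeit.repeat("np.dot(x, y)", setup=setup, repeat=5, number=100)
print(min(times) / 100, "seconds per call")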


I would start with in-place operations to avoid allocating new arrays every time:

partial_j1 += np.dot(delta2, a1[i,:].reshape(1,a1.shape[1]))
partial_j2 += np.dot(delta3, a2[i,:].reshape(1,a2.shape[1]))
partial_b1 += delta2
partial_b2 += delta3

You can replace this expression:

a1[i,:].reshape(1,a1.shape[1])

with a simpler and faster one (thanks to Bi Rico):

a1[i:i+1]
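The two are equivalent because slicing with i:i+1 already returns a 2-D row view, so no reshape is needed; for example:

import numpy as np

a1 = np.random.random((4, 3))
print(a1[0].shape)      # (3,)   -- plain indexing gives a 1-D row
print(a1[0:1].shape)    # (1, 3) -- slicing keeps it 2-D, no copy, no reshape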

Also, this line:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest))

seems to be the same at each iteration of the loop, so you don't need to recompute it.

And, a probably minor optimization, you can replace all the occurrences of x[i,:] with x[i].

Finally, if you can afford to allocate m times more memory, you can follow unutbu's suggestion and vectorize the loop, replacing:

for i in range(m):
    delta3 = -(x[i]-a3[i])*a3[i]*(1 - a3[i])

with:

delta3 = -(x-a3)*a3*(1-a3)

And you can always use Numba and gain significant speed without vectorizing (and without using more memory).
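As an illustration only (my sketch, not code from the answer), a Numba-compiled version of the original loop body could look like this, assuming Numba is installed:

import numpy as np
from numba import njit

@njit
def delta3_loop(x, a3):
    # the same per-row computation as the Python loop,
    # but compiled to machine code by Numba
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = -(x[i] - a3[i]) * a3[i] * (1 - a3[i])
    return out

x = np.random.random((10 ** 4, 64))
a3 = np.random.random((10 ** 4, 64))
assert np.allclose(delta3_loop(x, a3), -(x - a3) * a3 * (1 - a3))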


Differences in performance between numpy and matlab have always frustrated me. They often boil down, in the end, to the underlying LAPACK libraries. As far as I know, matlab uses the full ATLAS LAPACK by default, while numpy uses a LAPACK light. Matlab reckons people don't care about space and bulk, while numpy reckons people do. Similar question with a good answer.
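If you want to see which BLAS/LAPACK libraries your own NumPy build links against, NumPy provides a helper:

import numpy as np

# prints the BLAS/LAPACK configuration NumPy was built with
np.show_config()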