Vectorized way of calculating row-wise dot product of two matrices with Scipy



A straightforward way to do that is:

import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])
np.sum(a * b, axis=1)

which avoids the Python loop and is faster in cases like:

def npsumdot(x, y):
    return np.sum(x * y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result

timeit npsumdot(np.random.rand(500000, 50), np.random.rand(500000, 50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000, 50), np.random.rand(500000, 50))
# 1 loops, best of 3: 1.58 s per loop
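If you want to sanity-check that the vectorized version really matches the loop, a quick comparison with `np.allclose` could look like this (array size is an arbitrary choice):

```python
import numpy as np

def npsumdot(x, y):
    return np.sum(x * y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result

# Arbitrary test data; any matching shapes work.
x = np.random.rand(1000, 50)
y = np.random.rand(1000, 50)

# The two implementations should agree up to floating-point rounding.
assert np.allclose(npsumdot(x, y), loopdot(x, y))
```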


Check out numpy.einsum for another method:

In [52]: a
Out[52]:
array([[1, 2, 3],
       [3, 4, 5]])

In [53]: b
Out[53]:
array([[1, 2, 3],
       [1, 2, 3]])

In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])

Looks like einsum is a bit faster than inner1d (from numpy.core.umath_tests):

In [94]: %timeit inner1d(a, b)
1000000 loops, best of 3: 1.8 us per loop

In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop

In [96]: a = random.randn(10, 100)

In [97]: b = random.randn(10, 100)

In [98]: %timeit inner1d(a, b)
100000 loops, best of 3: 2.89 us per loop

In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop

Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
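For example, a minimal benchmarking sketch with the standard-library timeit module (the array size here is an arbitrary choice, not taken from the timings above):

```python
import timeit

import numpy as np

# Arbitrary test size -- adjust to match your actual workload.
a = np.random.rand(100000, 50)
b = np.random.rand(100000, 50)

# Time the two vectorized variants on the installed NumPy version.
for name, stmt in [
    ("sum", "np.sum(a * b, axis=1)"),
    ("einsum", "np.einsum('ij,ij->i', a, b)"),
]:
    t = timeit.timeit(stmt, globals={"np": np, "a": a, "b": b}, number=10)
    print(f"{name}: {t / 10 * 1e3:.2f} ms per call")
```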


I played around with this and found inner1d the fastest. That function, however, is internal, so a more robust approach is to use

numpy.einsum("ij,ij->i", a, b)
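As a quick check (a sketch, reusing the same small arrays as above), the einsum result matches the row-wise dot products:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])

# Row-wise dot products via einsum; equals np.sum(a * b, axis=1).
r = np.einsum("ij,ij->i", a, b)
print(r)  # → [14 26]
```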

Even better is to align your memory such that the summation happens in the first dimension, e.g.,

a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)

For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum advantage occurs around n = 2 * 10 ** 4, when the level-2 cache is exhausted.

Note also that the transposed summation (np.sum over axis=0) is much faster than its untransposed equivalent.
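To see that the transposed variant computes the same values, a small sketch (the value of n is an arbitrary choice):

```python
import numpy as np

n = 1000
a = np.random.rand(n, 3)
b = np.random.rand(n, 3)

# Contiguous transposed copies, so the reduction runs over the first axis.
aT = np.ascontiguousarray(a.T)
bT = np.ascontiguousarray(b.T)

row_wise = np.einsum("ij,ij->i", a, b)
col_wise = np.einsum("ij,ij->j", aT, bT)

# Same numbers, different memory layout.
assert np.allclose(row_wise, col_wise)
```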

[Plot: runtimes of sum, einsum, sum.T, einsum.T, and inner1d vs. len(a)]

[Plot: the same runtimes relative to einsum.T]

The plots were created with perfplot (a small project of mine):

import numpy
from numpy.core.umath_tests import inner1d
import perfplot

def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)

b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)