Multi-threaded integer matrix multiplication in NumPy/SciPy
Note that as this answer ages, NumPy may gain optimized integer support. Please verify whether this approach is still faster on your setup.
- Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and perform these in parallel. This is relatively easy to implement with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.
Implementation:
import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m), where
    n * nrows = arr.shape[0] and m * ncols = arr.shape[1].
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out=out) fails here: the out blocks are non-contiguous
    # views, and np.dot requires a C-contiguous output array.
    out[:] = np.dot(a, b)  # slightly less efficient: the result goes through a temporary array


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are
    performed in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    # random integers, just to benchmark the integer code path
    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
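For reference, the same partition-and-thread idea can be written more compactly with concurrent.futures from the standard library. This is only a sketch of an equivalent variant, not part of the original answer; pardot_pool is an illustrative name, and it assumes the block counts divide the matrix dimensions evenly, like pardot above:

from concurrent.futures import ThreadPoolExecutor

import numpy as np


def pardot_pool(a, b, nblocks, mblocks):
    """Partitioned product a @ b, driven by a thread pool (sketch)."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
    rstep = a.shape[0] // nblocks   # assumes a.shape[0] % nblocks == 0
    cstep = b.shape[1] // mblocks   # assumes b.shape[1] % mblocks == 0

    def block(i, j):
        # Assign into a slice: np.dot's out= argument would require a
        # C-contiguous output, which these block views are not.
        out[i*rstep:(i+1)*rstep, j*cstep:(j+1)*cstep] = np.dot(
            a[i*rstep:(i+1)*rstep], b[:, j*cstep:(j+1)*cstep])

    # np.dot releases the GIL, so the pool's workers run in parallel
    with ThreadPoolExecutor(max_workers=nblocks * mblocks) as pool:
        futures = [pool.submit(block, i, j)
                   for i in range(nblocks) for j in range(mblocks)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return out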
With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:
running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: First, the CPUs have high-throughput floating point pipelines. Second, BLAS has no integer-type.
Workaround: Converting the matrices to float32 gives big speedups. How does a 90x speedup on a 2015 MacBook Pro sound? (Using float64 is half as good.)
import numpy as np
import time


def timeit(func):
    start = time.time()
    func()
    end = time.time()
    return end - start


# np.random.randint replaces the long-deprecated np.random.random_integers
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8))  # ≈0.01 sec
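One caveat to this trick (my addition, not part of the original answer): float32 represents integers exactly only up to 2**24, so the convert-multiply-convert round trip can silently lose precision once the true products or sums exceed that bound. The small int8 values above stay far below it, but the limit is easy to check:

import numpy as np

# 2**24 + 1 = 16777217 is the smallest positive integer float32 cannot represent
x = np.int64(2**24 + 1)
print(np.float32(x) == x)  # False: it rounds to 16777216.0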