
Multi-threaded integer matrix multiplication in NumPy/SciPy


Note that as this answer ages, NumPy might gain optimized integer support. Please verify whether this approach is still faster on your setup.

  • Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.
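To illustrate the idea before the full implementation: a minimal sketch that partitions only the rows of the left operand and maps the sub-products over a thread pool. The name `pardot_rows` and the use of `concurrent.futures` are my own choices here, not part of the original answer; the threads run concurrently because np.dot releases the GIL.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def pardot_rows(a, b, n_jobs=4):
    """Row-partitioned matrix product computed in a thread pool (sketch)."""
    chunks = np.array_split(a, n_jobs, axis=0)  # slices of a, no copies
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # each chunk.dot(b) releases the GIL, so the workers run in parallel
        parts = list(pool.map(lambda chunk: chunk.dot(b), chunks))
    return np.vstack(parts)
```

This only splits along one axis; the implementation below partitions both output dimensions and writes each block in place instead of stacking the results.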

Implementation:

```python
import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
```
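A quick check (my addition, with `blockshaped` re-declared from the answer so the snippet is self-contained) that `blockshaped` really returns a view: writing into a block mutates the parent array, which is why the worker threads can fill `out` in place without any copying.

```python
import numpy as np


def blockshaped(arr, nrows, ncols):
    # same reshape/swapaxes trick as in the answer: views all the way down
    h, w = arr.shape
    return arr.reshape(nrows, h // nrows, ncols, w // ncols).swapaxes(1, 2)


out = np.zeros((4, 6), dtype=int)
blocks = blockshaped(out, 2, 2)   # shape (2, 2, 2, 3)
blocks[0, 1] = 7                  # write into the top-right 2x3 block
print(out[0, 3])                  # → 7: the parent array reflects the write
```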

With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:

```
running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
```


"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: first, CPUs have high-throughput floating-point pipelines; second, BLAS has no integer type.

Workaround: Converting the matrices to float32 values gives big speedups: about 90x on a 2015 MacBook Pro. (Using float64 is half as good.)

```python
import numpy as np
import time


def timeit(callable):
    start = time.time()
    callable()
    end = time.time()
    return end - start


# np.random.randint(0, 10) draws from [0, 9], replacing the deprecated
# np.random.random_integers(0, 9)
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)
timeit(lambda: a.dot(a))  # ≈0.9 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8))  # ≈0.01 sec
```
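One caveat worth checking before using this workaround (this example is my addition, not from the original answer): float32 has a 24-bit significand, so the round-trip is exact only while every entry of the exact integer result stays below 2**24; larger results silently lose precision.

```python
import numpy as np

# 9 * 9 * 100 = 8100 < 2**24, so every float32 product and sum is exact
small = np.full((100, 100), 9, dtype=np.int64)
exact = small.astype(np.float32).dot(small.astype(np.float32)).astype(np.int64)
assert (exact == small.dot(small)).all()

# 4999 * 4999 * 100 ≈ 2.5e9 >> 2**24: individual products already exceed
# float32's exact integer range, so the result degrades
big = np.full((100, 100), 4999, dtype=np.int64)
lossy = big.astype(np.float32).dot(big.astype(np.float32)).astype(np.int64)
print((lossy == big.dot(big)).all())  # → False: precision lost
```

With the int8 inputs in the snippet above (values 0 to 9 and a 1000-long inner dimension) the results top out at 9 * 9 * 1000 = 81000, well inside the exact range.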