How to get faster code than numpy.dot for matrix multiplication?
`np.dot` dispatches to BLAS when

- NumPy has been compiled to use BLAS,
- a BLAS implementation is available at run-time,
- your data has one of the dtypes `float32`, `float64`, `complex64` or `complex128`, and
- the data is suitably aligned in memory.

Otherwise, it defaults to its own, slow, matrix multiplication routine.
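You can see the fallback in action by comparing a BLAS-eligible dtype against one that isn't. This is a rough sketch (timings will vary by machine and BLAS library); `object` dtype is used here only because it forces the slow generic routine:

```python
import time
import numpy as np

n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)

# float64 arrays: np.dot dispatches to BLAS (assuming a BLAS-linked NumPy)
t0 = time.perf_counter()
c_fast = np.dot(a, b)
t_blas = time.perf_counter() - t0

# object dtype is outside the supported list, so np.dot falls back
# to the slow generic routine
a_obj = a.astype(object)
b_obj = b.astype(object)
t0 = time.perf_counter()
c_slow = np.dot(a_obj, b_obj)
t_fallback = time.perf_counter() - t0

print(f"float64 (BLAS): {t_blas:.4f}s, object (fallback): {t_fallback:.4f}s")
```

The results are numerically the same; only the code path (and hence the speed) differs.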
Checking your BLAS linkage is described here. In short, check whether there's a file `_dotblas.so` or similar in your NumPy installation. If there is, check which BLAS library it's linked against; the reference BLAS is slow, ATLAS is fast, and OpenBLAS and vendor-specific versions such as Intel MKL are even faster. Watch out for multithreaded BLAS implementations: they don't play nicely with Python's `multiprocessing` module.
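On recent NumPy versions you don't need to hunt for `_dotblas.so` by hand; `np.show_config()` prints the build configuration, including which BLAS/LAPACK libraries NumPy was linked against:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against;
# look for entries such as 'openblas', 'mkl', or 'atlas'.
np.show_config()
```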
Next, check your data alignment by inspecting the `flags` attribute of your arrays. In versions of NumPy before 1.7.2, both arguments to `np.dot` should be C-ordered. In NumPy >= 1.7.2 this matters much less, as special cases for Fortran-ordered arrays have been introduced.
```
>>> X = np.random.randn(10, 4)
>>> Y = np.random.randn(7, 4).T
>>> X.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> Y.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
```
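If you're stuck on an older NumPy and one of your operands is Fortran-ordered (as `Y` above is, being a transposed view), one simple fix is to force a C-ordered copy yourself. `np.ascontiguousarray` returns the input unchanged when it is already C-ordered, so this is cheap in the common case:

```python
import numpy as np

X = np.random.randn(10, 4)
Y = np.random.randn(7, 4).T  # transposed view: Fortran-ordered

# Force C order; this copies only if Y is not already C-contiguous.
Yc = np.ascontiguousarray(Y)

print(Y.flags['C_CONTIGUOUS'])   # False
print(Yc.flags['C_CONTIGUOUS'])  # True
```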
If your NumPy is not linked against BLAS, either (easy) re-install it, or (hard) use the BLAS `gemm` (general matrix multiply) function from SciPy:
```
>>> from scipy.linalg import get_blas_funcs
>>> gemm = get_blas_funcs("gemm", [X, Y])
>>> np.all(gemm(1, X, Y) == np.dot(X, Y))
True
```
This looks easy, but it does hardly any error checking, so you must really know what you're doing.
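One reason to call `gemm` directly despite that caveat is its `trans_a`/`trans_b` flags, which let BLAS operate on a transposed operand without you materializing the transpose first. A small sketch:

```python
import numpy as np
from scipy.linalg import get_blas_funcs

A = np.random.randn(4, 10)  # we want A.T @ B
B = np.random.randn(4, 7)

# get_blas_funcs picks the right precision (here dgemm for float64)
gemm = get_blas_funcs("gemm", [A, B])

# trans_a=True tells BLAS to use A transposed internally,
# avoiding an explicit A.T copy in Python
C = gemm(1.0, A, B, trans_a=True)

print(np.allclose(C, np.dot(A.T, B)))  # True
```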