Why is it faster to perform float by float matrix multiplication compared to int by int?
All those vector-vector and matrix-vector operations use BLAS internally. BLAS, which has been optimized over decades for different architectures, CPUs, instruction sets, and cache sizes, has no integer type!
Here is a branch of OpenBLAS working on it (and a small discussion on Google Groups linking to it).
And I think I heard that Intel's MKL (Intel's BLAS implementation) might be working on integer types too. This talk (mentioned in that forum) looks interesting, although it is short and probably deals more with the small integral types that are useful in embedded deep learning.
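To see the difference directly, here is a minimal benchmark sketch of my own (it is not taken from those discussions; it uses Eigen, like the example further below, and caps the integer entries with % 100 to keep the integer products from overflowing):

#include <Eigen/Core>
#include <chrono>
#include <iostream>

int main()
{
    const int n = 1024;

    // Small integer entries so the dot products cannot overflow.
    Eigen::MatrixXi Ai = Eigen::MatrixXi::Random(n, n).unaryExpr([](int x) { return x % 100; });
    Eigen::MatrixXi Bi = Eigen::MatrixXi::Random(n, n).unaryExpr([](int x) { return x % 100; });
    Eigen::MatrixXf Af = Eigen::MatrixXf::Random(n, n);
    Eigen::MatrixXf Bf = Eigen::MatrixXf::Random(n, n);

    auto t0 = std::chrono::steady_clock::now();
    Eigen::MatrixXi Ci = Ai * Bi;   // forces evaluation of the integer product
    std::chrono::duration<double> ti = std::chrono::steady_clock::now() - t0;

    t0 = std::chrono::steady_clock::now();
    Eigen::MatrixXf Cf = Af * Bf;   // forces evaluation of the float product
    std::chrono::duration<double> tf = std::chrono::steady_clock::now() - t0;

    // Print one entry of each result so the products are not optimized away.
    std::cout << "int:   " << ti.count() << " s (" << Ci(0, 0) << ")\n";
    std::cout << "float: " << tf.count() << " s (" << Cf(0, 0) << ")\n";
}

On a typical AVX2 machine the float product should come out clearly faster, for the instruction-level reasons shown next.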
If you compile these two simple functions, which essentially just calculate a product (using the Eigen library),
#include <Eigen/Core>

int mult_int(const Eigen::MatrixXi& A, Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0, 0);
}

int mult_float(const Eigen::MatrixXf& A, Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0, 0);
}
using the flags -mavx2 -S -O3
you will see very similar assembly code for the integer and the float version. The main difference, however, is that vpmulld has 2-3 times the latency and only 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).
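For reference, here is a small sketch of my own (the function names are mine) showing the intrinsics that compile to exactly those two instructions when built with -mavx2:

#include <immintrin.h>

// 8 x 32-bit integer multiplies; compiles to vpmulld
// (higher latency, lower throughput).
__m256i mul_int8(__m256i a, __m256i b)
{
    return _mm256_mullo_epi32(a, b);
}

// 8 x 32-bit float multiplies; compiles to vmulps
// (lower latency, higher throughput).
__m256 mul_float8(__m256 a, __m256 b)
{
    return _mm256_mul_ps(a, b);
}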
Reference: the Intel Intrinsics Guide. "Throughput" here means the reciprocal throughput, i.e., how many clock cycles are used per operation when there are no latency stalls (somewhat simplified).
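As a worked illustration (the concrete numbers are my own, drawn from public instruction tables, and vary by microarchitecture): if vmulps has a reciprocal throughput of 0.5 cycles, the core can sustain 2 float multiplies per cycle; if vpmulld has a reciprocal throughput of 1 or 2 cycles, it sustains only 1 or 0.5 integer multiplies per cycle, which is exactly the 1/2 or 1/4 ratio quoted above.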