Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate] Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate] c c

Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate]


I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.

Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.

Most programs are written in a way where if a single byte in memory is read into memory, the program later reads multiple different values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.

Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.

Given this, let's look at your code. The first version looks like this:

  for (int i = 0; i < n; i++)      for (int j = 0; j < n; j++)          for (int k = 0; k < n; k++)              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, let's look at that innermost line of code. On each iteration, the value of k is changing increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason for this is that because the matrix is stored in row-major order, each time you increment k, you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j are the same), nor will you probably miss a[i][k], because the values are in row-major order and if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.

But consider this second version:

  for (int i = 0; i < n; i++)      for (int k = 0; k < n; k++)          for (int j = 0; j < n; j++)              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.

Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.

Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.

Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.

Hope this helps!


This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.

The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.


Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.

for (int k = 0; k < n; k++)   c[i][j] = c[i][j] + a[i][k]*b[k][j];

You can see in this inner loop i and j do not change. This means it can be rewritten as

for (int k = 0; k < n; k+=4) {   int * aik = &a[i][k];   c[i][j] +=         + aik[0]*b[k][j]         + aik[1]*b[k+1][j]         + aik[2]*b[k+2][j]         + aik[3]*b[k+3][j];}

You can see there will be

  • four times fewer loops and accesses to c[i][j]
  • a[i][k] is being accessed continuously in memory
  • the memory accesses and multiplies can be pipelined (almost concurrently) in the CPU.

What if n is not a multiple of 4 or 6 or 8? (or whatever the compiler decides to unroll it to) The compiler handles this tidy up for you. ;)

To speed up this solution faster, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to b-transposed are also continuous in memory. (As you are swapping [k] with [j])

Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.

Lastly you might consider using float or double You might think int would be faster, however that is not always the case as floating point operations can be more heavily optimised (both in hardware and the compiler)

The second example has c[i][j] is changing on each iteration which makes it harder to optimise.