Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate] Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate] c c

Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]


The first method is slightly better, as the cells being assigned to lays next to each other.

First method:

[ ][ ][ ][ ][ ] ....^1st assignment   ^2nd assignment[ ][ ][ ][ ][ ] ....^101st assignment

Second method:

[ ][ ][ ][ ][ ] ....^1st assignment   ^101st assignment[ ][ ][ ][ ][ ] ....^2nd assignment


  1. For array[100][100] - they are both the same, if the L1 cache is larger then 100*100*sizeof(int) == 10000*sizeof(int) == [usually] 40000. Note in Sandy Bridge - 100*100 integers should be enough elements to see a difference, since the L1 cache is only 32k.

  2. Compilers will probably optimize this code all the same

  3. Assuming no compiler optimizations, and matrix does not fit in L1 cache - the first code is better due to cache performance [usually]. Every time an element is not found in cache - you get a cache miss - and need to go to the RAM or L2 cache [which are much slower]. Taking elements from RAM to cache [cache fill] is done in blocks [usually 8/16 bytes] - so in the first code, you get at most miss rate of 1/4 [assuming 16 bytes cache block, 4 bytes ints] while in the second code it is unbounded, and can be even 1. In the second code snap - elements that were already in cache [inserted in the cache fill for the adjacent elements] - were taken out, and you get a redundant cache miss.

    • This is closely related to the principle of locality, which is the general assumption used when implementing the cache system. The first code follows this principle while the second doesn't - so cache performance of the first will be better of those of the second.

Conclusion:For all cache implementations I am aware of - the first will be not worse then the second. They might be the same - if there is no cache at all or all the array fits in cache completely - or due to compiler optimization.


This sort of micro-optimization is platform-dependent so you'll need to profile the code in order to be able to draw a reasonable conclusion.