How does memory reordering help processors and compilers? How does memory reordering help processors and compilers? multithreading multithreading

How does memory reordering help processors and compilers?


TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring it to preserve all behaviour of the original source, only the result of the single thread itself.

Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.


(See part-way down for a section on why it helps the compiler)

Why it helps hardware

Hardware reordering earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below).

Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example being tag:x86, where every store is a release-store, every load is an acquire-load. See the tag wiki for more details.)

Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P


For Java, the issue is that, architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS. (But not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.

So Java's architect(s) didn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.

A strong memory model for Java would make efficient JVMs on ARM (and other ISAs) impossible. Proving that barriers aren't needed is near-impossible, requiring AI levels of global program-understanding. (This goes WAY beyond what normal optimizers do).


Why it helps compilers

(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)

Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.

e.g. in a case like Davide's example:

c.a = 1;c.b = 1;c.a++;c.b++;// same observable effects as the much simplerc.a = 2;c.b = 2;

There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;, either at Java-compile time or when the bytecode is JIT-compiled to machine code.

It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4 could only happen if the compiler could prove that no other thread could observe the difference.

C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int> to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.


More technical hardware explanations:

Why StoreLoad reordering helps performance:

Once a store is committed into cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So it can't happen until it's known for certain that the store won't fault, and neither will any instruction before it. and the store's data is ready. And that there wasn't a branch-mispredict at some point earlier, etc. etc. i.e. we need to rule out all cases of mis-speculation before we can retire a store instruction.

Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, having committed the data to cache) before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is when it's globally visible to other threads.)

Since you can't know what's happening on other cores, I don't think hardware could hide this delay in starting loads by speculating that it isn't a problem, and then detecting mis-speculation after the fact. (And treat it like a branch mispredict: throw away all work done that depended on that load, and re-issue it.) A core might be able to allow speculative early loads from cache lines that were in Exclusive or Modified state, since they can't be present in other cores. (Detecting mis-speculation if a cache-coherency request for that cache line came in from another CPU before retiring the last store before the speculative load.) Anyway, this is obviously a large amount of complexity that isn't needed for anything else.

Note that I haven't even mentioned cache-misses for stores. That increases the latency of a store from a few cycles to hundreds of cycles.


How actual CPUs work (when StoreLoad reordering is allowed):

I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.

CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when the store instruction retires, but loads become globally visible as soon as they execute. (And can use values prefetched into the cache well ahead of that).

See also this answer I wrote explaining store buffers and how they decouple execution from cache-miss store commit, and allow speculative execution of stores. Also wikipedia's article on the classic RISC pipeline has some stuff for simpler CPUs. A store-buffer inherently creates StoreLoad reordering (and also store-forwarding so a core can see its own stores before they become globally visible, assuming the core can do store forwarding instead of stalling.)

So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order to support precise exceptions, there doesn't appear to be much benefit at all to having the hardware enforce StoreStore ordering.

Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.


One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.

But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does.

His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)


Imagine to have the following code:

a = 1;b = 1;a = a + 1;   // Not present in the registerb = b + 1;   // Not present in the registera = a + 1;   // Not present in the registerb = b + 1;   // Not present in the register// Here both a and b has value 3

A possible optimization using memory reorder is

a = 1;a = a + 1;   // Already in the registera = a + 1;   // Already in the registerb = 1;b = b + 1;   // Already in the registerb = b + 1;   // Already in the register// Here both a and b has value 3

The performance is better because the data are presents in the register.

Note that there are many different levels of optimization, but this will give you an idea why reordering can improve performances.


On a modern processor chip, the processor can typically perform register to register operations an order of magnitude (or more) faster that fetching from main memory. Operations that hit the L1 or L2 caches are faster than main memory, slower than register to register. The other thing to note is that modern processors chips typically use a pipeline which allows different parts of different instructions to be executed at the same time.

With this in mind, reordering of operations is typically done to avoid situations where the pipeline (fast) has to wait for an operation on main memory (slow) to complete:

  • Davide's example illustrates reordering that avoids memory reads and writes entirely. (At least, that is his intention. In reality, the reordering is done at the native instruction level, not the source code or bytecode level.)

  • In other cases, you might find that the instructions to do a = a + 1 and b = b + 1 get interleaved; e.g.

    1) load a -> r12) load b -> r23) r1 + 1 -> r34) r2 + 1 -> r45) save r3 -> a6) save r4 -> b

    In a pipeline architecture, this might allow 2) and 3) to happen at the same time, 4) and 5) to happen at the same time and so on.

The final thing to note is that a modern processor chip / instruction set avoids reading from main memory and writing to main memory as much as possible. Indeed, it is common for a write instruction to write into L1 or L2 cache, and delay the (slow) write to main memory until the cache-line is flushed. This leads to a different kind of "memory anomaly" ... where a separate thread running on a different core does not see memory updates because the respective writes have not (yet) been flushed.

The Java Memory Model is designed to allow the compiler / processor to optimize performance of a multi-threaded application, as above. It makes it clear when one thread is guaranteed to see memory changes made by another thread. The compiler / processor are allowed to reorder, etc in cases where are no visibility guarantees. This reordering can make a big difference in overall performance.