
Is memory reordering visible to other threads on a uniprocessor?


As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
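
For reference, here is a minimal sketch of the kind of program being discussed, assuming (as the answers below do) that thread 1 writes x and then a flag f, while thread 2 spins on f and then prints x; the exact code from the question is not reproduced here:

    #include <cstdio>
    #include <thread>

    // Plain, non-atomic globals: unsynchronized concurrent access is a data race.
    int x = 0;
    int f = 0;

    void writer() {              // "thread 1"
        x = 42;
        f = 1;
    }

    void reader() {              // "thread 2"
        while (f == 0) { }       // spin until the flag appears set (may never exit)
        std::printf("%d\n", x);  // hoped-for output: 42
    }

    int main() {
        std::thread t2(reader);
        std::thread t1(writer);
        t1.join();
        t2.join();
    }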

That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.

[I'll assume that by "uniprocessor" machine you mean a processor with a single core and a single hardware thread.]

You now say that you want to assume that compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.

So essentially thread 1 has two store instructions in the order store-to-x, store-to-f, while thread 2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but it involves a load-from-f) and then a load-from-x.

On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.

One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.

The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that saved the entire state of its current execution pipeline upon an interrupt and restored it upon return from the interrupt could produce a different result, but such a machine is impractical and as far as I know does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon an interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.

And there is no memory model issue to worry about. The weaker memory models originate from the per-processor buffers and caches that sit between the individual hardware processors and the main memory or nth-level cache they actually share. A single processor has no such partitioned resources and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.


A strong memory ordering executes memory access instructions in the exact same order as defined in the program; this is often referred to as "program ordering".

A weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".

AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2, Memory Ordering):

Writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH, and string operations.

To illustrate the rule, it gives an example similar to this:

memory locations x and y, initialized to 0;

thread 1:

    mov [x], 1
    mov [y], 1

thread 2:

    mov r1, [y]
    mov r2, [x]

r1 == 1 and r2 == 0 is not allowed

In your example, thread 1 cannot store f before storing x.
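
For comparison, the same litmus test can be expressed portably in C++ with release/acquire atomics (a sketch, not part of the original question; with plain non-atomic variables the program would simply have undefined behavior), and the outcome r1 == 1 with r2 == 0 is likewise forbidden:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_release);   // "publishes" the earlier store to x
    }

    void thread2() {
        r1 = y.load(std::memory_order_acquire);  // if this observes 1...
        r2 = x.load(std::memory_order_relaxed);  // ...this is guaranteed to observe 1
    }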

@Eric, in response to your comments:

The fast string store instruction "stosd" may store the string out of order within its own operation. In a multiprocessor environment, when one processor stores a string "str", another processor may observe str[1] being written before str[0], even though the logical order is presumed to be writing str[0] before str[1].

But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (which doesn't necessarily mean the whole stosd instruction) commit before the context switch.

Edited to address the claims made as if this were a C++ question:

Even if this is considered in the context of C++, as I understand it, a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.

§1.9/14: Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
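
On that reading, the two statements in thread 1 are separate full-expressions to which the quoted rule applies (whether the as-if rule nevertheless lets the compiler reorder them when the difference is unobservable is exactly what the other answers here claim):

    // Thread 1, read as two full-expressions:
    x = 42;   // full-expression: side effect on x
    f = 1;    // next full-expression: sequenced after the one above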


This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.

Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:

  • give the compiler some reason to believe f may change (e.g., by passing its address to some non-inlineable function which could modify it),
  • mark it volatile, or
  • make it an explicitly atomic type and request acquire semantics (see the sketch after this list).
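
A minimal sketch of the last option, assuming the same x and f names as above: making f a std::atomic with release/acquire ordering means the loop cannot be optimized away and the program is guaranteed to print 42:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int x = 0;
    std::atomic<int> f{0};       // explicitly atomic flag

    void writer() {
        x = 42;
        f.store(1, std::memory_order_release);              // publishes the store to x
    }

    void reader() {
        while (f.load(std::memory_order_acquire) == 0) { }  // cannot be optimized away
        std::printf("%d\n", x);                              // guaranteed to print 42
    }

    int main() {
        std::thread t2(reader);
        std::thread t1(writer);
        t1.join();
        t2.join();
    }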

On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.

Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:

  • the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
  • or the load can peek into the queue and fetch x's value without waiting for the write

Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.


The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.