
Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency?


Consider the following code:

#include <atomic>
#include <cstring>

std::atomic<int> a;
char b[64];

void seq() {
  /*
    movl    $0, a(%rip)
    mfence
  */
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

void rel() {
  /*
    movl    $0, a(%rip)
  */
  int temp = 0;
  a.store(temp, std::memory_order_relaxed);
}

With respect to the atomic variable "a", seq() and rel() are both ordered and atomic on the x86 architecture because:

  1. mov to or from naturally aligned memory is an atomic instruction
  2. mov is a legacy instruction, and Intel promises ordered memory semantics for legacy instructions to stay compatible with old processors, which always used ordered memory semantics.

No fence is required to store a constant value into an atomic variable. The fences are there because std::memory_order_seq_cst implies that all memory is synchronized, not only the memory that holds the atomic variable.

The effect can be demonstrated by the following set and get functions:

void set(const char *s) {
  strcpy(b, s);
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

const char *get() {
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
  return b;
}

strcpy is a library function that might use newer SSE instructions if they are available at runtime. Since SSE instructions were not available on old processors, there is no backwards-compatibility requirement for them, and their memory ordering can be weaker (notably for non-temporal stores). Thus the result of a strcpy in one thread might not be directly visible in other threads.

The set and get functions above use an atomic value to enforce memory synchronization so that the result of strcpy becomes visible in other threads. Now the fences matter, but where they are placed within the call to atomic::store is not significant, since the fences are not needed internally in atomic::store.
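For illustration, here is a minimal two-thread sketch of this publish pattern (not part of the original answer; the names are invented, and the reader side uses a seq_cst load, the more conventional way to consume the published buffer):

#include <atomic>
#include <cassert>
#include <cstring>
#include <thread>

char buf[64];
std::atomic<bool> ready{false};

void writer() {
  std::strcpy(buf, "hello");                      // plain, non-atomic writes
  ready.store(true, std::memory_order_seq_cst);   // mov + mfence on x86
}

void reader() {
  while (!ready.load(std::memory_order_seq_cst))  // plain mov on x86
    ;                                             // spin until published
  assert(std::strcmp(buf, "hello") == 0);         // guaranteed to hold
}

int main() {
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
}

Dropping the synchronization (for instance by making the flag store memory_order_relaxed) removes the guarantee that the strcpy is visible to the reader before the flag is.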


The only reordering x86 does (for normal memory accesses) is that it can potentially reorder a load with an earlier store to a different location (StoreLoad reordering).

SFENCE guarantees that all stores before the fence become globally visible before all stores after the fence. LFENCE guarantees that all loads before the fence complete before all loads after the fence. For normal memory accesses, these ordering guarantees are already provided by default, so by themselves LFENCE and SFENCE are only useful for the weaker memory access modes of x86, such as write-combining memory and non-temporal stores.
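As a sketch of the case where SFENCE by itself does matter, consider a non-temporal store (illustrative code, assuming SSE2; the names are invented):

#include <atomic>
#include <emmintrin.h>   // _mm_stream_si32 (SSE2 non-temporal store)
#include <xmmintrin.h>   // _mm_sfence

int payload;
std::atomic<int> flag{0};

void publish_nt(int value) {
  _mm_stream_si32(&payload, value);          // weakly ordered non-temporal store
  _mm_sfence();                              // make the NT store globally visible
                                             // before any later store
  flag.store(1, std::memory_order_release);  // ordinary store; without the sfence
                                             // it could become visible first
}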

Neither LFENCE, SFENCE, nor LFENCE + SFENCE prevents a store followed by a load from being reordered. MFENCE does.
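A hypothetical store-buffering (Dekker-style) litmus test makes this concrete; with seq_cst, GCC emits mov + mfence for the stores, which is exactly what forbids the r1 == 0 && r2 == 0 outcome, while an SFENCE after each store would not:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
  x.store(1, std::memory_order_seq_cst);   // mov + mfence
  r1 = y.load(std::memory_order_seq_cst);  // mov
}

void thread2() {
  y.store(1, std::memory_order_seq_cst);   // mov + mfence
  r2 = x.load(std::memory_order_seq_cst);  // mov
}

int main() {
  std::thread a(thread1), b(thread2);
  a.join();
  b.join();
  // Under seq_cst, r1 == 0 && r2 == 0 is impossible.  With relaxed operations
  // (or with only SFENCE between each store and the following load), each CPU
  // could let its load pass the earlier store and both could read 0.
  return 0;
}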

The relevant reference is the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3A, the section on memory ordering).


std::atomic<int>::store is mapped to the compiler intrinsic __atomic_store_n. (This and other atomic-operation intrinsics are documented here: Built-in functions for memory model aware atomic operations.) The _n suffix makes it type-generic; the back-end actually implements variants for specific sizes in bytes. int on x86 is AFAIK always 32 bits long, so that means we're looking for the definition of __atomic_store_4. The internals manual for this version of GCC says that the __atomic_store operations correspond to machine description patterns named atomic_store<mode>; the mode corresponding to a 4-byte integer is "SI" (that's documented here), so we are looking for something called "atomic_storesi" in the x86 machine description. And that brings us to config/i386/sync.md, specifically this bit:

(define_expand "atomic_store<mode>"
  [(set (match_operand:ATOMIC 0 "memory_operand")
        (unspec:ATOMIC [(match_operand:ATOMIC 1 "register_operand")
                        (match_operand:SI 2 "const_int_operand")]
                       UNSPEC_MOVA))]
  ""
{
  enum memmodel model = (enum memmodel) (INTVAL (operands[2]) & MEMMODEL_MASK);

  if (<MODE>mode == DImode && !TARGET_64BIT)
    {
      /* For DImode on 32-bit, we can use the FPU to perform the store.  */
      /* Note that while we could perform a cmpxchg8b loop, that turns
         out to be significantly larger than this plus a barrier.  */
      emit_insn (gen_atomic_storedi_fpu
                 (operands[0], operands[1],
                  assign_386_stack_local (DImode, SLOT_TEMP)));
    }
  else
    {
      /* For seq-cst stores, when we lack MFENCE, use XCHG.  */
      if (model == MEMMODEL_SEQ_CST && !(TARGET_64BIT || TARGET_SSE2))
        {
          emit_insn (gen_atomic_exchange<mode> (gen_reg_rtx (<MODE>mode),
                                                operands[0], operands[1],
                                                operands[2]));
          DONE;
        }

      /* Otherwise use a store.  */
      emit_insn (gen_atomic_store<mode>_1 (operands[0], operands[1],
                                           operands[2]));
    }

  /* ... followed by an MFENCE, if required.  */
  if (model == MEMMODEL_SEQ_CST)
    emit_insn (gen_mem_thread_fence (operands[2]));
  DONE;
})

Without going into a great deal of detail, the bulk of this is a C function body that will be called to generate the low-level "RTL" intermediate representation of the atomic store operation. When it's invoked by your example code, <MODE>mode != DImode, model == MEMMODEL_SEQ_CST, and TARGET_SSE2 is true, so it will call gen_atomic_store<mode>_1 and then gen_mem_thread_fence. The latter function always generates mfence. (There is code in this file to produce sfence, but I believe it is only used for explicitly-coded _mm_sfence (from <xmmintrin.h>).)
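If you want to watch this expansion happen, a small sketch (assuming GCC on x86-64, compiled with -O2) that goes through the same path is:

#include <atomic>

std::atomic<int> a;
int plain;

void via_atomic() {
  a.store(0, std::memory_order_seq_cst);           // expected: movl $0, a(%rip); mfence
}

void via_builtin() {
  __atomic_store_n(&plain, 0, __ATOMIC_SEQ_CST);   // same expansion path:
                                                   // expected: movl $0, plain(%rip); mfence
}

Compiling with g++ -O2 -S should show the mfence emitted by gen_mem_thread_fence after each store; switching to memory_order_release / __ATOMIC_RELEASE should make it disappear.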

The comments suggest that someone thought MFENCE was required in this case. I conclude that either you are mistaken to think the full fence is not needed here (i.e. that SFENCE alone would suffice), or this is a missed-optimization bug in GCC. Either way, it is not an error in how you are using the compiler.