
CPU Relax instruction and C++11 primitives


The PAUSE instruction is x86-specific. Its sole use is in spin-lock wait loops, where it:

Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop.

Also:

Inserting a pause instruction in a spinwait loop greatly reduces the processor’s power consumption.

Where you put this instruction in a spin-lock loop is also x86_64-specific. I cannot speak for the C++11 standards folk, but I think it is reasonable for them to conclude that the right place for this magic is in the relevant library, along with all the other magic required to implement atomics, mutexes, etc.

NB: PAUSE does not release the processor to allow another thread to run. It is not a "low-level" pthread_yield(). (Although on Intel Hyper-Threaded cores, it does prevent the spin-lock thread from hogging the core.) The essential function of PAUSE appears to be to turn off the usual instruction execution optimisations and pipelining, which slows the thread down (a bit). Once the thread has discovered that the lock is busy, this reduces the rate at which the lock variable is touched, so the cache system is not being pounded by the waiter while the current owner of the lock is trying to get on with real work.

Note that the primitives being used to "hand roll" spin-locks, mutexes, etc. are not OS-specific, but processor-specific.
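To illustrate the processor-specific point, here is a minimal sketch of how a "relax" hint might be wrapped behind one function. The names `cpu_relax` and `spin_until_set` are hypothetical, the AArch64 branch assumes a GCC- or Clang-style inline assembler and treats YIELD as only roughly analogous to PAUSE, and any other target falls back to a plain busy-wait.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  // x86: compilers expose PAUSE as the _mm_pause() intrinsic.
  inline void cpu_relax() { _mm_pause(); }
#elif defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
  // AArch64: YIELD is used here as the nearest hint (an assumption, not an exact equivalent).
  inline void cpu_relax() { __asm__ __volatile__("yield" ::: "memory"); }
#else
  // Unknown target: no hint available, so this is an ordinary busy-wait.
  inline void cpu_relax() {}
#endif

// Hypothetical helper: spin until `flag` is true or `max_spins` hints have
// been issued. Returns the number of hints issued.
inline int spin_until_set(const std::atomic<bool>& flag, int max_spins) {
    int spins = 0;
    while (!flag.load(std::memory_order_acquire) && spins < max_spins) {
        cpu_relax();  // hint only: this does not yield the core to the OS scheduler
        ++spins;
    }
    return spins;
}
```

Nothing in the wrapper involves the operating system; only the target architecture decides which branch is compiled.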

I'm not sure I would describe a "hand-rolled" spin-lock as "lockless"!

FWIW, the Intel recommendation for a spin-lock ("Intel® 64 and IA-32 Architectures Optimization Reference Manual") is:

  Spin_Lock:
    CMP   lockvar, 0     // Check if lock is free.
    JE    Get_Lock
    PAUSE                // Short delay.
    JMP   Spin_Lock
  Get_Lock:
    MOV   EAX, 1
    XCHG  EAX, lockvar   // Try to get lock.
    CMP   EAX, 0         // Test if successful.
    JNE   Spin_Lock

Clearly one can write something which compiles to this, using a std::atomic_flag... or use pthread_spin_lock(), which on my machine is:

  pthread_spin_lock:
    lock decl (%rdi)
    jne    wait
    xor    %eax, %eax
    ret
  wait:
    pause
    cmpl   $0, (%rdi)
    jg     pthread_spin_lock
    jmp    wait

which is hard to fault, really.
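For comparison, a C++11 rendering of the Intel sequence might look like the sketch below. It is an illustration under stated assumptions, not a tuned implementation. The `SpinLock` class, `cpu_relax` and `guarded_increment` are hypothetical names, and the memory orders are my own choice. It uses `std::atomic<int>` rather than `std::atomic_flag`, because the C++11 `std::atomic_flag` has no read-only test (`test()` only arrived in C++20), and the read-only check is what keeps the waiter from hammering the cache line with writes.

```cpp
#include <atomic>
#include <mutex>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  inline void cpu_relax() { _mm_pause(); }  // PAUSE on x86
#else
  inline void cpu_relax() {}                // assumption: no hint on other targets
#endif

class SpinLock {
public:
    void lock() {
        for (;;) {
            // Spin_Lock: read-only check, with PAUSE while the lock is busy.
            while (lockvar_.load(std::memory_order_relaxed) != 0)
                cpu_relax();
            // Get_Lock: atomic exchange (XCHG on x86); 0 means we took the lock.
            if (lockvar_.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }

    // Single attempt, never spins.
    bool try_lock() {
        return lockvar_.load(std::memory_order_relaxed) == 0 &&
               lockvar_.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() { lockvar_.store(0, std::memory_order_release); }

private:
    std::atomic<int> lockvar_{0};  // 0 = free, 1 = held
};

// Usage: SpinLock provides lock()/unlock(), so std::lock_guard can manage it.
inline int guarded_increment(SpinLock& lock, int& counter) {
    std::lock_guard<SpinLock> guard(lock);
    return ++counter;
}
```

Whether a given compiler emits exactly the Intel sequence for this is worth checking in the generated assembly rather than assuming.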