CPU Relax instruction and C++11 primitives
The PAUSE instruction is x86 specific. Its sole use is in spin-lock wait loops, where it:
Improves the performance of spin-wait loops. When executing a “spin-wait loop,” processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop.
Also:
Inserting a pause instruction in a spinwait loop greatly reduces the processor’s power consumption.
Where you put this instruction in a spin-lock loop is also x86_64 specific. I cannot speak for the C++11 standards folk, but I think it is reasonable for them to conclude that the right place for this magic is in the relevant library... along with all the other magic required to implement atomics, mutexes etc.
NB: PAUSE does not release the processor to allow another thread to run. It is not a "low-level" pthread_yield(). (Although on Intel Hyperthreaded cores, it does prevent the spin-lock thread from hogging the core.) The essential function of PAUSE appears to be to turn off the usual instruction execution optimisations and pipelining, which slows the thread down (a bit). Once the waiter has discovered that the lock is busy, this reduces the rate at which it touches the lock variable, so the cache system is not being pounded by the waiter while the current owner of the lock is trying to get on with real work.
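To make the difference from a yield concrete, here is a minimal sketch of a spin-wait that issues PAUSE on every iteration. The names cpu_relax(), spin_until_set() and wait_demo() are mine, purely for illustration. I am assuming a compiler that provides the _mm_pause() intrinsic via <immintrin.h> on x86; on other targets the helper falls back to doing nothing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
// Hypothetical helper: emits PAUSE. The thread stays on the core throughout.
inline void cpu_relax() { _mm_pause(); }
#else
// Assumption: no equivalent hint is available here, so do nothing.
inline void cpu_relax() {}
#endif

// Spin until `ready` becomes true, hinting the CPU on every iteration.
// Returns the number of iterations spent spinning.
inline unsigned long spin_until_set(const std::atomic<bool>& ready) {
    unsigned long spins = 0;
    while (!ready.load(std::memory_order_acquire)) {
        cpu_relax();  // by contrast, std::this_thread::yield() would involve the OS scheduler
        ++spins;
    }
    return spins;
}

// Demo: a second thread sets the flag; the caller spins until it sees it.
inline bool wait_demo() {
    std::atomic<bool> ready{false};
    std::thread setter([&] { ready.store(true, std::memory_order_release); });
    spin_until_set(ready);
    setter.join();
    assert(ready.load());
    return ready.load();
}
```

Swapping cpu_relax() for std::this_thread::yield() would still be correct, but it changes what the waiter does: the former is only a hint to the core, the latter asks the scheduler to run something else.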
Note that the primitives being used to "hand roll" spin-locks, mutexes etc. are not OS specific, but processor-specific.
I'm not sure I would describe a "hand rolled" spin-lock as "lockless"!
FWIW, the Intel recommendation for a spin-lock ("Intel® 64 and IA-32 Architectures Optimization Reference Manual") is:
    Spin_Lock:
        CMP lockvar, 0      // Check if lock is free.
        JE Get_Lock
        PAUSE               // Short delay.
        JMP Spin_Lock
    Get_Lock:
        MOV EAX, 1
        XCHG EAX, lockvar   // Try to get lock.
        CMP EAX, 0          // Test if successful.
        JNE Spin_Lock
Clearly one can write something which compiles to this, using a std::atomic_flag ... or use pthread_spin_lock(), which on my machine is:
    pthread_spin_lock:
        lock decl (%rdi)
        jne wait
        xor %eax, %eax
        ret
    wait:
        pause
        cmpl $0, (%rdi)
        jg pthread_spin_lock
        jmp wait
which is hard to fault, really.
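For the C++ route, here is a rough sketch of a lock shaped like the Intel sequence above: a read-only check with PAUSE while the lock is held, then an exchange to try to take it. One caveat: in C++11, std::atomic_flag only offers test_and_set() and clear(), so for the read-only check I have used std::atomic<bool> instead (C++20 adds atomic_flag::test()). The names SpinLock, cpu_relax() and run_counter_demo() are hypothetical, and I am assuming _mm_pause() is available via <immintrin.h> on x86. Whether a given compiler emits exactly the Intel sequence is not something I have verified.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // hypothetical helper: emits PAUSE
#else
inline void cpu_relax() {}                // assumption: no hint available, do nothing
#endif

// Hypothetical test-and-test-and-set lock following the Intel sequence.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Spin_Lock: read-only check, PAUSE while the lock is held.
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
            // Get_Lock: exchange; a previous value of false means we own it.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Demo: `threads` threads each add `iters` to a plain counter under the lock.
inline long run_counter_demo(int threads, long iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the spin-lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the lock works, run_counter_demo(4, 10000) returns 40000. The inner load-only loop is the point of the exercise: waiters read the lock variable rather than hammering it with exchanges, and only attempt the exchange once it looks free.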