Implementing a ticket lock with atomics generates extra mov

First of all, I think you need to use std::memory_order_acquire, since you're acquiring a lock. If you use mo_relaxed, you could potentially see stale data from before some of the stores that the previous lock-holder did. Jeff Preshing's blog is excellent, and he has a post on release/acquire semantics.

On x86, this can only happen if the compiler re-orders loads and stores, which mo_relaxed tells it it's allowed to. An acquire load compiles the same as a relaxed load on x86, but without reordering. Every x86 asm load is already an acquire. On weakly-ordered architectures that need it, you'll get whatever instructions are needed for a load-acquire. (And on x86, you'll just stop the compiler from reordering).
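
To make that concrete, here is a minimal sketch of a ticket lock that uses an acquire load in the spin and a release store in unlock. The member names and the 16-bit counter width are my own assumptions for illustration, not necessarily the layout in the question.

    #include <atomic>
    #include <cstdint>

    struct ticket_lock {
        std::atomic<std::uint16_t> next_ticket{0};
        std::atomic<std::uint16_t> now_serving{0};

        void lock() {
            // Taking a ticket doesn't need to order anything by itself;
            // the acquire load below does that once our number comes up.
            std::uint16_t my = next_ticket.fetch_add(1, std::memory_order_relaxed);
            while (now_serving.load(std::memory_order_acquire) != my) {
                // spin (busy-wait)
            }
        }

        void unlock() {
            // Only the lock owner writes now_serving, so a relaxed load plus a
            // release store is enough to publish the critical-section stores.
            now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                              std::memory_order_release);
        }
    };

With memory_order_relaxed on the spinning load instead, the compiler (and a weakly-ordered CPU) would be allowed to move loads of the protected data ahead of the loop, which is exactly the stale-data hazard described above.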


I put a version of the code on godbolt to look at the asm with various compilers.


Well spotted, this does look like a gcc optimization failure, still present at least in gcc 6.0 (checked with Wandbox, using a main that does return execlp("objdump", "objdump", "-Mintel", "-d", argv[0], NULL); to dump disassembly of itself, including the functions we're interested in).
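
For reference, here is a sketch of that Wandbox trick, assuming the ticket-lock functions of interest are defined in the same translation unit so they show up in the dump:

    #include <unistd.h>

    // ... lock / unlock functions under test defined above ...

    int main(int, char **argv) {
        // Replace this process with objdump disassembling our own binary,
        // so the compiler's output for the functions above gets printed.
        return execlp("objdump", "objdump", "-Mintel", "-d", argv[0], (char *)NULL);
    }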

It looks like clang 3.7 does even worse with this. It does a 16bit load, then zero-extends, then compares.

gcc treats atomic loads specially, and apparently fails to notice it can fold it into the compare. Probably the optimization pass that could have done that happened while the atomic load was still represented differently from regular loads, or something. I'm not a gcc hacker, so this is mostly guesswork.

I suspect either you have an old gcc (4.9.2 or older), or you're building on / for AMD, because your compiler used rep ret even with -march=native. You should do something about that if you care about generating optimal code. I've noticed gcc5 making better code than gcc 4.9 sometimes. (not that it helps in this case, though :/)


I tried using uint32_t, with no luck.

The performance impact of doing the load and compare separately is probably irrelevant since this function is a busy-wait loop.

The fast path (the unlocked case, where the loop condition is false on the first iteration) is still only one taken branch, and a ret. However, in the std::atomic version, the fast path goes through the loop branch. So instead of two separate branch-predictor entries (one for the fast path and one for the spin-loop), spinning will now probably cause a branch mispredict in the next unlocked case. This is probably not an issue, and the new code does take one fewer branch-predictor entry.
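
As a source-level illustration of that branch structure (reusing the ticket_lock sketch from above; this is not the code from the question), the "two separate branch-predictor entries" shape corresponds to peeling the first check out of the spin loop:

    void lock_peeled(ticket_lock &t) {
        std::uint16_t my = t.next_ticket.fetch_add(1, std::memory_order_relaxed);
        // Uncontended fast path: a single branch, then return.
        if (t.now_serving.load(std::memory_order_acquire) == my)
            return;
        // Contended case: a separate loop branch for the spin.
        while (t.now_serving.load(std::memory_order_acquire) != my) {
            // spin
        }
    }

Whether the compiler actually lays the loop out that way is its own decision, of course; the point is just how many distinct branches the fast path has to go through.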

IDK if jumping into the middle of a loop has any ill effects on the uop cache of Intel SnB-family microarchitectures; it is something of a trace cache. Agner Fog's testing found that the same piece of code can have multiple entries in the uop cache if it has multiple jump entry points. This function is already somewhat uop-cache unfriendly, since it starts with a mov r, imm / lock xadd. The lock xadd has to go in a uop cache line by itself, because it's microcoded (more than 4 uops; 9, in fact). An unconditional jump always ends a uop cache line. I'm not sure about a taken conditional branch, but I'd guess a taken jcc ends a cache line if it was predicted taken when it was decoded (e.g. branch-predictor entry still good, but old uop cache entry evicted).

So the first version is potentially 3 uop cache lines for the fast-path: one mov (and if inlined, hopefully mostly full with previous instructions), one lock xadd alone, and one macro-fused cmp/je to the following code (if inlined; if not, the jump targets the ret, which may end up as a 4th cache line for this 32-byte code block, which isn't allowed. So the non-inlined version of this may always have to get re-decoded every time?)

The std::atomic version is again one uop-cache line for the initial mov imm (and preceding instructions), then lock xadd, then add / jmp, then ... uh oh, 4th cache-line needed for the movzx / compare-and-branch uops. So this version is more likely to have a decode bottleneck even when inlined.

Fortunately, the frontend can still gain some ground and get instructions queued up for the OOO core while running this code, because lock xadd is 9 uops. That's enough to cover a cycle or two of fewer uops from the frontend, and a switch between decoding and uop-cache fetching.

The main problem here is just code-size, since you probably want this inlined. Speed-wise, the fast-path is only slightly worse, and the non-fast path is a spin-loop anyway so it doesn't matter.

The fast-path is 11 fused-domain uops for the old version (1 mov imm, 9 lock xadd, 1 cmp/je macro fused). The cmp/je includes a micro-fused memory operand.

The fast-path is 14 fused-domain uops for the new version (1 mov imm, 9 lock xadd, 1 add, 1 jmp, 1 movzx, 1 cmp/je macro fused).

Using an add instead of just an 8-bit offset in the addressing mode of movzx is really shooting itself in the foot, IMO. IDK if gcc thinks ahead far enough to make choices like that so the loop branch target comes out at a 16B boundary, or if that was just dumb luck.


Compiler-identification experiment on godbolt using the OP's code:

  • gcc 4.8 and earlier: always uses rep ret when the ret is a branch target, even with -march=native -mtune=core2 (on a Haswell), or with just -march=core2.
  • gcc 4.9: uses rep ret with -march=native on Haswell, probably because Haswell is too new for it. -march=native -mtune=haswell uses just ret, so it does know the name haswell.
  • gcc 5.1 and later: uses ret with -march=native (on Haswell). Still uses rep ret when -march isn't specified.


In the first one,

12:   66 39 47 02             cmp    %ax,0x2(%rdi)

the cmp is a combined mov and cmp instruction (it is very likely to decode into two instructions in the microarchitecture's internal instruction set).

The atomic variant does a separate read of now_serving with

32:   0f b7 07                movzwl (%rdi),%eax

and then does the same compare with

35:   66 39 c2                cmp    %ax,%dx