Threadsafe lazy initialization: static vs std::call_once vs double checked locking Threadsafe lazy initialization: static vs std::call_once vs double checked locking multithreading multithreading

Threadsafe lazy initialization: static vs std::call_once vs double checked locking


GCC uses platform specific tricks to avoid atomic operations entirely on the fast path, leveraging the fact that it can do analysis of static better than call_once or double-checking.

Because double-checking uses atomics as its method of avoiding race cases, it has to pay the price of an acquire every time. It's not a high price, but it's a price.

It has to pay this because atomics have to remain atomic in all cases, even difficult operations like compare-exchange. This makes it very hard to optimize out. Generally speaking, the compiler has to leave it in, just in case you use the variable for more than just a double-lock. It has no easy way of proving that you never use one of the more complicated operations on your atomic.

On the other hand, static is highly specialized, and part of the language. It was designed, from the start, to be very easy to provably initialize. Accordingly, the compiler can take shortcuts that were not available to the more generic version. The compiler actually emits the following code for a static:

a simple function:

void foo() {    static X x;}

is rewritten inside GCC to:

void foo() {    static X x;    static guard x_is_initialized;    if ( __cxa_guard_acquire(x_is_initialized) ) {        X::X();        x_is_initialized = true;        __cxa_guard_release(x_is_initialized);    }}

Which looks a lot like a double-checked lock. However, the compiler gets to cheat a little here. It knows the user can never write use a cxa_guard directly. It knows that it is only used in the special circumstances where the compiler chooses to use it. Thus, with that extra information, it can save some time. The CXA guard specifications, as distributed as they are, all share a common rule: __cxa_guard_acquire will never modify the first byte of the guard, and __cxa_guard__release will set it to non-zero.

This means each guard has to be monotonic, and it specifies exactly what operations will do so. Accordingly it can take advantage of existing race-case protections within the host platform. On x86, for instance, the LL/SS protection guaranteed by the strongly synchronized CPUs turns out to be enough to do this acquire/release pattern, so it can do a raw read of that first byte when it does its double locking, rather than an acquire-read. This is only possible because GCC isn't using the C++ atomic API to do its double locking -- it is using a platform specific approach.

GCC cannot optimize out the atomic in the general case. On architectures which are designed to be less synchronized (such as those designed for 1024+ cores), GCC doesn't get to rely on the archetecture to do LL/SS for it. Thus GCC is forced to actually emit the atomic. However, on common platforms such as x86 and x64, it can be faster.

call_once can have the efficiency of GCC's statics, because it similarly limits the number of operations which can be done to a once_flag to a fraction of the functions that can be applied to an atomic. The tradeoff is that statics are far more convenient to use, when they are applicable, but call_once works in many cases where statics are insufficient (such as a once_flag owned by a dynamically generated object).

There is a slight difference in performance between static and call_once on these higher platforms. Many of these platforms, while not offering LL/SS, will at least offer non-tearing reads of an integer. These platforms can use this, and a thread-specific-pointer, to do per-thread epoch counting to avoid atomics. This is sufficient for static or call_once, but depends on the counter not rolling over. If you do not have a tearing-free 64-bit integer, call_once has to worry about rollover. The implementation may or may not worry about this. If it ignores this issue, it can be as fast as statics. If it pays attention to that issue, it has to be as slow as atomics. Static knows at compile time how many static variables/blocks there are, so it can prove there is no rollover at compile time (or at least be darn confident!)