What is the performance penalty of C++11 thread_local variables in GCC 4.8?
(Disclaimer: I don't know much about the internals of GCC, so this is also an educated guess.)
The dynamic thread_local initialization was added in commit 462819c. One of the changes is:
* semantics.c (finish_id_expression): Replace use of thread_local
variable with a call to its wrapper.
So the run-time penalty is that every reference to the thread_local variable becomes a function call. Let's check with a simple test case:
    // 3.cpp
    extern thread_local int tls;

    int main()
    {
        tls += 37;   // line 6
        tls &= 11;   // line 7
        tls ^= 3;    // line 8
        return 0;
    }

    // 4.cpp
    thread_local int tls = 42;
When compiled*, we see that every use of the tls reference becomes a function call to _ZTW3tls, which lazily initializes the variable once:
    00000000004005b0 <main>:
    main():
      4005b0:   55                          push   rbp
      4005b1:   48 89 e5                    mov    rbp,rsp
      4005b4:   e8 26 00 00 00              call   4005df <_ZTW3tls>    // line 6
      4005b9:   8b 10                       mov    edx,DWORD PTR [rax]
      4005bb:   83 c2 25                    add    edx,0x25
      4005be:   89 10                       mov    DWORD PTR [rax],edx
      4005c0:   e8 1a 00 00 00              call   4005df <_ZTW3tls>    // line 7
      4005c5:   8b 10                       mov    edx,DWORD PTR [rax]
      4005c7:   83 e2 0b                    and    edx,0xb
      4005ca:   89 10                       mov    DWORD PTR [rax],edx
      4005cc:   e8 0e 00 00 00              call   4005df <_ZTW3tls>    // line 8
      4005d1:   8b 10                       mov    edx,DWORD PTR [rax]
      4005d3:   83 f2 03                    xor    edx,0x3
      4005d6:   89 10                       mov    DWORD PTR [rax],edx
      4005d8:   b8 00 00 00 00              mov    eax,0x0              // line 9
      4005dd:   5d                          pop    rbp
      4005de:   c3                          ret

    00000000004005df <_ZTW3tls>:
    _ZTW3tls():
      4005df:   55                          push   rbp
      4005e0:   48 89 e5                    mov    rbp,rsp
      4005e3:   b8 00 00 00 00              mov    eax,0x0
      4005e8:   48 85 c0                    test   rax,rax
      4005eb:   74 05                       je     4005f2 <_ZTW3tls+0x13>
      4005ed:   e8 0e fa bf ff              call   0 <tls>              // initialize the TLS
      4005f2:   64 48 8b 14 25 00 00 00 00  mov    rdx,QWORD PTR fs:0x0
      4005fb:   48 c7 c0 fc ff ff ff        mov    rax,0xfffffffffffffffc
      400602:   48 01 d0                    add    rax,rdx
      400605:   5d                          pop    rbp
      400606:   c3                          ret
Compare it with the __thread version, which won't have this extra wrapper:
    00000000004005b0 <main>:
    main():
      4005b0:   55                      push   rbp
      4005b1:   48 89 e5                mov    rbp,rsp
      4005b4:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc    // line 6
      4005bb:   64 8b 00                mov    eax,DWORD PTR fs:[rax]
      4005be:   8d 50 25                lea    edx,[rax+0x25]
      4005c1:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc
      4005c8:   64 89 10                mov    DWORD PTR fs:[rax],edx
      4005cb:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc    // line 7
      4005d2:   64 8b 00                mov    eax,DWORD PTR fs:[rax]
      4005d5:   89 c2                   mov    edx,eax
      4005d7:   83 e2 0b                and    edx,0xb
      4005da:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc
      4005e1:   64 89 10                mov    DWORD PTR fs:[rax],edx
      4005e4:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc    // line 8
      4005eb:   64 8b 00                mov    eax,DWORD PTR fs:[rax]
      4005ee:   89 c2                   mov    edx,eax
      4005f0:   83 f2 03                xor    edx,0x3
      4005f3:   48 c7 c0 fc ff ff ff    mov    rax,0xfffffffffffffffc
      4005fa:   64 89 10                mov    DWORD PTR fs:[rax],edx
      4005fd:   b8 00 00 00 00          mov    eax,0x0                   // line 9
      400602:   5d                      pop    rbp
      400603:   c3                      ret
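To make the shape of the _ZTW3tls wrapper easier to see, here is a rough hand-written C++ equivalent. It is only a sketch: the names tls_storage, tls_initialized, tls_init, tls_init_ref and tls_wrapper are hypothetical stand-ins, not GCC symbols, and the initializer is modelled as dynamic so that the "call the init function" branch is actually exercised.

```cpp
#include <cassert>

// Hypothetical stand-ins for what the compiler emits; none of these names
// exist in GCC. The real wrapper symbol in the listing is _ZTW3tls.
thread_local int  tls_storage     = 0;      // the raw per-thread slot
thread_local bool tls_initialized = false;  // per-thread "already done" flag

// Stand-in for the init function: runs the initializer once per thread.
void tls_init() {
    if (!tls_initialized) {
        tls_initialized = true;
        tls_storage = 42;                   // models "thread_local int tls = 42;"
    }
}

// Stand-in for the reference to the init function that the wrapper tests.
// In the listing above that reference is 0, so the call is skipped there.
void (*tls_init_ref)() = tls_init;

// Stand-in for the wrapper: test, maybe initialize, return the address.
int& tls_wrapper() {
    if (tls_init_ref != nullptr) tls_init_ref();
    assert(tls_initialized);                // holds here because the ref is set
    return tls_storage;
}

// What main() in 3.cpp effectively does: one wrapper call per use.
int use_tls_three_times() {
    tls_wrapper() += 37;
    tls_wrapper() &= 11;
    tls_wrapper() ^= 3;
    return tls_wrapper();
}
```

The point of the sketch is the cost model: each use pays for a call plus a null test, on top of the address computation that the __thread version does inline.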
This wrapper is not needed in every use case of thread_local, though. This can be seen in decl2.c. The wrapper is generated only when:
It is not function-local, and,
- It is extern (the example shown above), or
- The type has a non-trivial destructor (which is not allowed for __thread variables), or
- The variable is initialized by a non-constant expression (which is also not allowed for __thread variables).
In all other use cases, it behaves the same as __thread. That means, unless you have some extern __thread variables, you could replace all __thread with thread_local without any loss of performance.
*: I compiled with -O0 because the inliner makes the function boundary less visible. Even at -O3, those initialization checks remain.
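As an illustration of the conditions listed above, the following sketch puts one variable of each kind in a single file. All names (compute_seed, plain_counter, seeded, name, next_id, touch_all) are made up for this example, and the comments describe what those conditions predict for GCC 4.8, not something verified against every compiler version.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper; it is not constexpr, so using it as an initializer
// makes the initialization dynamic.
int compute_seed() { return 7; }

// Case A: defined in this TU, constant initializer, trivial destructor.
// Expected to compile like __thread, with no wrapper.
thread_local int plain_counter = 0;

// Case B: non-constant initializer, so a wrapper is expected.
thread_local int seeded = compute_seed();

// Case C: non-trivial destructor, so a wrapper is expected.
thread_local std::string name = "worker";

// Case D would be "extern thread_local int defined_elsewhere;": a wrapper is
// expected because this TU cannot see how the variable is initialized.
// It is omitted here so the file links on its own.

// Case E: function-local, so no wrapper; any initialization is handled
// inside the function itself.
int next_id() {
    thread_local int id = 0;
    return ++id;
}

// Touches every variable once so each access path gets generated.
int touch_all() {
    plain_counter += 1;
    assert(seeded == 7);   // dynamic initialization has run before this use
    return plain_counter + seeded + static_cast<int>(name.size()) + next_id();
}
```

Compiling a file like this with -S and searching the output for _ZTW symbols is one way to check which variables actually received a wrapper on a given compiler.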
C++11 thread_local has the same runtime effect as the __thread specifier (__thread is not part of the C standard; thread_local is part of the C++ standard).
The access cost depends on where the TLS variable (declared with the __thread specifier) is declared:
- if the TLS variable is declared in an executable, then access is fast
- if the TLS variable is declared within shared library code (compiled with the -fPIC compiler option) and the -ftls-model=initial-exec compiler option is specified, then access is fast; however, the following limitation applies: the shared library cannot reliably be loaded via dlopen/dlsym (dynamic loading), so the only safe way of using the library is to link with it at build time (linker option -l<libraryname>)
- if the TLS variable is declared within a shared library (-fPIC compiler option set) and no TLS model is specified, then access is much slower, as the general dynamic TLS model is assumed; here each access to a TLS variable results in a call to __tls_get_addr(). This is the default case because it places no limits on how the shared library is used.
Sources: ELF Handling For Thread-Local Storage by Ulrich Drepper, https://www.akkadia.org/drepper/tls.pdf (this text also lists the code that is generated for the supported target platforms).
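Besides the -ftls-model command-line option, GCC and Clang also accept a per-variable tls_model attribute, which can be handy when only a few hot variables need the faster model. The sketch below assumes a GCC-compatible compiler; hits and record_hit are hypothetical names.

```cpp
#include <cassert>

// tls_model is a GCC/Clang variable attribute (a compiler extension, not
// standard C++). "initial-exec" requests the faster access sequence; inside a
// shared library it carries the same dynamic-loading limitation described above.
thread_local int hits __attribute__((tls_model("initial-exec"))) = 0;

// Hypothetical helper: bumps the calling thread's private counter.
int record_hit() {
    ++hits;
    assert(hits > 0);      // each thread starts from its own zero
    return hits;
}
```

In an executable the attribute should make little difference, since the compiler can already pick a fast model there; the choice matters mainly for code built with -fPIC.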
If the variable is defined in the current TU, the inliner will take care of the overhead. I expect that this will be true of most uses of thread_local.
For extern variables, if the programmer can be sure that no use of the variable in a non-defining TU needs to trigger dynamic initialization (either because the variable is statically initialized, or a use of the variable in the defining TU will be executed before any uses in another TU), they can avoid this overhead with the -fno-extern-tls-init option.
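A minimal sketch of a layout where that option would be safe to use: the variable has a constant initializer, so no use in another TU ever needs to trigger dynamic initialization. The three files are shown as sections of one listing, and the file names, request_count and bump_request_count are all assumptions made for the example.

```cpp
#include <cassert>

// --- counter.h (hypothetical header) ---
extern thread_local int request_count;

// --- user.cpp (hypothetical non-defining TU) ---
// Without the option, GCC would route this access through the wrapper.
// Possible build line for this TU, with assumed file names:
//   g++ -std=c++11 -fno-extern-tls-init -c user.cpp
int bump_request_count() {
    ++request_count;
    assert(request_count >= 1);
    return request_count;
}

// --- counter.cpp (hypothetical defining TU) ---
// Constant initializer: statically initialized, no init function required.
thread_local int request_count = 0;
```

If the definition later gains a dynamic initializer or a non-trivial destructor, the precondition no longer holds and the option should be dropped again.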