
Why is thread local storage so slow?


The speed depends on the TLS implementation.

Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.

For the pointer lookup you need help from the scheduler, though: on a task switch, the scheduler must update the pointer so it points at the TLS data of the current task.

Another fast way to implement TLS is via the memory management unit. Here the TLS data is treated like any other data, with the exception that TLS variables are allocated in a special segment. On a task switch, the scheduler maps the correct chunk of memory into the address space of the task.

If the scheduler does not support either of these methods, the compiler/library has to do the following:

  • Get the current ThreadId
  • Take a semaphore
  • Look up the pointer to the TLS block by the ThreadId (this may use a map or similar)
  • Release the semaphore
  • Return that pointer

Obviously, doing all of this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.

The semaphore is, by the way, required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and thus allocating a new TLS block and modifying the data structure).
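To make that slow path concrete, here is a minimal D sketch of such a scheme (the names tlsBlocks and slowTlsPointer are invented for illustration; this is not any particular runtime's actual implementation):

    import core.thread : Thread, ThreadID;

    // Shared map from thread id to that thread's TLS block.
    __gshared void*[ThreadID] tlsBlocks;

    void* slowTlsPointer()
    {
        auto id = Thread.getThis().id;   // 1. get the current thread id
        synchronized                     // 2. take a global lock (the "semaphore")
        {
            auto p = id in tlsBlocks;    // 3. look up this thread's TLS block
            return p ? *p : null;        // 5. return it (null if none allocated yet)
        }                                // 4. the lock is released on scope exit
    }

Every single access pays for the map lookup and the locking, which is why this scheme is so much slower than a segment-register-relative load.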

Unfortunately it's not uncommon to see the slow TLS implementation in practice.


Thread locals in D are really fast. Here are my tests.

64 bit Ubuntu, core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64

    // this loop takes 0m0.630s
    void main(){
        int a; // register allocated
        for( int i=1000*1000*1000; i>0; i-- ){
            a+=9;
        }
    }

    // this loop takes 0m1.875s
    int a; // thread local in D, not static
    void main(){
        for( int i=1000*1000*1000; i>0; i-- ){
            a+=9;
        }
    }

So we lose only about 1.2 seconds of one CPU core per 1000*1000*1000 thread local accesses. Thread locals are accessed through the %fs register, so there are only a couple of processor instructions involved:

Disassembling with objdump -d:

- this is the local variable in the %ecx register (loop counter in %eax):

   8:   31 c9                   xor    %ecx,%ecx
   a:   b8 00 ca 9a 3b          mov    $0x3b9aca00,%eax
   f:   83 c1 09                add    $0x9,%ecx
  12:   ff c8                   dec    %eax
  14:   85 c0                   test   %eax,%eax
  16:   75 f7                   jne    f <_Dmain+0xf>

- this is the thread local, the %fs register is used for indirection, %edx is the loop counter:

   6:   ba 00 ca 9a 3b          mov    $0x3b9aca00,%edx
   b:   64 48 8b 04 25 00 00    mov    %fs:0x0,%rax
  12:   00 00
  14:   48 8b 0d 00 00 00 00    mov    0x0(%rip),%rcx        # 1b <_Dmain+0x1b>
  1b:   83 04 08 09             addl   $0x9,(%rax,%rcx,1)
  1f:   ff ca                   dec    %edx
  21:   85 d2                   test   %edx,%edx
  23:   75 e6                   jne    b <_Dmain+0xb>

Maybe the compiler could be even more clever and cache the thread local in a register before the loop and write it back to the thread local at the end (it would be interesting to compare with the gdc compiler), but even now things look very good IMHO.
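For comparison, here is what caching the thread local by hand looks like (a sketch of the same benchmark, not re-measured here); the loop body then works on a plain local and the TLS slot is touched only twice:

    // Same benchmark, but the thread local is read once before the loop
    // and written back once after it.
    int a; // thread local in D, not static

    void main()
    {
        int tmp = a;                          // one TLS read
        for( int i=1000*1000*1000; i>0; i-- )
        {
            tmp += 9;                         // works on a register/stack copy
        }
        a = tmp;                              // one TLS write
    }

With -O -release -inline this should behave much like the register-allocated version inside the loop.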


One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.

To see what kind of code is generated for TLS, compile and obj2asm this code:

    __thread int x;
    int foo() { return x; }

TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But, in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead.

I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.
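As a hypothetical sketch of that pattern in D (Pool, threadPool and fillBuffers are invented names, not the original pool code), the thread-local handle is read once and the hot loop only uses the local copy:

    struct Pool
    {
        // Stand-in allocator; the real pool logic is not shown here.
        ubyte[] alloc(size_t n) { return new ubyte[n]; }
    }

    Pool* threadPool;   // module-level variables are thread local in D

    static this()
    {
        threadPool = new Pool;        // each thread gets its own pool instance
    }

    void fillBuffers(ubyte[][] bufs, size_t size)
    {
        auto pool = threadPool;       // one TLS access up front
        foreach (ref b; bufs)
            b = pool.alloc(size);     // no TLS indirection inside the loop
    }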