
.NET: ThreadStatic vs lock { }. Why ThreadStaticAttribute degrades performance?


In a RELEASE build there seems to be almost no [ThreadStatic] performance penalty (only a slight one on modern CPUs).

Here is the disassembly for ms_Acc += one; in each configuration (optimization is enabled for RELEASE):

No [ThreadStatic], DEBUG:

00000060  mov         eax,dword ptr [ebp-40h]
00000063  add         dword ptr ds:[00511718h],eax

No [ThreadStatic], RELEASE:

00000051  mov         eax,dword ptr [00040750h]
00000057  add         eax,dword ptr [rsp+20h]
0000005b  mov         dword ptr [00040750h],eax

[ThreadStatic], DEBUG:

00000066  mov         edx,1
0000006b  mov         ecx,4616E0h
00000070  call        664F7450
00000075  mov         edx,1
0000007a  mov         ecx,4616E0h
0000007f  mov         dword ptr [ebp-50h],eax
00000082  call        664F7450
00000087  mov         edx,dword ptr [eax+18h]
0000008a  add         edx,dword ptr [ebp-40h]
0000008d  mov         eax,dword ptr [ebp-50h]
00000090  mov         dword ptr [eax+18h],edx

[ThreadStatic], RELEASE:

00000058  mov         edx,1
0000005d  mov         rcx,7FF001A3F28h
00000067  call        FFFFFFFFF6F9F740
0000006c  mov         qword ptr [rsp+30h],rax
00000071  mov         rbx,qword ptr [rsp+30h]
00000076  mov         ebx,dword ptr [rbx+20h]
00000079  add         ebx,dword ptr [rsp+20h]
0000007d  mov         edx,1
00000082  mov         rcx,7FF001A3F28h
0000008c  call        FFFFFFFFF6F9F740
00000091  mov         qword ptr [rsp+38h],rax
00000096  mov         rax,qword ptr [rsp+38h]
0000009b  mov         dword ptr [rax+20h],ebx


You have two lines of code that update ms_Acc. In the lock case, a single lock surrounds both of them, while in the ThreadStatic case the thread-static lookup happens once for each access to ms_Acc, i.e. twice for each iteration of your loop. This is generally the benefit of using lock: you get to choose the granularity you want. I am guessing that the RELEASE build optimised this difference away.
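To make the granularity point concrete, here is a minimal sketch of one lock covering two updates. It is not the code from the question: the class name, the ms_Lock object, the RunParallel helper and the choice of `+=` for both updates are assumptions made for illustration.

```csharp
using System;
using System.Threading.Tasks;

static class LockGranularitySketch
{
    // Hypothetical lock object; the question's actual lock target is not shown.
    private static readonly object ms_Lock = new object();

    // Shared field (no [ThreadStatic]), so every update needs the lock.
    private static int ms_Acc;

    // One lock acquisition covers both updates of ms_Acc.
    public static void UpdateTwiceUnderOneLock(int one)
    {
        lock (ms_Lock)
        {
            ms_Acc += one;
            ms_Acc += one;
        }
    }

    // Runs the update from several parallel workers and returns the total.
    public static int RunParallel(int workers, int iterationsPerWorker)
    {
        lock (ms_Lock) { ms_Acc = 0; }

        Parallel.For(0, workers, _ =>
        {
            for (int i = 0; i < iterationsPerWorker; i++)
            {
                UpdateTwiceUnderOneLock(1);
            }
        });

        lock (ms_Lock) { return ms_Acc; }
    }

    public static void Main()
    {
        // Each iteration adds 2, so the total is workers * iterations * 2.
        Console.WriteLine(RunParallel(4, 100000));
    }
}
```

With [ThreadStatic] there is no equivalent knob: every read or write of the field goes through the thread-local lookup, unless the JIT or the programmer hoists it.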

I would be interested to see whether the performance becomes very similar, or identical, if you change the body of the for loop to make a single access to ms_Acc.
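One way to try that experiment is sketched below. The two loop bodies are assumptions, since the question's original loop is not reproduced here; the method names and iteration count are likewise made up. Timings depend heavily on the runtime, JIT and build configuration, so treat any numbers as indicative only.

```csharp
using System;
using System.Diagnostics;

static class ThreadStaticAccessSketch
{
    // Stand-in for the ms_Acc field discussed above.
    [ThreadStatic]
    private static int ms_Acc;

    // Two separate accesses to the thread-static field per iteration.
    public static int TwoAccessesPerIteration(int iterations, int one)
    {
        ms_Acc = 0;
        for (int i = 0; i < iterations; i++)
        {
            ms_Acc += one;
            ms_Acc += one;
        }
        return ms_Acc;
    }

    // Same arithmetic, but only one access to the field per iteration.
    public static int OneAccessPerIteration(int iterations, int one)
    {
        ms_Acc = 0;
        for (int i = 0; i < iterations; i++)
        {
            ms_Acc += one + one;
        }
        return ms_Acc;
    }

    public static void Main()
    {
        const int iterations = 100000000;

        Stopwatch sw = Stopwatch.StartNew();
        int a = TwoAccessesPerIteration(iterations, 1);
        sw.Stop();
        Console.WriteLine("two accesses: " + sw.ElapsedMilliseconds + " ms, result " + a);

        sw.Restart();
        int b = OneAccessPerIteration(iterations, 1);
        sw.Stop();
        Console.WriteLine("one access:   " + sw.ElapsedMilliseconds + " ms, result " + b);
    }
}
```

Both methods return the same value, so any timing gap between them can be attributed to the number of thread-static accesses rather than to different work.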