Using ThreadStatic to replace expensive locals -- good idea?


[ThreadStatic] is no free lunch. Every access to the variable has to go through a helper function in the CLR (JIT_GetThreadFieldAddr_Primitive/Objref) instead of being compiled inline by the jitter. It also isn't a true substitute for a local variable: with recursion, every level of the call shares the same array, so an inner call overwrites the outer call's data. That is going to bite. You really have to profile this yourself; guessing at performance with that much CLR code in the loop isn't feasible.
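To make the recursion problem concrete, here is a minimal sketch. The class and method names are hypothetical, not taken from the question: a recursive method borrows a [ThreadStatic] scratch array instead of allocating its own, so every level of the recursion writes into the same array.

```csharp
using System;

// Hypothetical example: a recursive method that borrows a [ThreadStatic]
// scratch array instead of allocating its own local array.
public static class RecursionHazard
{
    [ThreadStatic]
    private static int[] _scratch;

    // Stores 'depth' in the shared thread-static array, recurses, then reads it back.
    public static int ReadBackWithThreadStatic(int depth)
    {
        int[] buffer = _scratch ?? (_scratch = new int[1]);
        buffer[0] = depth;
        if (depth > 0)
        {
            ReadBackWithThreadStatic(depth - 1); // the inner call writes into the SAME array
        }
        return buffer[0]; // by now, overwritten by the innermost call
    }

    // Same logic with a genuine local array: each stack frame owns its own copy.
    public static int ReadBackWithLocal(int depth)
    {
        int[] buffer = new int[1];
        buffer[0] = depth;
        if (depth > 0)
        {
            ReadBackWithLocal(depth - 1);
        }
        return buffer[0];
    }

    public static void Main()
    {
        Console.WriteLine(ReadBackWithThreadStatic(3)); // prints 0
        Console.WriteLine(ReadBackWithLocal(3));        // prints 3
    }
}
```

With a real local array each frame keeps its own value; with the shared thread-static buffer, the outer frames see whatever the innermost call left behind.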


I ran a simple benchmark, and ThreadStatic performs better for the parameters described in the question.

As with many algorithms that run a large number of iterations, I suspect this is simply a case of GC overhead hurting the version that allocates a new array on every call.
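One way to check the GC-overhead suspicion is to count generation-0 collections with `GC.CollectionCount(0)` around each strategy. This is an illustrative sketch only: the class name, the static `_sink` field and the iteration count are assumptions, and the actual counts depend on the runtime and GC configuration. Storing every array in a static field keeps the allocations from being optimized away.

```csharp
using System;

// Illustrative sketch: count generation-0 collections caused by each strategy.
public static class GcPressureDemo
{
    private const int ArraySize = 50;

    [ThreadStatic]
    private static int[] _cached;

    // Every array is stored here so the allocations cannot be optimized away.
    private static int[] _sink;

    // Allocates a fresh array on every iteration.
    public static int CountCollectionsAllocating(int iterations)
    {
        int before = GC.CollectionCount(0);
        for (int i = 0; i < iterations; i++)
        {
            _sink = new int[ArraySize];
        }
        return GC.CollectionCount(0) - before;
    }

    // Reuses one lazily created thread-static array.
    public static int CountCollectionsReusing(int iterations)
    {
        int before = GC.CollectionCount(0);
        for (int i = 0; i < iterations; i++)
        {
            _sink = _cached ?? (_cached = new int[ArraySize]);
        }
        return GC.CollectionCount(0) - before;
    }

    public static void Main()
    {
        const int Iterations = 10 * 1000 * 1000;
        Console.WriteLine("Gen-0 collections, allocating: {0}", CountCollectionsAllocating(Iterations));
        Console.WriteLine("Gen-0 collections, reusing:    {0}", CountCollectionsReusing(Iterations));
    }
}
```

The allocating loop produces garbage on every iteration and so triggers collections; the reusing loop allocates at most one array per thread.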

Update

The tests now also iterate over the array, to model minimal use of the array reference, and include a variant that reads the elements through the ThreadStatic field itself, in addition to the earlier test where the reference was copied to a local:

```
Iterations : 10,000,000

Local ArrayRef          (- array iteration) : 330.17ms
Local ArrayRef          (- array iteration) : 327.03ms
Local ArrayRef          (- array iteration) : 1382.86ms
Local ArrayRef          (- array iteration) : 1425.45ms
Local ArrayRef          (- array iteration) : 1434.22ms
TS    CopyArrayRefLocal (- array iteration) : 107.64ms
TS    CopyArrayRefLocal (- array iteration) : 92.17ms
TS    CopyArrayRefLocal (- array iteration) : 92.42ms
TS    CopyArrayRefLocal (- array iteration) : 92.07ms
TS    CopyArrayRefLocal (- array iteration) : 92.10ms
Local ArrayRef          (+ array iteration) : 1740.51ms
Local ArrayRef          (+ array iteration) : 1647.26ms
Local ArrayRef          (+ array iteration) : 1639.80ms
Local ArrayRef          (+ array iteration) : 1639.10ms
Local ArrayRef          (+ array iteration) : 1646.56ms
TS    CopyArrayRefLocal (+ array iteration) : 368.03ms
TS    CopyArrayRefLocal (+ array iteration) : 367.19ms
TS    CopyArrayRefLocal (+ array iteration) : 367.22ms
TS    CopyArrayRefLocal (+ array iteration) : 368.20ms
TS    CopyArrayRefLocal (+ array iteration) : 367.37ms
TS    TSArrayRef        (+ array iteration) : 360.45ms
TS    TSArrayRef        (+ array iteration) : 359.97ms
TS    TSArrayRef        (+ array iteration) : 360.48ms
TS    TSArrayRef        (+ array iteration) : 360.03ms
TS    TSArrayRef        (+ array iteration) : 359.99ms
```

Code:

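```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.CodeAnalysis;
using NUnit.Framework;

[TestFixture]
public class ThreadStaticBenchmark // fixture name is illustrative
{
    [ThreadStatic]
    private static int[] _array;

    [Test]
    public void measure_thread_static_performance()
    {
        const int TestIterations = 5;
        const int Iterations = (10 * 1000 * 1000);
        const int ArraySize = 50;

        // Runs one test TestIterations times and prints each timing.
        Action<string, Action> time = (name, test) =>
        {
            for (int i = 0; i < TestIterations; i++)
            {
                TimeSpan elapsed = TimeTest(test, Iterations);
                Console.WriteLine("{0} : {1:F2}ms", name, elapsed.TotalMilliseconds);
            }
        };

        int[] array = null;
        int j = 0;

        // 1: allocate a new array on every call
        Action test1 = () =>
        {
            array = new int[ArraySize];
        };
        // 2: lazily create the thread-static array once, copy the reference to a local
        Action test2 = () =>
        {
            array = _array ?? (_array = new int[ArraySize]);
        };
        // 3: as test1, then read every element through the local reference
        Action test3 = () =>
        {
            array = new int[ArraySize];
            for (int i = 0; i < ArraySize; i++)
            {
                j = array[i];
            }
        };
        // 4: as test2, then read every element through the local reference
        Action test4 = () =>
        {
            array = _array ?? (_array = new int[ArraySize]);
            for (int i = 0; i < ArraySize; i++)
            {
                j = array[i];
            }
        };
        // 5: as test2, but read every element through the thread-static field itself
        Action test5 = () =>
        {
            array = _array ?? (_array = new int[ArraySize]);
            for (int i = 0; i < ArraySize; i++)
            {
                j = _array[i];
            }
        };

        Console.WriteLine("Iterations : {0:0,0}\r\n", Iterations);
        time("Local ArrayRef          (- array iteration)", test1);
        time("TS    CopyArrayRefLocal (- array iteration)", test2);
        time("Local ArrayRef          (+ array iteration)", test3);
        time("TS    CopyArrayRefLocal (+ array iteration)", test4);
        time("TS    TSArrayRef        (+ array iteration)", test5);

        // Use the results so the work cannot be treated as dead code.
        Console.WriteLine(j);
        GC.KeepAlive(array);
    }

    [SuppressMessage("Microsoft.Reliability", "CA2001:AvoidCallingProblematicMethods", MessageId = "System.GC.Collect")]
    private static TimeSpan TimeTest(Action action, int iterations)
    {
        Action gc = () =>
        {
            GC.Collect();
            // GC.WaitForFullGCComplete() only waits when full-GC notification has been
            // registered; waiting for pending finalizers is the usual idiom here.
            GC.WaitForPendingFinalizers();
        };
        Action empty = () => { };

        // Measure the loop plus an empty delegate call, to subtract as overhead.
        Stopwatch stopwatch1 = Stopwatch.StartNew();
        for (int j = 0; j < iterations; j++)
        {
            empty();
        }
        TimeSpan loopElapsed = stopwatch1.Elapsed;

        gc();
        action(); // JIT
        action(); // warm-up

        Stopwatch stopwatch2 = Stopwatch.StartNew();
        for (int j = 0; j < iterations; j++) action();
        gc(); // charge the cost of collecting this test's garbage to the test
        TimeSpan testElapsed = stopwatch2.Elapsed;

        return (testElapsed - loopElapsed);
    }
}
```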
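The benchmark creates the thread-static array lazily (`_array ?? (_array = new int[ArraySize])`) rather than with a field initializer, and that is the safe way to do it. A field initializer on a [ThreadStatic] field runs only once, as part of the static constructor, so only the thread that triggers type initialization gets the array; every other thread sees null. A small sketch showing the difference (class and member names are illustrative):

```csharp
using System;
using System.Threading;

// Illustrative class: contrasts a [ThreadStatic] field initializer with
// lazy per-thread initialization.
public static class ThreadStaticInitDemo
{
    // This initializer runs once, inside the static constructor, on whichever
    // thread first triggers type initialization. Other threads see null.
    [ThreadStatic]
    private static int[] _eager = new int[50];

    // No initializer: each thread creates its own array on first use.
    [ThreadStatic]
    private static int[] _lazy;

    private static int[] GetLazy() => _lazy ?? (_lazy = new int[50]);

    // Reads _eager on this thread (forcing type initialization no later than here),
    // then reports whether a newly started thread sees null.
    public static bool EagerIsNullOnNewThread()
    {
        GC.KeepAlive(_eager);
        bool isNull = false;
        var worker = new Thread(() => { isNull = _eager == null; });
        worker.Start();
        worker.Join();
        return isNull;
    }

    // Reports whether a newly started thread gets its own 50-element array.
    public static bool LazyWorksOnNewThread()
    {
        bool ok = false;
        var worker = new Thread(() => { ok = GetLazy().Length == 50; });
        worker.Start();
        worker.Join();
        return ok;
    }

    public static void Main()
    {
        Console.WriteLine("Eager field null on new thread: {0}", EagerIsNullOnNewThread()); // True
        Console.WriteLine("Lazy init works on new thread:  {0}", LazyWorksOnNewThread());   // True
    }
}
```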


From results like this, ThreadStatic looks pretty fast. I'm not sure anybody has a definitive answer as to whether it's faster than reallocating a 50-element array, though. That's the kind of thing you'll have to benchmark yourself. :)

I'm somewhat torn on whether it's a "good idea" or not. As long as all the implementation details are kept inside the class (you really don't want the caller to have to worry about it), it's not necessarily a bad idea. But unless benchmarks showed a performance gain from this approach, I would stick to simply allocating the array each time, because that makes the code simpler and easier to read. Since this is the more complicated of the two solutions, I'd need to see some benefit from the complexity before choosing it.
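Keeping the implementation details inside the class might look something like the following minimal sketch. The class, its methods and the sum-of-squares workload are hypothetical, chosen only to give the buffer something to do. Because the buffer comes from a single private helper, switching back to plain allocation after benchmarking is a one-line change that callers never notice.

```csharp
using System;

// Hypothetical class: the per-thread buffer is a private detail; callers
// just call SumOfSquares and never see (or manage) the reuse.
public static class ScratchBufferExample
{
    private const int BufferSize = 50;

    [ThreadStatic]
    private static int[] _buffer;

    // Reusing version: one lazily created array per thread.
    // The simpler alternative to benchmark against would be:
    //     private static int[] GetScratch() => new int[BufferSize];
    private static int[] GetScratch() => _buffer ?? (_buffer = new int[BufferSize]);

    // Fills the first 'count' slots with i*i and sums them. Every slot that is
    // read was written during this call, so stale data from earlier calls is harmless.
    public static long SumOfSquares(int count)
    {
        if (count < 0 || count > BufferSize)
            throw new ArgumentOutOfRangeException(nameof(count));

        int[] scratch = GetScratch(); // copy the reference to a local once
        for (int i = 0; i < count; i++) scratch[i] = i * i;

        long sum = 0;
        for (int i = 0; i < count; i++) sum += scratch[i];
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquares(4));  // prints 14
        Console.WriteLine(SumOfSquares(50)); // prints 40425
    }
}
```

The reusing version is not reentrant: if `SumOfSquares` ever called code that also used `GetScratch()` while `scratch` was still in use, both would share the same array.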