Parallelizing a very tight loop


You should use one of the Parallel.ForEach overloads that takes thread-local state.

Each separate partition of a parallelized loop has its own local state, so it doesn't need synchronization. As a final action you aggregate every local state into the final value. This step requires synchronization, but it runs only once per partition instead of once per iteration.

Instead of

```csharp
Parallel.ForEach(
    buffer,
    parallelOptions,
    thisByte => Interlocked.Increment(ref histocopy[thisByte]));
```

you can use

```csharp
Parallel.ForEach(
    buffer,
    parallelOptions,
    () => new int[histocopy.Length], // initialize local histogram
    (thisByte, state, local) =>
    {
        local[thisByte]++; // increment local histogram
        return local;      // the body must return the local state
    },
    local =>
    {
        lock (histocopy) // add local histogram to global
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });
```

It might also be a good idea to start with the default options for partition size and parallel options and optimize from there.


I don't have any experience with Parallel, but I whipped up a test with manual threading, and it works perfectly:

```csharp
private class Worker
{
    public Thread Thread;
    public int[] Accumulator = new int[256];
    public int Start, End;
    public byte[] Data;

    public Worker(int start, int end, byte[] buf)
    {
        this.Start = start;
        this.End = end;
        this.Data = buf;
        this.Thread = new Thread(Func);
        this.Thread.Start();
    }

    public void Func()
    {
        for (int i = Start; i < End; i++)
            this.Accumulator[this.Data[i]]++;
    }
}

int NumThreads = 8;
int len = buf.Length / NumThreads;
var workers = new Worker[NumThreads];
for (int i = 0; i < NumThreads; i++)
    workers[i] = new Worker(i * len, i * len + len, buf);
foreach (var w in workers)
    w.Thread.Join();

int[] accumulator = new int[256];
for (int i = 0; i < workers.Length; i++)
    for (int j = 0; j < accumulator.Length; j++)
        accumulator[j] += workers[i].Accumulator[j];
```

Results on my Q720 mobile i7:

Single threaded time = 5.50s
4 threads = 1.90s
8 threads = 1.24s

Looks like it's working to me. Interestingly, even though the hyper-threaded cores share a cache, 8 threads was actually a bit faster than 4.


I have no idea whether this will be faster, but a small observation:

What if you sort all the elements in buffer[]? That would mean different cores no longer touch the same histogram entries. If the performance scales, you can then increase the core count and throughput should go up roughly linearly. Note that you really need to handle the firstRange/secondRange split a bit more carefully, since you don't want two elements with the same value landing in different ranges.

```csharp
private static void CalculateHistogram(uint[] histo, byte[] buffer)
{
    Array.Sort(buffer); // so the indexes into histo play well with the cache

    // todo: rewrite to handle edge cases
    var firstRange = new[] { 0, buffer.Length / 2 }; // [inclusive, exclusive]
    var secondRange = new[] { buffer.Length / 2, buffer.Length };

    // create two tasks for now
    var tasks = new Task[2];
    var taskIdentifier = 0;
    foreach (var range in new[] { firstRange, secondRange })
    {
        var rangeFix = range; // lambda capture
        tasks[taskIdentifier++] = Task.Factory.StartNew(() =>
        {
            for (var i = rangeFix[0]; i < rangeFix[1]; i++)
                ++histo[buffer[i]]; // index by the byte value, not the position
        });
    }
    Task.WaitAll(tasks);
}
```

Quick googling shows that you can use C# with the GPU to sort the numbers even faster, which could give around 3x better performance; worth a try: http://adnanboz.wordpress.com/2011/07/27/faster-sorting-in-c-by-utilizing-gpu-with-nvidia-cuda/

P.S. There are a few more tricks that can bring very substantial performance gains:

1) Remember the concept of false cache sharing - http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
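To make the false-sharing point concrete, here is a minimal sketch (not from the original answer) of per-thread counters padded apart so that no two threads write to the same cache line; the 64-byte line size is an assumption, typical for x86:

```csharp
using System;
using System.Threading;

class FalseSharingDemo
{
    // 8 longs * 8 bytes = 64 bytes, one assumed cache line per slot.
    const int CacheLineLongs = 8;

    static void Main()
    {
        const int threads = 4;
        const int iters = 1_000_000;

        // One counter per thread, a full cache line apart, so the
        // threads' writes never invalidate each other's lines.
        long[] counters = new long[threads * CacheLineLongs];

        var ts = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            int slot = t * CacheLineLongs;
            ts[t] = new Thread(() =>
            {
                for (int i = 0; i < iters; i++)
                    counters[slot]++;
            });
            ts[t].Start();
        }
        foreach (var th in ts) th.Join();

        long total = 0;
        for (int t = 0; t < threads; t++)
            total += counters[t * CacheLineLongs];
        Console.WriteLine(total); // 4000000
    }
}
```

The per-thread histograms in the answers above avoid the problem the same way: each thread owns a whole separate array rather than adjacent elements of a shared one.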

2) Try using the stackalloc keyword and make sure ANY memory allocation is done on the stack. Trust me on this: any memory allocation is painfully slow unless it comes straight from the stack. We are talking about 5x differences.
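As a sketch of how stackalloc could apply here (assuming C# 7.2+ so stackalloc can target a Span without unsafe code; this is my illustration, not code from the answer), the 256-entry per-thread histogram fits comfortably on the stack:

```csharp
using System;

class StackallocHistogram
{
    static void Histogram(byte[] buffer, uint[] histo)
    {
        // Counts live on the stack: no GC allocation in the hot path.
        // stackalloc into a Span is zero-initialized by default.
        Span<uint> counts = stackalloc uint[256];

        foreach (byte b in buffer)
            counts[b]++;

        // Merge the stack-local counts into the shared histogram.
        for (int i = 0; i < 256; i++)
            histo[i] += counts[i];
    }

    static void Main()
    {
        var buf = new byte[] { 1, 1, 2, 255 };
        var histo = new uint[256];
        Histogram(buf, histo);
        Console.WriteLine($"{histo[1]} {histo[2]} {histo[255]}"); // 2 1 1
    }
}
```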

3) You can use C# Mono.Simd to try to SUM the different arrays (the linked example, "Adding 2 arrays together quickly", is the C++ version, but the concept applies to C#).
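For the array-summing idea, here is a hedged sketch using System.Numerics.Vector&lt;T&gt; rather than Mono.Simd (a swap on my part: Vector&lt;T&gt; is the portable SIMD type the JIT maps to SSE/AVX where available); it adds one array into another a vector's worth of lanes at a time, which is exactly the final merge step of the per-thread histograms:

```csharp
using System;
using System.Numerics;

class SimdSum
{
    // Adds b into a element-wise, Vector<int>.Count lanes per step.
    static void Add(int[] a, int[] b)
    {
        int i = 0;
        int width = Vector<int>.Count;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<int>(a, i);
            var vb = new Vector<int>(b, i);
            (va + vb).CopyTo(a, i);
        }
        for (; i < a.Length; i++) // scalar tail for the leftovers
            a[i] += b[i];
    }

    static void Main()
    {
        var a = new int[10];
        var b = new int[10];
        for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 10 * i; }
        Add(a, b);
        Console.WriteLine(string.Join(",", a)); // 0,11,22,33,44,55,66,77,88,99
    }
}
```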