Parallelizing a very tight loop


You should use one of the Parallel.ForEach overloads that takes thread-local state.

Each separate partition of a parallelized loop has its own local state, so it doesn't need synchronization. As a final action you aggregate every local state into the final value. This step requires synchronization, but it runs only once per partition instead of once per iteration.

Instead of

```csharp
Parallel.ForEach(
    buffer,
    parallelOptions,
    thisByte => Interlocked.Increment(ref histocopy[thisByte]));
```

you can use

```csharp
Parallel.ForEach(
    buffer,
    parallelOptions,
    () => new int[histocopy.Length], // initialize local histogram
    (thisByte, state, local) =>
    {
        local[thisByte]++; // increment local histogram
        return local;      // the body must return the local state
    },
    local =>
    {
        lock (histocopy) // add local histogram to global
        {
            for (int idx = 0; idx < histocopy.Length; idx++)
            {
                histocopy[idx] += local[idx];
            }
        }
    });
```

It might also be a good idea to start with the default options for partition size and parallel options and optimize from there.


I don't have any experience with Parallel, but I whipped up a test with manual threading, and it works perfectly:

```csharp
private class Worker
{
    public Thread Thread;
    public int[] Accumulator = new int[256];
    public int Start, End;
    public byte[] Data;

    public Worker(int start, int end, byte[] buf)
    {
        this.Start = start;
        this.End = end;
        this.Data = buf;
        this.Thread = new Thread(Func);
        this.Thread.Start();
    }

    public void Func()
    {
        for (int i = Start; i < End; i++)
            this.Accumulator[this.Data[i]]++;
    }
}

int NumThreads = 8;
int len = buf.Length / NumThreads;
var workers = new Worker[NumThreads];
for (int i = 0; i < NumThreads; i++)
    workers[i] = new Worker(i * len, i * len + len, buf);
foreach (var w in workers)
    w.Thread.Join();

int[] accumulator = new int[256];
for (int i = 0; i < workers.Length; i++)
    for (int j = 0; j < accumulator.Length; j++)
        accumulator[j] += workers[i].Accumulator[j];
```

Results on my Q720 mobile i7:

Single threaded time = 5.50s
4 threads = 1.90s
8 threads = 1.24s

Looks like it's working to me. Interestingly, even though the hyper-threaded cores share a cache, 8 threads was actually a bit faster than 4.


I have no idea whether this will be faster, but a small observation:

What if you sort all the elements in buffer[]? That would mean different cores no longer touch the same histogram entries. If the performance scales, you can then increase the core count and throughput should go up roughly linearly. Note that you really need to handle the firstRange/secondRange split a bit more carefully, since you don't want two elements with the same value landing in different ranges.

```csharp
private static void CalculateHistogram(uint[] histo, byte[] buffer)
{
    Array.Sort(buffer); // so the indexes into histo play well with the cache

    // todo: rewrite to handle edge cases
    var firstRange = new[] { 0, buffer.Length / 2 }; // [inclusive, exclusive]
    var secondRange = new[] { buffer.Length / 2, buffer.Length };

    // create two tasks for now
    var tasks = new Task[2];
    var taskIdentifier = 0;
    foreach (var range in new[] { firstRange, secondRange })
    {
        var rangeFix = range; // lambda capture
        tasks[taskIdentifier++] = Task.Factory.StartNew(() =>
        {
            for (var i = rangeFix[0]; i < rangeFix[1]; i++)
                ++histo[buffer[i]]; // index by the byte value, not the position
        });
    }
    Task.WaitAll(tasks);
}
```

Quick googling shows that you can use C# with the GPU to sort the numbers even faster, which could give around 3x better performance; worth a try: http://adnanboz.wordpress.com/2011/07/27/faster-sorting-in-c-by-utilizing-gpu-with-nvidia-cuda/

P.S. There are a few more tricks that can bring very substantial performance gains:

1) Remember the concept of false cache sharing - http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
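To make the false-sharing point concrete, here is a minimal sketch (not from the original answer) of per-thread counters padded apart so that no two threads write to the same cache line; the 64-byte line size is an assumption, typical for x86:

```csharp
using System;
using System.Threading;

class FalseSharingDemo
{
    // 8 longs * 8 bytes = 64 bytes, one assumed cache line per slot.
    const int CacheLineLongs = 8;

    static void Main()
    {
        const int threads = 4;
        const int iters = 1_000_000;

        // One counter per thread, a full cache line apart, so the
        // threads' writes never invalidate each other's lines.
        long[] counters = new long[threads * CacheLineLongs];

        var ts = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            int slot = t * CacheLineLongs;
            ts[t] = new Thread(() =>
            {
                for (int i = 0; i < iters; i++)
                    counters[slot]++;
            });
            ts[t].Start();
        }
        foreach (var th in ts) th.Join();

        long total = 0;
        for (int t = 0; t < threads; t++)
            total += counters[t * CacheLineLongs];
        Console.WriteLine(total); // 4000000
    }
}
```

The per-thread histograms in the answers above avoid the problem the same way: each thread owns a whole separate array rather than adjacent elements of a shared one.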

2) Try using the stackalloc keyword and make sure ANY memory allocation is done on the stack. Trust me on this: any memory allocation is painfully slow unless it comes straight from the stack. We are talking about 5x differences.
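As a sketch of how stackalloc could apply here (assuming C# 7.2+ so stackalloc can target a Span without unsafe code; this is my illustration, not code from the answer), the 256-entry per-thread histogram fits comfortably on the stack:

```csharp
using System;

class StackallocHistogram
{
    static void Histogram(byte[] buffer, uint[] histo)
    {
        // Counts live on the stack: no GC allocation in the hot path.
        // stackalloc into a Span is zero-initialized by default.
        Span<uint> counts = stackalloc uint[256];

        foreach (byte b in buffer)
            counts[b]++;

        // Merge the stack-local counts into the shared histogram.
        for (int i = 0; i < 256; i++)
            histo[i] += counts[i];
    }

    static void Main()
    {
        var buf = new byte[] { 1, 1, 2, 255 };
        var histo = new uint[256];
        Histogram(buf, histo);
        Console.WriteLine($"{histo[1]} {histo[2]} {histo[255]}"); // 2 1 1
    }
}
```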

3) You can use C# Mono.Simd to try to SUM the different arrays (the linked example, "Adding 2 arrays together quickly", is the C++ version, but the concept applies to C#).
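For the array-summing idea, here is a hedged sketch using System.Numerics.Vector&lt;T&gt; rather than Mono.Simd (a swap on my part: Vector&lt;T&gt; is the portable SIMD type the JIT maps to SSE/AVX where available); it adds one array into another a vector's worth of lanes at a time, which is exactly the final merge step of the per-thread histograms:

```csharp
using System;
using System.Numerics;

class SimdSum
{
    // Adds b into a element-wise, Vector<int>.Count lanes per step.
    static void Add(int[] a, int[] b)
    {
        int i = 0;
        int width = Vector<int>.Count;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<int>(a, i);
            var vb = new Vector<int>(b, i);
            (va + vb).CopyTo(a, i);
        }
        for (; i < a.Length; i++) // scalar tail for the leftovers
            a[i] += b[i];
    }

    static void Main()
    {
        var a = new int[10];
        var b = new int[10];
        for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 10 * i; }
        Add(a, b);
        Console.WriteLine(string.Join(",", a)); // 0,11,22,33,44,55,66,77,88,99
    }
}
```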