TParallel.For performance TParallel.For performance multithreading multithreading

TParallel.For performance


The key for this problem is correct partitioning and sharing as little as possible.

With this code it runs almost 4 times faster than the serial one.

const   WorkerCount = 4;function GetWorker(index: Integer; const oddsArr: TArray<Integer>): TProc;var  min, max: Integer;begin  min := MaxArr div WorkerCount * index;  if index + 1 < WorkerCount then    max := MaxArr div WorkerCount * (index + 1) - 1  else    max := MaxArr - 1;  Result :=    procedure    var      i: Integer;      odds: Integer;    begin      odds := 0;      for i := min to max do        if Odd(ArrXY[i]) then          Inc(odds);      oddsArr[index] := odds;    end;end;procedure Parallel;var  i: Integer;  oddsArr: TArray<Integer>;  workers: TArray<ITask>;begin  odds := 0;  Ticks := TThread.GetTickCount;  SetLength(oddsArr, WorkerCount);  SetLength(workers, WorkerCount);  for i := 0 to WorkerCount-1 do    workers[i] := TTask.Run(GetWorker(i, oddsArr));  TTask.WaitForAll(workers);  for i := 0 to WorkerCount-1 do    Inc(odds, oddsArr[i]);  Ticks := TThread.GetTickCount - Ticks;  writeln('Parallel: ' + Ticks.ToString + 'ms, odds: ' + odds.ToString);end;

You can write similar code with the TParallel.For but it still runs a bit slower (like 3 times faster than serial) than just using TTask.

Btw I used the function to return the worker TProc to get the index capturing right. If you run it in a loop in the same routine you capture the loop variable.

Update 19.12.2014:

Since we found out the critical thing is correct partitioning this can be put into a parallel for loop really easily without locking it on a particular data structure:

procedure ParallelFor(lowInclusive, highInclusive: Integer;  const iteratorRangeEvent: TProc<Integer, Integer>);  procedure CalcPartBounds(low, high, count, index: Integer;    out min, max: Integer);  var    len: Integer;  begin    len := high - low + 1;    min := (len div count) * index;    if index + 1 < count then      max := len div count * (index + 1) - 1    else      max := len - 1;  end;  function GetWorker(const iteratorRangeEvent: TProc<Integer, Integer>;    min, max: Integer): ITask;  begin    Result := TTask.Run(      procedure      begin        iteratorRangeEvent(min, max);      end)  end;var  workerCount: Integer;  workers: TArray<ITask>;  i, min, max: Integer;begin  workerCount := TThread.ProcessorCount;  SetLength(workers, workerCount);  for i := 0 to workerCount - 1 do  begin    CalcPartBounds(lowInclusive, highInclusive, workerCount, i, min, max);    workers[i] := GetWorker(iteratorRangeEvent, min, max);  end;  TTask.WaitForAll(workers);end;procedure Parallel4;begin  odds := 0;  Ticks := TThread.GetTickCount;  ParallelFor(0, MaxArr-1,    procedure(min, max: Integer)    var      i, n: Integer;    begin      n := 0;      for i := min to max do        if Odd(ArrXY[i]) then          Inc(n);      AtomicIncrement(odds, n);    end);  Ticks := TThread.GetTickCount - Ticks;  writeln('ParallelEx: Stefan Glienke ' + Ticks.ToString + ' ms, odds: ' + odds.ToString);end;

The key thing is to use a local variable for the counting and only at the end use the shared variable one time to add the sub total.


With OmniThreadLibrary from the SVN (this is not yet including in any official release), you can write this in a way which doesn't require interlocked access to the shared counter.

function CountParallelOTL: integer;var  counters: array of integer;  numCores: integer;  i: integer;begin  numCores := Environment.Process.Affinity.Count;  SetLength(counters, numCores);  FillChar(counters[0], Length(counters) * SizeOf(counters[0]), 0);  Parallel.For(0, MaxArr - 1)    .NumTasks(numCores)    .Execute(      procedure(taskIndex, value: integer)      begin        if Odd(ArrXY[value]) then          Inc(counters[taskIndex]);      end);  Result := counters[0];  for i := 1 to numCores - 1 do    Inc(Result, counters[i]);end;

This, however, is still at best on par with the sequential loop and at worst a few times slower.

I have compared this with Stefan's solution (XE7 tasks) and with a simple XE7 Parallel.For with interlocked increment (XE7 for).

Results from my notebook with 4 hyperthreaded cores:

Serial: 49999640 odd elements found in 543 ms

Parallel (OTL): 49999640 odd elements found in 555 ms

Parallel (XE7 tasks): 49999640 odd elements found in 136 ms

Parallel (XE7 for): 49999640 odd elements found in 1667 ms

Results from my workstation with 12 hyperthreaded cores:

Serial: 50005291 odd elements found in 685 ms

Parallel (OTL): 50005291 odd elements found in 1309 ms

Parallel (XE7 tasks): 50005291 odd elements found in 62 ms

Parallel (XE7 for): 50005291 odd elements found in 3379 ms

There's a big improvement over System.Threading Paralell.For because there's no interlocked increment but the handcrafted solution is much much faster.

Full test program:

program ParallelCount;{$APPTYPE CONSOLE}{$R *.res}uses  System.SyncObjs,  System.Classes,  System.SysUtils,  System.Threading,  DSiWin32,  OtlCommon,  OtlParallel;const  MaxArr = 100000000;var  Ticks: Cardinal;  i: Integer;  odds: Integer;  ArrXY: array of Integer;procedure FillArray;var  i: Integer;  j: Integer;begin  SetLength(ArrXY, MaxArr);  for i := 0 to MaxArr-1 do    ArrXY[i]:=Random(MaxInt);end;function CountSerial: integer;var  odds: integer;begin  odds := 0;  for i := 0 to MaxArr-1 do      if Odd(ArrXY[i]) then        Inc(odds);  Result := odds;end;function CountParallelOTL: integer;var  counters: array of integer;  numCores: integer;  i: integer;begin  numCores := Environment.Process.Affinity.Count;  SetLength(counters, numCores);  FillChar(counters[0], Length(counters) * SizeOf(counters[0]), 0);  Parallel.For(0, MaxArr - 1)    .NumTasks(numCores)    .Execute(      procedure(taskIndex, value: integer)      begin        if Odd(ArrXY[value]) then          Inc(counters[taskIndex]);      end);  Result := counters[0];  for i := 1 to numCores - 1 do    Inc(Result, counters[i]);end;function GetWorker(index: Integer; const oddsArr: TArray<Integer>; workerCount: integer): TProc;var  min, max: Integer;begin  min := MaxArr div workerCount * index;  if index + 1 < workerCount then    max := MaxArr div workerCount * (index + 1) - 1  else    max := MaxArr - 1;  Result :=    procedure    var      i: Integer;      odds: Integer;    begin      odds := 0;      for i := min to max do        if Odd(ArrXY[i]) then          Inc(odds);      oddsArr[index] := odds;    end;end;function CountParallelXE7Tasks: integer;var  i: Integer;  oddsArr: TArray<Integer>;  workers: TArray<ITask>;  workerCount: integer;begin  workerCount := Environment.Process.Affinity.Count;  odds := 0;  Ticks := TThread.GetTickCount;  SetLength(oddsArr, workerCount);  SetLength(workers, workerCount);  for i := 0 to workerCount-1 do    workers[i] := TTask.Run(GetWorker(i, oddsArr, workerCount));  TTask.WaitForAll(workers);  for i := 0 to workerCount-1 do    Inc(odds, oddsArr[i]);  Result := odds;end;function CountParallelXE7For: integer;var  odds: integer;begin  odds := 0;  TParallel.For(0,  MaxArr-1, procedure(I:Integer)  begin    if Odd(ArrXY[i]) then      TInterlocked.Increment(odds);  end);  Result := odds;end;procedure Count(const name: string; func: TFunc<integer>);var  time: int64;  cnt: integer;begin  time := DSiTimeGetTime64;  cnt := func();  time := DSiElapsedTime64(time);  Writeln(name, ': ', cnt, ' odd elements found in ', time, ' ms');end;begin  try    FillArray;    Count('Serial', CountSerial);    Count('Parallel (OTL)', CountParallelOTL);    Count('Parallel (XE7 tasks)', CountParallelXE7Tasks);    Count('Parallel (XE7 for)', CountParallelXE7For);    Readln;  except    on E: Exception do      Writeln(E.ClassName, ': ', E.Message);  end;end.


I think we discussed this before regarding OmniThreadLibrary. The main cause for the time being longer for the multithreaded solution is the overhead of TParallel.For compared to the time needed for the actual calculation.

A local variable won't be of any help here, while a global threadvar might solve the false sharing issue. Alas, you might not find a way to sum up all these treadvars after finishing the loop.

IIRC, the best approach is to chop the task in reasonable parts and work on a range of array entries for each iteration and increments a variable dedicated to that part. That alone won't solve the false sharing problem as that occurs even with distinct variables if they happen to be just part of the same cache line.

Another solution could be to write a class that handles a given slice of the array in a serial manner, act on multiple instances of this class in parallel and evaluate the results afterwards.

BTW: your code doesn't count the odds - it counts the evens.

And: there is a built-in function named Odd that usually is of better performance than the mod code you are using.