How do the C++ STL (ExecutionPolicy) algorithms determine how many parallel threads to use?

Most of these questions cannot be answered by the standard as it stands today. However, your question, as I understand it, mixes two separate concepts:

C1. Constraints on parallel algorithms

C2. Execution of algorithms

All of the C++17 parallel STL machinery is about C1: it sets constraints on how instructions and/or threads may be interleaved or transformed in a parallel computation. C2, on the other hand, is still in the process of being standardized; the keyword there is executor (more on this later).

For C1, there are three standard policies (std::execution::seq, std::execution::par and std::execution::par_unseq) that cover the combinations of task and instruction parallelism. For example, when accumulating integers, par_unseq can be used, since the order of operations does not matter. For floating-point arithmetic, however, where addition is not associative, seq is a better fit if you want, at the very least, a reproducible result. In short: policies set constraints on the parallel computation, and those constraints can potentially be exploited by a smart compiler.

On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization/transformation), an executor decides how to run it. There are default executors (for the CPU, for example), or you can create your own; it is at this level that configuration such as the number of threads, the workload distribution, the processing unit, etc. can be set.

As of today, C1 is in the standard but C2 is not, so if you use C1 with a compliant compiler you will not be able to specify which execution profile you want; the library implementation decides for you (possibly configurable through extensions).

So, to address your questions:

(Regarding your first 5 questions) By design, the C++17 parallel STL does not define how the computation is carried out, only the data dependencies, precisely to allow for data-flow transformations. All of these questions will (hopefully) be answered by executors; you can see the current executors proposal. It will look something like:

    executor = get_executor();
    sort(std::execution::par.on(executor), vec.begin(), vec.end());

Some of your questions are already defined in that proposal.

(For the 6th) There are a number of libraries out there that already implement similar concepts (the C++ executor design was, in fact, inspired by some of them); AFAIK: HPX, Thrust and Boost.Compute. I do not know how the last two are actually implemented, but HPX uses lightweight threads and lets you configure the execution profile. Also, the expected (not yet standardized) C++17 syntax shown above is essentially the same as (and was heavily inspired by) HPX's.

References:

  1. C++17 Parallel Algorithms and Beyond by Bryce Adelstein Lelbach
  2. The future of ISO C++ Heterogeneous Computing by Michael Wong
  3. Keynote C++ executors to enable heterogeneous computing in tomorrow's C++ today by Michael Wong
  4. Executors for C++ - A Long Story by Detlef Vollmann


The pre-final C++17 draft says nothing about "how to implement the multi-threaded algorithms"; that's true. Implementation owners decide on their own how to do it. E.g., Parallel STL uses TBB as a threading back-end and OpenMP as a vectorization back-end. To find out how a given implementation matches your machine, you need to read the implementation-specific documentation.