Parallel for loop in OpenMP


You should make use of the OpenMP reduction clause for x and y:

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

With reduction, each thread accumulates its own partial sums in x and y, and at the end all partial values are summed together to obtain the final values.
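If you are curious what that is roughly equivalent to, here is a hand-rolled sketch (illustrative only: x_local and y_local are names I made up, and the function signature just packages the n, m, shifts, x and y from your code):

#include <cmath>

void accumulate(int n, int m, double *shifts, double &x, double &y)
{
    #pragma omp parallel
    {
        // Each thread accumulates into its own private partial sums.
        double x_local = 0.0, y_local = 0.0;
        #pragma omp for
        for (int j = 0; j < n; j++) {
            double r = 0.0;
            for (int i = 0; i < m; i++) {
                double rand_g1 = cos(i / double(m));
                double rand_g2 = sin(i / double(m));
                x_local += rand_g1;
                y_local += rand_g2;
                r += sqrt(rand_g1 * rand_g1 + rand_g2 * rand_g2);
            }
            shifts[j] = r / m;
        }
        // Combine the partials into the shared variables,
        // one thread at a time.
        #pragma omp critical
        {
            x += x_local;
            y += y_local;
        }
    }
}

The reduction clause lets the runtime do this bookkeeping for you, usually with a more efficient combination step than a plain critical section.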

Serial version:

25.05s user 0.01s system 99% cpu 25.059 total

OpenMP version w/ OMP_NUM_THREADS=16:

24.76s user 0.02s system 1590% cpu 1.559 total

See - superlinear speed-up :)


Let's try to understand how to parallelize a simple for loop using OpenMP.

#pragma omp parallel
#pragma omp for
for(i = 1; i < 13; i++)
{
    c[i] = a[i] + b[i];
}

Assume that we have 3 available threads; this is what will happen:

[Figure: the 12 iterations are split among the 3 threads, each thread getting its own contiguous chunk]

firstly

  • Threads are assigned an independent set of iterations

and finally

  • Threads must wait at the end of the work-sharing construct (a runnable sketch of this follows below)
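To make both points concrete, here is a minimal compilable version of the example above (a sketch; omp_set_num_threads(3) just pins the run to the 3 threads discussed):

#include <omp.h>
#include <cstdio>

int main()
{
    int a[13], b[13], c[13];
    for (int i = 0; i < 13; i++) { a[i] = i; b[i] = 2 * i; }

    omp_set_num_threads(3); // the 3 available threads of the example
    #pragma omp parallel
    #pragma omp for
    for (int i = 1; i < 13; i++)
    {
        c[i] = a[i] + b[i];
        printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
    }
    // Implicit barrier at the end of the work-sharing construct:
    // no thread continues until all iterations are done.
    return 0;
}

Compiled with -fopenmp, each run prints which thread executed which iteration; the exact assignment depends on the schedule.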


Because this question is highly viewed, I decided to add a bit of OpenMP background to help those visiting it.


The #pragma omp parallel creates a parallel region with a team of threads, where each thread executes the entire block of code that the parallel region encloses. From the OpenMP 5.1 standard one can read a more formal description:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region (..). The thread that encountered the parallel construct becomes the primary thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the primary thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.

The #pragma omp parallel for creates a parallel region (as described before), and the iterations of the loop that it encloses will be assigned to the threads of that region, using the default chunk size and the default schedule, which is typically static. Bear in mind, however, that the default schedule might differ among different concrete implementations of the OpenMP standard.
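If you do not want to rely on the implementation default, you can spell the schedule out yourself; a small sketch (the chunk size of 4 is arbitrary, and a, b, c and n are assumed to be defined as in the examples here):

// Explicitly request a static schedule with chunks of 4 iterations;
// schedule(dynamic, 4) would instead hand chunks out on demand.
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}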

From the OpenMP 5.1 standard you can read a more formal description:

The worksharing-loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks. The iterations are distributed across threads that already exist in the team that is executing the parallel region to which the worksharing-loop region binds.

Moreover,

The parallel loop construct is a shortcut for specifying a parallel construct containing a loop construct with one or more associated loops and no other statements.

Or informally, #pragma omp parallel for is a combination of the construct #pragma omp parallel with #pragma omp for. In your case, this would mean that:

#pragma omp parallel for
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}

A team of threads will be created, and each of those threads will be assigned chunks of the iterations of the outermost loop.

To make it more illustrative, with 4 threads the #pragma omp parallel for with a chunk_size=1 and a static schedule would result in something like:

[Figure: with 4 threads, chunk_size=1 and a static schedule, iterations are handed out round-robin: thread 0 gets iterations 0, 4, 8, ...; thread 1 gets 1, 5, 9, ...; and so on]

Code-wise the loop would be transformed to something logically similar to:

for(int i=omp_get_thread_num(); i < n; i+=omp_get_num_threads()){
    c[i]=a[i]+b[i];
}

where omp_get_thread_num()

The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.

and omp_get_num_threads()

Returns the number of threads in the current team. In a sequential section of the program omp_get_num_threads returns 1.

or in other words, for(int i = THREAD_ID; i < n; i += TOTAL_THREADS), with THREAD_ID ranging from 0 to TOTAL_THREADS - 1, and TOTAL_THREADS representing the total number of threads in the team created in the parallel region.
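Incidentally, that round-robin distribution is exactly what an explicit schedule(static, 1) requests, so the transformation above can also be obtained with (a sketch):

// chunk_size = 1: iterations are dealt out round-robin,
// thread 0 gets 0, T, 2T, ..., thread 1 gets 1, T+1, ... (T = team size)
#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}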

Armed with this knowledge, and looking at your code, one can see that you have a race condition on the updates of the variables x and y. Those variables are shared among threads and updated inside the parallel region, namely:

x += rand_g1;
y += rand_g2;

To solve this race condition you can use OpenMP's reduction clause:

Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.

Informally, the reduction clause will create for each thread a private copy of the variables x and y, and at the end of the parallel region it will perform the summation of all those private copies into the original x and y variables of the initial thread.

#pragma omp parallel for reduction(+:x,y)
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        x += rand_g1;
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}
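For completeness: the race could also be removed by making each update atomic, but that serializes those updates on every iteration and is typically much slower than reduction; a sketch for contrast, not a recommendation:

#pragma omp parallel for
for (int j=0; j<n; j++) {
    double r=0.0;
    for (int i=0; i < m; i++){
        double rand_g1 = cos(i/double(m));
        double rand_g2 = sin(i/double(m));
        #pragma omp atomic     // each update is now race-free, but contended
        x += rand_g1;
        #pragma omp atomic
        y += rand_g2;
        r += sqrt(rand_g1*rand_g1 + rand_g2*rand_g2);
    }
    shifts[j] = r / m;
}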