We are working with the following code:
int i, j, k;
for (i = 2; i < n; i++) {             // S1
    for (j = 3; j < n - 3; j++) {     // S2
        for (k = 4; k < n - 4; k++) { // S3
            A[i][j][k] =              // T1
                A[i+1][j-3][k-4] +    // T2
                A[i-1][j][k+2] +      // T3
                A[i][j][k-1];         // T4
        }
    }
}
The objective is to parallelise as many loops as possible and explain why it's not possible to parallelise the others.
From a theoretical standpoint, when analysing the code using the Bernstein Conditions, we came to the conclusion that only S2 can be parallelised, while S1 and S3 carry dependences.
However, when testing with OpenMP, we saw a significant speedup of the runtime when parallelising either S1 or S2 (suggesting that both of them should be parallelisable). Furthermore, when we experimented with the indices in the different terms (T1 - T4), we found that S1 could be parallelised every time, even when the Bernstein Conditions weren't met, and that S3 could never be parallelised, even when the Bernstein Conditions were met.
The question is: how exactly do the OpenMP pragmas work, and in which cases can we expect to see a speedup? Are the Bernstein Conditions not directly linked to them? Which theoretical construct could be used to test parallelisability using OpenMP?
When parallelising S1 (using
#pragma omp parallel for shared(A, n) private (i, j, k))
we saw a significant speedup in runtime (at least twice as fast) even though the Bernstein Conditions are not met.
When parallelising S2 (using
#pragma omp parallel for shared(A, n) private (j, k))
we got a similar result but this was expected since the Bernstein Conditions are met for S2.
When parallelising S3 (using
#pragma omp parallel for shared(A, n), private(k))
the runtime exploded, leading us to conjecture that S3 was not parallelisable and that OpenMP had to artificially introduce barriers so that the result wouldn't be different.
We then also changed the calculation to:
A[i][j][k] = A[i-1][j][k] + A[i-1][j][k+2] + A[i][j][k-1];
which should obviously break the Bernstein Conditions for S1. However, we found that when parallelising S1 with the same pragma as above, the code still executed twice as fast without producing a different result than a serial implementation, which would suggest that S1 is still parallelisable.
Similarly, we changed the calculation again to:
A[i][j][k] = A[i+1][j-3][k-4] + A[i-1][j][k+2] + A[i+2][j-3][k-1];
to meet the Bernstein Conditions for S3. However, we found that when parallelising S3 with the same pragma as above, the code still executed about 10 times slower than a serial implementation, which would suggest that S3 still isn't parallelisable.
3 Answers
TL;DR: measured absolute or relative execution time does not tell you anything about the correctness of a parallel (or serial) program.
The question is: how exactly do the OpenMP pragmas work, and in which cases can we expect to see a speedup?
The OpenMP specifications describe in great detail the semantics of the OpenMP pragmas. For a more specific answer, you need to ask a more specific question.
In which cases you can expect to see a speedup is a much more difficult question, and not particularly specific to OpenMP. This is the regime of Amdahl's law, which tells you that the overall speedup depends in part on how much of the program's work is actually performed in parallel.
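As a back-of-the-envelope aid (illustrative, not taken from the answer), Amdahl's bound can be written as a one-liner: if a fraction p of the serial runtime is parallelizable and N threads share that fraction, the speedup can be at most 1 / ((1 - p) + p / N).

/* Illustrative helper, not from the answer: upper bound on the speedup
   with N threads when a fraction p of the serial runtime is parallelizable. */
double amdahl_speedup(double p, int N)
{
    return 1.0 / ((1.0 - p) + p / (double)N);
}
/* e.g. amdahl_speedup(0.90, 4) ~ 3.1 and amdahl_speedup(0.90, 1000) ~ 9.9:
   the serial 10% limits the speedup no matter how many threads are used. */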
It also depends on how much overhead the parallelization adds to the computation, where overhead comes from work and delays such as the following (a rough way to get a feel for this overhead is sketched just after the list):
- creating new threads
- context switching
- synchronization delays
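A quick, admittedly crude way to put a number on that per-region cost (an illustrative micro-benchmark, not part of the original answer) is to time a nearly empty parallel-for over and over; whatever it reports is overhead that every instance of a parallelized loop has to amortize:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int reps = 100000;
    double t0 = omp_get_wtime();
    for (int rep = 0; rep < reps; rep++) {
        /* the body does essentially nothing, so the time measured is
           dominated by creating/waking the team, distributing the
           iterations, and the implicit barrier at the end */
        #pragma omp parallel for
        for (int x = 0; x < 64; x++) {
        }
    }
    double t1 = omp_get_wtime();
    printf("~%.2f microseconds per parallel-for instance\n",
           (t1 - t0) / reps * 1e6);
    return 0;
}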
The general wisdom is that for computationally bound workloads, splitting the work among more threads than you have compute units is not helpful, and that for a given number of threads, splitting the work more finely than necessary is counterproductive.
Note well that a correct parallel program can consume more wall time than a corresponding correct serial program, and the parallel version almost always consumes more total CPU time.
Are the Bernstein Conditions not directly linked to them? Which theoretical construct could be used to test parallelisability using OpenMP?
The Bernstein conditions are about whether a parallel computation is (or can be) correct, not about whether it produces speedup. OpenMP does not automatically provide any memory protection that would moot the Bernstein conditions.
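For reference, for two code fragments P1 and P2 with input (read) sets I1, I2 and output (write) sets O1, O2, Bernstein's conditions for executing them in parallel are

I1 ∩ O2 = ∅,   O1 ∩ I2 = ∅,   O1 ∩ O2 = ∅,

i.e. neither fragment reads what the other writes, and they do not write to the same locations.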
Which theoretical construct could be used to test parallelisability using OpenMP?
You cannot reliably test for parallelizability in the sense of performing an experiment. That is a matter of algorithmic and code analysis. What you can meaningfully test is differences in speed among correct serial and parallel implementations of the same computation. How fast an incorrect program runs is not an interesting data point -- not even if it happens to or seems to produce a result that you consider correct.
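For what it's worth, here is a minimal sketch of how one might keep the two questions separate in an experiment. All of the names (run_serial, run_parallel, results_match, compare) are illustrative, not from the question or this answer; the two run_* functions are assumed to implement the loop nest without and with the OpenMP pragma, each filling its own copy of the array from identical initial data.

#include <math.h>
#include <stdio.h>
#include <omp.h>

/* assumed to exist elsewhere: the loop nest without and with the pragma */
void run_serial(int n, double A[n][n][n]);
void run_parallel(int n, double A[n][n][n]);

/* element-by-element comparison against the serial reference */
static int results_match(int n, double A[n][n][n], double B[n][n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                if (fabs(A[i][j][k] - B[i][j][k]) > 1e-9)
                    return 0;
    return 1;
}

void compare(int n, double A_ref[n][n][n], double A_par[n][n][n])
{
    double t0 = omp_get_wtime();
    run_serial(n, A_ref);
    double t1 = omp_get_wtime();
    run_parallel(n, A_par);
    double t2 = omp_get_wtime();

    printf("serial %.3fs, parallel %.3fs, results %s\n",
           t1 - t0, t2 - t1,
           results_match(n, A_ref, A_par) ? "match" : "DIFFER");
}

Even this is only a sanity check: a mismatch proves the parallel version is wrong, but a match on any particular run proves nothing, because the result of a program with a data race is unspecified.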
When parallelising S1 (using
#pragma omp parallel for shared(A, n) private (i, j, k))
we saw a significant speedup in runtime (at least twice as fast) even though the Bernstein Conditions are not met.
Irrelevant, because the program described is incorrect. In particular, the specs say this:
Two memory operations are considered unordered if the order in which they must complete, as seen by their affected threads, is not specified by the memory consistency guarantees listed in Section 1.4.6. If multiple threads write to the same memory unit (defined consistently with the above access considerations) then a data race occurs if the writes are unordered. Similarly, if at least one thread reads from a memory unit and at least one thread writes to that same memory unit then a data race occurs if the read and write are unordered. If a data race occurs then the result of the program is unspecified.
That's essentially the Bernstein conditions, and the S1 parallelization violates them: iteration i writes A[i][j][k], while iteration i+1 reads that same element through T3 (A[(i+1)-1][j][(k-2)+2] = A[i][j][k]), and with the i iterations spread across threads nothing orders that read with respect to the write. By all means, read the details if you like, but the code you describe does not engage any of the memory consistency mechanisms discussed in Section 1.4.6.
When parallelising S2 (using
#pragma omp parallel for shared(A, n) private (j, k))
we got a similar result but this was expected since the Bernstein Conditions are met for S2.
That appears as if it would yield a correct program. Again, however, that's an entirely separate question from whether you observe it to complete faster than another version of the program.
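For concreteness, a minimal sketch of that variant (assuming A is a three-dimensional array of a floating-point type and n its extent, as in the question); declaring the loop variables inside the for statements makes j and k private without an explicit private clause:

for (int i = 2; i < n; i++) {                 // S1: stays serial (carries dependences)
    #pragma omp parallel for shared(A, n)
    for (int j = 3; j < n - 3; j++) {         // S2: iterations independent for a fixed i
        for (int k = 4; k < n - 4; k++) {     // S3: stays serial (k depends on k-1)
            A[i][j][k] = A[i+1][j-3][k-4] + A[i-1][j][k+2] + A[i][j][k-1];
        }
    }
}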
When parallelising S3 (using
#pragma omp parallel for shared(A, n), private(k))
the runtime exploded, leading us to conjecture that S3 was not parallelisable and that OpenMP had to artificially introduce barriers so that the result wouldn't be different.
OpenMP does introduce a barrier at the end of each parallel region, but that doesn't make the program you describe correct. The dramatic increase in runtime does not speak to whether the program is parallelizable in the Bernstein sense or any other. It is mainly a consequence of making the tasks executed in parallel too fine grained: with the pragma on S3, the parallel region (and its implicit barrier) is entered once per (i, j) pair, i.e. roughly (n - 2) * (n - 6) times, and each instance only has one small k loop's worth of work to share, so the parallelization overhead swamps any gains from parallel execution.
We then also changed the calculation to: [...]
Nothing in the above analysis changes much with the modifications you describe. The first variant still has different S1 iterations writing and reading the same elements (iteration i writes A[i][j][k], which iteration i+1 reads as A[(i+1)-1][j][k]), so parallelizing S1 is still a data race whose output merely happens to look right. The second variant's k iterations really are independent, but parallelizing S3 still suffers from the same granularity problem, which is why it remains slow.
Your code requires substantial rewriting. As you correctly observe, the j loop has j depending only on j-3, so the j space can be divided into 3 sequential sections, starting at 3, 4, and 5 respectively. So you can get a factor of 3 speedup.
To reduce the number of parallel region instances and therefore the number of barriers, you should always try to parallelize the outermost loop. As you already observed, only the S2 loop can be parallelized. You can make S2 the outermost loop by swapping S1 and S2. As long as you make sure that all iterations with the same j % 3 are executed in order by the same thread, parallelizing the code is trivial (without restructuring the code as suggested by Victor):
#pragma omp parallel for schedule(static, 1) num_threads(3)
for (int j = 3; j < n - 3; j++) { // S2
for (int i = 2; i < n; i++) { // S1
for (int k = 4; k < n - 4; k++) { // S3
A[i][j][k] = // T1
A[i + 1][j - 3][k - 4] + // T2
A[i - 1][j][k + 2] + // T3
A[i][j][k - 1]; // T4
}
}
}
Compiling the code with ThreadSanitizer (clang -g -O3 -fopenmp -fsanitize=thread test.c) and executing the binary will tell you that this version has no race:
env TSAN_OPTIONS='ignore_noninstrumented_modules=1' ./a.out
Limiting the number of threads to two makes ThreadSanitizer report data races, because consecutive iterations of the same j % 3 class then end up on different threads:
env TSAN_OPTIONS='ignore_noninstrumented_modules=1' OMP_THREAD_LIMIT=2 ./a.out
OpenMP 6.0 introduced the strict modifier, which would prevent the possibility of this data race by using num_threads(strict:3) instead of num_threads(3). The execution would abort if the requested number of threads cannot be provided.
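With that modifier the worksharing directive from the code above would read as follows (a sketch, untested given the compiler-support caveat below):

#pragma omp parallel for schedule(static, 1) num_threads(strict: 3)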
I'm not sure whether any compiler already supports the strict modifier.
3 Comments
schedule(static, 1) defines a round-robin distribution of single iterations to the threads. For a loop with iteration values 0 to 9 and three threads, the first thread will execute iterations 0, 3, 6, and 9.
for (int i = 1; ... ) syntax to prevent mishaps with shared variables.
A[i-1][j][k+2] and A[i][j][k-1]. Is that intentional?