We are working with the following code:
int i, j, k;
for (i = 2; i < n; i++) {             // S1
    for (j = 3; j < n - 3; j++) {     // S2
        for (k = 4; k < n - 4; k++) { // S3
            A[i][j][k] =              // T1
                A[i+1][j-3][k-4] +    // T2
                A[i-1][j][k+2] +      // T3
                A[i][j][k-1];         // T4
        }
    }
}
The objective is to parallelise as many loops as possible and explain why it's not possible to parallelise the others.
From a theoretical standpoint, when analysing the code using the Bernstein Conditions, we came to the conclusion that only S2 can be parallelised, while S1 and S3 carry dependences.
However, when testing with OpenMP, we saw a significant speedup of the runtime when parallelising either S1 or S2 (suggesting that both of them should be parallelisable). Furthermore, when we experimented with the indices in the different terms (T1 - T4), we found that S1 could be parallelised every time, even when the Bernstein Conditions weren't met, and that S3 could never be parallelised, even when the Bernstein Conditions were met.
The question is: how exactly do the OpenMP pragmas work, and in which cases can we expect to see a speedup? Are the Bernstein Conditions not directly linked to them? Which theoretical construct could be used to test parallelisability using OpenMP?
When parallelising S1 (using
#pragma omp parallel for shared(A, n) private (i, j, k))
we saw a significant speedup in runtime (at least twice as fast) even though the Bernstein Conditions are not met.
When parallelising S2 (using
#pragma omp parallel for shared(A, n) private (j, k))
we got a similar result but this was expected since the Bernstein Conditions are met for S2.
When parallelising S3 (using
#pragma omp parallel for shared(A, n), private(k))
the runtime exploded, leading us to conjecture that S3 was not parallelisable and that OpenMP had to artificially introduce barriers so that the result wouldn't be different.
We then also changed the calculation to:
A[i][j][k] = A[i-1][j][k] + A[i-1][j][k+2] + A[i][j][k-1];
which should obviously break the Bernstein Conditions for S1. However, we found that when parallelising S1 with the same pragma as above, the code still executed twice as fast without producing a different result than a serial implementation, which would suggest that S1 is still parallelisable.
Similarly, we changed the calculation again to:
A[i][j][k] = A[i+1][j-3][k-4] + A[i-1][j][k+2] + A[i+2][j-3][k-1];
to meet the Bernstein Conditions for S3. However, we found that when parallelising S3 with the same pragma as above, the code still executed about 10 times slower than a serial implementation, which would suggest that S3 still isn't parallelisable.
3 Answers
TL;DR: measured absolute or relative execution time does not tell you anything about the correctness of a parallel (or serial) program.
The question is: how exactly do the OpenMP pragmas work, and in which cases can we expect to see a speedup?
The OpenMP specifications describe in great detail the semantics of the OpenMP pragmas. For a more specific answer, you need to ask a more specific question.
In which cases you can expect to see a speedup is a much more difficult question, and not particularly specific to OpenMP. This is the regime of Amdahl's law, which tells you that the overall speedup depends in part on how much of the program's work is actually performed in parallel.
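As a back-of-the-envelope aid (illustrative, not taken from the answer), Amdahl's bound can be written as a one-liner: if a fraction p of the serial runtime is parallelizable and N threads share that fraction, the speedup can be at most 1 / ((1 - p) + p / N).

/* Illustrative helper, not from the answer: upper bound on the speedup
   with N threads when a fraction p of the serial runtime is parallelizable. */
double amdahl_speedup(double p, int N)
{
    return 1.0 / ((1.0 - p) + p / (double)N);
}
/* e.g. amdahl_speedup(0.90, 4) ~ 3.1 and amdahl_speedup(0.90, 1000) ~ 9.9:
   the serial 10% limits the speedup no matter how many threads are used. */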
It also depends on how much overhead the parallelization adds to the computation, where overhead comes from work and delays such as the following (a rough way to get a feel for this overhead is sketched just after the list):
- creating new threads
- context switching
- synchronization delays
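A quick, admittedly crude way to put a number on that per-region cost (an illustrative micro-benchmark, not part of the original answer) is to time a nearly empty parallel-for over and over; whatever it reports is overhead that every instance of a parallelized loop has to amortize:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int reps = 100000;
    double t0 = omp_get_wtime();
    for (int rep = 0; rep < reps; rep++) {
        /* the body does essentially nothing, so the time measured is
           dominated by creating/waking the team, distributing the
           iterations, and the implicit barrier at the end */
        #pragma omp parallel for
        for (int x = 0; x < 64; x++) {
        }
    }
    double t1 = omp_get_wtime();
    printf("~%.2f microseconds per parallel-for instance\n",
           (t1 - t0) / reps * 1e6);
    return 0;
}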
The general wisdom is that for computationally bound workloads, splitting the work among more threads than you have compute units is not helpful, and that for a given number of threads, splitting the work more finely than necessary is counterproductive.
Note well that a correct parallel program can consume more wall time than a corresponding correct serial program, and the parallel version almost always consumes more total CPU time.
Are the Bernstein Conditions not directly linked to them? Which theoretical construct could be used to test parallelisability using OpenMP?
The Bernstein conditions are about whether a parallel computation is (or can be) correct, not about whether it produces speedup. OpenMP does not automatically provide any memory protection that would moot the Bernstein conditions.
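For reference, for two code fragments P1 and P2 with input (read) sets I1, I2 and output (write) sets O1, O2, Bernstein's conditions for executing them in parallel are

I1 ∩ O2 = ∅,   O1 ∩ I2 = ∅,   O1 ∩ O2 = ∅,

i.e. neither fragment reads what the other writes, and they do not write to the same locations.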
Which theoretical construct could be used to test parallelisability using OpenMP?
You cannot reliably test for parallelizability in the sense of performing an experiment. That is a matter of algorithmic and code analysis. What you can meaningfully test is differences in speed among correct serial and parallel implementations of the same computation. How fast an incorrect program runs is not an interesting data point -- not even if it happens to or seems to produce a result that you consider correct.
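For what it's worth, here is a minimal sketch of how one might keep the two questions separate in an experiment. All of the names (run_serial, run_parallel, results_match, compare) are illustrative, not from the question or this answer; the two run_* functions are assumed to implement the loop nest without and with the OpenMP pragma, each filling its own copy of the array from identical initial data.

#include <math.h>
#include <stdio.h>
#include <omp.h>

/* assumed to exist elsewhere: the loop nest without and with the pragma */
void run_serial(int n, double A[n][n][n]);
void run_parallel(int n, double A[n][n][n]);

/* element-by-element comparison against the serial reference */
static int results_match(int n, double A[n][n][n], double B[n][n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                if (fabs(A[i][j][k] - B[i][j][k]) > 1e-9)
                    return 0;
    return 1;
}

void compare(int n, double A_ref[n][n][n], double A_par[n][n][n])
{
    double t0 = omp_get_wtime();
    run_serial(n, A_ref);
    double t1 = omp_get_wtime();
    run_parallel(n, A_par);
    double t2 = omp_get_wtime();

    printf("serial %.3fs, parallel %.3fs, results %s\n",
           t1 - t0, t2 - t1,
           results_match(n, A_ref, A_par) ? "match" : "DIFFER");
}

Even this is only a sanity check: a mismatch proves the parallel version is wrong, but a match on any particular run proves nothing, because the result of a program with a data race is unspecified.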
When parallelising S1 (using
#pragma omp parallel for shared(A, n) private (i, j, k))
we saw a significant speedup in runtime (at least twice as fast) even though the Bernstein Conditions are not met.
Irrelevant, because the program described is incorrect. In particular, the specs say this:
Two memory operations are considered unordered if the order in which they must complete, as seen by their affected threads, is not specified by the memory consistency guarantees listed in Section 1.4.6. If multiple threads write to the same memory unit (defined consistently with the above access considerations) then a data race occurs if the writes are unordered. Similarly, if at least one thread reads from a memory unit and at least one thread writes to that same memory unit then a data race occurs if the read and write are unordered. If a data race occurs then the result of the program is unspecified.
That's essentially the Bernstein conditions, and the S1 parallelization violates them: iteration i writes A[i][j][k], while iteration i+1 reads that same element through T3 (A[(i+1)-1][j][(k-2)+2] = A[i][j][k]), and with the i iterations spread across threads nothing orders that read with respect to the write. By all means, read the details if you like, but the code you describe does not engage any of the memory consistency mechanisms discussed in Section 1.4.6.
When parallelising S2 (using
#pragma omp parallel for shared(A, n) private (j, k))
we got a similar result but this was expected since the Bernstein Conditions are met for S2.
That appears as if it would yield a correct program. Again, however, that's an entirely separate question from whether you observe it to complete faster than another version of the program.
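For concreteness, a minimal sketch of that variant (assuming A is a three-dimensional array of a floating-point type and n its extent, as in the question); declaring the loop variables inside the for statements makes j and k private without an explicit private clause:

for (int i = 2; i < n; i++) {                 // S1: stays serial (carries dependences)
    #pragma omp parallel for shared(A, n)
    for (int j = 3; j < n - 3; j++) {         // S2: iterations independent for a fixed i
        for (int k = 4; k < n - 4; k++) {     // S3: stays serial (k depends on k-1)
            A[i][j][k] = A[i+1][j-3][k-4] + A[i-1][j][k+2] + A[i][j][k-1];
        }
    }
}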
When parallelising S3 (using
#pragma omp parallel for shared(A, n), private(k))
the runtime exploded, leading us to conjecture that S3 was not parallelisable and that OpenMP had to artificially introduce barriers so that the result wouldn't be different.
OpenMP does introduce a barrier at the end of each parallel region, but that doesn't make the program you describe correct. The dramatic increase in runtime does not speak to whether the program is parallelizable in the Bernstein sense or any other. It is mainly a consequence of making the tasks executed in parallel too fine grained: with the pragma on S3, the parallel region (and its implicit barrier) is entered once per (i, j) pair, i.e. roughly (n - 2) * (n - 6) times, and each instance only has one small k loop's worth of work to share, so the parallelization overhead swamps any gains from parallel execution.
We then also changed the calculation to: [...]
Nothing in the above analysis changes much with the modifications you describe. The first variant still has different S1 iterations writing and reading the same elements (iteration i writes A[i][j][k], which iteration i+1 reads as A[(i+1)-1][j][k]), so parallelizing S1 is still a data race whose output merely happens to look right. The second variant's k iterations really are independent, but parallelizing S3 still suffers from the same granularity problem, which is why it remains slow.
Your code requires substantial rewriting. As you correctly observe, the j loop has j depending only on j-3, so the j space can be divided into 3 sequential sections, starting at 3, 4, and 5 respectively. So you can get a factor of 3 speedup.
To reduce the number of parallel region instances and therefore the number of barriers, you should always try to parallelize the outermost loop. As you already observed, only the S2 loop can be parallelized. You can make S2 the outermost loop by swapping S1 and S2. As long as you make sure that all iterations with the same j % 3 are executed in order by the same thread, parallelizing the code is trivial (without restructuring the code as suggested by Victor):
#pragma omp parallel for schedule(static, 1) num_threads(3)
for (int j = 3; j < n - 3; j++) { // S2
for (int i = 2; i < n; i++) { // S1
for (int k = 4; k < n - 4; k++) { // S3
A[i][j][k] = // T1
A[i + 1][j - 3][k - 4] + // T2
A[i - 1][j][k + 2] + // T3
A[i][j][k - 1]; // T4
}
}
}
Compiling the code with ThreadSanitizer (clang -g -O3 -fopenmp -fsanitize=thread test.c) and executing the binary will tell you that this version has no race:
env TSAN_OPTIONS='ignore_noninstrumented_modules=1' ./a.out
Limiting the number of threads to two makes ThreadSanitizer report data races, because consecutive iterations of the same j % 3 class then end up on different threads:
env TSAN_OPTIONS='ignore_noninstrumented_modules=1' OMP_THREAD_LIMIT=2 ./a.out
OpenMP 6.0 introduced the strict modifier, which would prevent the possibility of this data race by using num_threads(strict:3) instead of num_threads(3). The execution would abort if the requested number of threads cannot be provided.
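With that modifier the worksharing directive from the code above would read as follows (a sketch, untested given the compiler-support caveat below):

#pragma omp parallel for schedule(static, 1) num_threads(strict: 3)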
I'm not sure whether any compiler already supports the strict modifier.
3 Comments
schedule(static, 1) defines a round-robin distribution of single iterations to the threads. For a loop with iteration values 0 to 9 and three threads, the first thread will execute iterations 0, 3, 6, and 9.
for (int i = 1; ... ) syntax to prevent mishaps with shared variables.
A[i-1][j][k+2] and A[i][j][k-1]. Is that intentional?