TL;DR: measured absolute or relative execution time does not tell you anything about the correctness of a parallel (or serial) program.
The question is: how exactly do the Open-MP pragmas work and in which cases can we expect to see a speedup?
The OpenMP specifications describe in great detail the semantics of the OpenMP pragmas. For a more specific answer, you need to ask a more specific question.
In which cases you can expect to see a speedup is a much more difficult question, and not particularly specific to OpenMP. This is the regime of Amdahl's law, which tells you that the overall speedup depends in part on how much of the program's work is actually performed in parallel.
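To put a rough number on that (this is the textbook formulation, nothing specific to your program): if a fraction p of the work can be executed in parallel across N threads, Amdahl's law bounds the overall speedup at speedup(N) <= 1 / ((1 - p) + p/N). With p = 0.8, for example, no number of threads can get you past a factor of 5.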
It also depends on how much overhead the parallelization adds to the computation, where overhead comes from work and delays such as
- creating new threads
- context switching
- synchronization delays
The general wisdom is that for computationally bound workloads, splitting the work among more threads than you have compute units is not helpful, and that for a given number of threads, splitting the work more finely than necessary is counterproductive.
Note well that a correct parallel program can consume more wall time than a corresponding correct serial program, and the parallel version almost always consumes more total CPU time.
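Here is a minimal sketch of both effects (made-up loop and problem size, not your code): it times a trivially small loop serially and in parallel, recording wall time with omp_get_wtime() and total CPU time with clock(), which on typical POSIX systems accumulates CPU time across all threads.

```c
#include <omp.h>
#include <stdio.h>
#include <time.h>

enum { N = 10000 };
static double a[N];

int main(void) {
    double wall0;
    clock_t cpu0;

    /* serial run */
    wall0 = omp_get_wtime();
    cpu0 = clock();
    for (int i = 0; i < N; i++)
        a[i] = (double)i / (i + 1.0);
    printf("serial:   wall %g s, cpu %g s\n",
           omp_get_wtime() - wall0, (double)(clock() - cpu0) / CLOCKS_PER_SEC);

    /* parallel run of the same trivial loop */
    wall0 = omp_get_wtime();
    cpu0 = clock();
    #pragma omp parallel for shared(a)
    for (int i = 0; i < N; i++)
        a[i] = (double)i / (i + 1.0);
    printf("parallel: wall %g s, cpu %g s\n",
           omp_get_wtime() - wall0, (double)(clock() - cpu0) / CLOCKS_PER_SEC);

    return 0;
}
```

With this little work per iteration, the parallel version frequently loses on wall time, and it essentially always reports more CPU time -- and neither number says anything about correctness.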
Are the Bernstein Conditions not directly linked to them? Which theoretical construct could be used to test parallelisability using Open-MP?
The Bernstein conditions are about whether a parallel computation is (or can be) correct, not about whether it produces speedup. OpenMP does not automatically provide any memory protection that would moot the Bernstein conditions.
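To make that concrete with a hypothetical loop (not your S1, whose body we haven't seen): OpenMP will happily apply a `parallel for` to iterations that are not independent; detecting the violation of the Bernstein conditions is entirely your job.

```c
/* Hypothetical example: iteration i reads a[i-1], which another thread may be
 * writing concurrently.  The flow dependence violates the Bernstein conditions,
 * yet the pragma is applied exactly as written and the result is unspecified. */
void running_sum_racy(double *a, int n) {
    #pragma omp parallel for shared(a, n)
    for (int i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```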
Which theoretical construct could be used to test parallelisability using Open-MP?
You cannot reliably test for parallelizability in the sense of performing an experiment; that is a matter of analyzing the algorithm and the code. What you can meaningfully test is differences in speed among correct serial and parallel implementations of the same computation. How fast an incorrect program runs is not an interesting data point -- not even if it happens to produce, or appears to produce, a result that you consider correct.
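If you do want an experimental workflow, the order of operations matters: first establish that the serial and parallel versions agree, then time them. A compact sketch (the computation here is a made-up stand-in, not your S1/S2/S3):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for your serial and parallel implementations:
 * both are supposed to compute the same result array.                 */
static void compute_serial(double *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (double)i * i;
}

static void compute_parallel(double *out, int n) {
    #pragma omp parallel for shared(out, n)
    for (int i = 0; i < n; i++)
        out[i] = (double)i * i;
}

int main(void) {
    enum { N = 1000000 };
    double *s = malloc(N * sizeof *s);
    double *p = malloc(N * sizeof *p);

    compute_serial(s, N);
    compute_parallel(p, N);

    /* Establish correctness first; only then is a timing comparison
     * between the two versions a meaningful experiment.              */
    for (int i = 0; i < N; i++) {
        if (fabs(s[i] - p[i]) > 1e-9) {
            fprintf(stderr, "results differ at %d -- timing is moot\n", i);
            return 1;
        }
    }
    puts("results match; now a timing comparison means something");

    free(s);
    free(p);
    return 0;
}
```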
When parallelising S1 (using `#pragma omp parallel for shared(A, n) private (i, j, k)`) we saw a significant speedup in runtime (at least twice as fast) even though the Bernstein Conditions are not met.
Irrelevant, because the program described is incorrect. In particular, the specs say this:
Two memory operations are considered unordered if the order in which they must complete, as seen by their affected threads, is not specified by the memory consistency guarantees listed in Section 1.4.6. If multiple threads write to the same memory unit (defined consistently with the above access considerations) then a data race occurs if the writes are unordered. Similarly, if at least one thread reads from a memory unit and at least one thread writes to that same memory unit then a data race occurs if the read and write are unordered. If a data race occurs then the result of the program is unspecified.
That's essentially the Bernstein conditions. By all means, read the details if you like, but the code you describe does not engage any of the memory consistency mechanisms discussed in section 1.4.6.
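For contrast, here is what it looks like when code does engage one of those mechanisms -- a hypothetical accumulation, not your S1, and note that a true loop-carried dependence (each iteration needing the previous iteration's result) cannot be repaired this way:

```c
double array_sum(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for shared(a, n, sum)
    for (int i = 0; i < n; i++) {
        /* The atomic construct orders the concurrent updates to sum, so the
         * writes no longer race; a reduction(+:sum) clause would be the
         * idiomatic and much faster way to write this particular loop.    */
        #pragma omp atomic
        sum += a[i];
    }
    return sum;
}
```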
When parallelising S2 (using `#pragma omp parallel for shared(A, n) private (j, k)`) we got a similar result but this was expected since the Bernstein Conditions are met for S2.
That appears as if it would yield a correct program. Again, however, that's an entirely separate question from whether you observe it to complete faster than another version of the program.
When parallelising S3 (using `#pragma omp parallel for shared(A, n), private(k)`) the runtime exploded leading us to conjecture that S3 was not parallelisable and Open-MP had to artificially introduce barriers so that the result wouldn't be different.
OpenMP does introduce a barrier at the end of each parallel region, but that doesn't make the program you describe correct. The dramatic increase in runtime does not speak to whether the program is parallelizable in the Bernstein sense or any other. It is mainly a consequence of making the tasks executed in parallel too fine grained, such that the parallelization overhead swamps any gains from parallel execution.
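We haven't seen your actual loop nest, but schematically the S3-style placement looks like this (hypothetical code, with a statement body chosen only so that the inner loop itself is race-free):

```c
/* The parallel region -- thread wake-up, work distribution, implicit barrier --
 * is set up n*n times, once per (i, j) pair, and each time it has only n tiny
 * iterations to share out, so the overhead dominates the useful work.          */
void s3_style(double **A, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            #pragma omp parallel for shared(A, n)
            for (int k = 0; k < n; k++) {
                A[j][k] += A[i][k];   /* stand-in for the real statement */
            }
        }
    }
}
```

Moving the pragma to the outermost loop that is actually safe to parallelize pays that overhead once instead of n*n times, which is usually the difference between an "exploded" runtime and a speedup.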
We then also changed the calculation to: [...]
Nothing in the above analysis changes substantially with the changes you describe to the computation.