
OpenMP: overview and resource guide

Last updates: Tue May 8 19:16:06 2001; Fri Nov 12 15:26:10 2004; Thu Nov 13 18:30:20 2008; Mon Mar 1 16:28:36 2010

OpenMP is a relatively new (1997) development in parallel computing. It is a language-independent specification of multithreading, and implementations are available from several vendors, including Compaq/DEC, IBM, PGI, SGI, and Sun.

OpenMP is implemented as structured comments in Fortran, and as pragmas in C and C++, so that its presence is invisible to compilers lacking OpenMP support. Thus, you can develop code that will run everywhere, and that will run even faster when OpenMP is available.
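In C and C++, for example, the directive is a pragma, which a compiler without OpenMP support silently skips, compiling an ordinary sequential loop instead. The following fragment is a hedged illustration (the array names and sizes are invented, not taken from any of the papers cited here):

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];   /* static arrays are zero-initialized */

        /* Without OpenMP, this pragma is ignored and the loop runs
           sequentially; with OpenMP, the iterations are divided
           among the available threads. */
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        printf("a[0] = %g\n", a[0]);
        return 0;
    }

In Fortran, the corresponding directive is written as a comment (e.g., !$OMP PARALLEL DO), so compilers without OpenMP support ignore it entirely.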

The OpenMP Consortium maintains a very useful Web site at http://www.openmp.org/, with links to vendors and resources.

There is an excellent overview of the advantages of OpenMP over POSIX threads (pthreads) and PVM/MPI in the paper OpenMP: A Proposed Industry Standard API for Shared Memory Processors, also available in HTML and PDF. This is a must-read if you are getting started in parallel programming. It contains two simple examples programmed with OpenMP, pthreads, and MPI.

The paper also gives a very convenient tabular comparison of OpenMP directives with Silicon Graphics parallelization directives.

OpenMP can be used on uniprocessor and multiprocessor systems with shared memory. It can also be used in programs that run in homogeneous or heterogeneous distributed memory environments, which are typically supported by systems like Linda, MPI, and PVM, although the OpenMP part of the code will provide parallelization only on those processors that share memory.

In distributed memory environments, the programmer must manually partition data between processors and make special library calls to move the data back and forth. While that kind of code can also be used on shared memory systems, OpenMP is much simpler to program. Thus, you can start parallelizing an application with OpenMP, and later add MPI or PVM calls: the two forms of parallelization can peacefully coexist in your program, as the sketch below illustrates.
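As a rough sketch of that coexistence (this is not code from any of the cited papers; the array contents and sizes are invented), an MPI layer can combine per-process results while OpenMP threads share the work within each process:

    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 1000000   /* elements owned by each MPI process */

    int main(int argc, char **argv)
    {
        static double a[NLOCAL];
        double local = 0.0, global = 0.0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < NLOCAL; i++)   /* fill this process's slice */
            a[i] = (double)rank + 1.0e-6 * (double)i;

        /* OpenMP parallelizes the local sum within each MPI process... */
    #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < NLOCAL; i++)
            local += a[i];

        /* ...and MPI combines the per-process partial sums. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }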

An extensive bibliography on multithreading, including OpenMP, is available at http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading. MPI and PVM are covered in a separate bibliography: http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm

OpenMP benchmark: computation of pi

This simple benchmark for the computation of pi is taken from the paper above. Its read statement has been modified to read from stdin instead of the non-redirectable /dev/tty, and an extra final print statement has been added to show an accurate value of pi.

Follow this link for the source code, a shell script to run the benchmark, a UNIX Makefile, and a small awk program to extract the timing results for inclusion in tables like the ones below.
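The linked source code is in Fortran. For readers who want a self-contained starting point, here is a minimal C sketch of the same computation (a midpoint-rule quadrature of 4/(1+x*x) on [0,1], which converges to pi); only the single OpenMP pragma on the loop is essential, and the variable names are my own:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        long n;                 /* number of iterations, read from stdin */
        double sum = 0.0;

        if (scanf("%ld", &n) != 1 || n < 1)
            return 1;

        double h = 1.0 / (double)n;

        /* Each thread accumulates a private partial sum, and OpenMP
           combines the partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; i++)
        {
            double x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        printf("computed pi = %.15f\n", h * sum);
        printf("accurate pi = %.15f\n", 4.0 * atan(1.0));
        return 0;
    }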

Here is a table of compiler options needed to enable OpenMP directives during compilation:
    Vendor       Compiler   Option(s)
    Compaq/DEC   f90        -omp
    Compaq/DEC   f95        -omp
    IBM          xlf90_r    -qsmp=omp -qfixed
    IBM          xlf95_r    -qsmp=omp -qfixed
    PGI          pgf77      -mp
    PGI          pgf90      -mp
    PGI          pgcc       -mp
    PGI          pgCC       -mp
    SGI          f77        -mp
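For example, on the SGI system the benchmark might be built and run like this (the file names pi.f and pi.in are placeholders, not the actual names in the linked distribution):

    f77 -mp -O pi.f -o pi
    ./pi < pi.in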

Once you have compiled with OpenMP support, the executable may still not run multithreaded unless you first set an environment variable that defines the number of threads to use. On most of the above systems, this variable is called OMP_NUM_THREADS. It has no effect on the IBM systems; I'm still trying to find out what is expected there.
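One way to check what a given system actually does is a tiny test program; this hedged sketch uses the standard omp_get_num_threads() library call to report the team size actually granted:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Inside a parallel region, omp_get_num_threads() reports the
           number of threads in the team, which should normally match
           OMP_NUM_THREADS. */
    #pragma omp parallel
        {
    #pragma omp master
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

Running it as, say, OMP_NUM_THREADS=4 ./a.out (Bourne-family shells) should then report four threads.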

When the Compaq/DEC benchmark below was run, there was one other single-CPU-bound process on the machine, so we should expect to have only 3 available CPUs. As the number of threads increases beyond the number of available CPUs, we should expect a performance drop, unless those threads have idle time, such as from I/O activity; for this simple benchmark, the loop is completely CPU bound. In the tables below, speedup is the one-thread wall-clock time divided by the n-thread wall-clock time. Evidently, 3 threads make almost perfect use of the machine, at a cost of only two simple OpenMP directives added to the original scalar program.

[Plot of Compaq/DEC Alpha 4100-5/466 speedup]
Compaq/DEC Alpha 4100-5/466: Four 466 MHz CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 8.310 1.000
2 4.030 2.062
3 2.780 2.989
4 2.130 3.901
5 3.470 2.395
6 2.930 2.836
7 2.520 3.298
8 2.280 3.645
[Plot of Intel Pentium-III/600 speedup]
Intel Pentium III: Two 600 MHz CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 6.210 1.000
2 3.110 1.997
3 4.000 1.552
4 4.390 1.415
[Plot of SGI Origin 200 speedup]
SGI Origin 200: Four 195 MHz R10000 CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 28.61 1.000
2 14.33 1.997
3 9.61 2.977
4 7.63 3.750
5 9.79 2.922
6 9.80 2.919
7 9.85 2.905
8 13.15 2.176

The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.

The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread shrinks, and eventually communication and scheduling overhead dominates computation; consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency (speedup divided by thread count), the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.

[Plot of SGI Origin 2000 speedup]
SGI Origin 2000: Sixty-four 195 MHz R10000 CPUs
300,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 32.651 1.000
2 16.348 1.997
3 10.943 2.984
4 8.272 3.947
5 7.178 4.549
6 5.794 5.635
7 4.927 6.627
8 4.446 7.344
9 4.021 8.120
10 3.577 9.128
11 3.409 9.578
12 3.021 10.808
13 2.928 11.151
14 2.645 12.344
15 2.493 13.097
16 2.414 13.526
17 2.208 14.788
18 2.170 15.047
19 2.051 15.920
20 2.051 15.920
21 2.082 15.683
22 1.791 18.231
23 1.824 17.901
24 2.457 13.289
25 2.586 12.626
26 3.134 10.418
27 5.200 6.279
28 5.454 5.987
29 3.431 9.516
30 2.427 13.453
31 3.021 10.808
32 2.418 13.503
33 5.092 6.412
34 7.601 4.296
35 8.790 3.715
36 6.369 5.127
37 6.232 5.239
38 5.588 5.843
39 6.470 5.047
40 7.166 4.556
41 6.218 5.251
42 7.450 4.383
43 6.298 5.184
44 6.475 5.043
45 15.411 2.119
46 7.466 4.373
47 8.293 3.937
48 6.872 4.751
49 8.884 3.675
50 8.006 4.078
51 9.614 3.396
52 25.223 1.294
53 10.789 3.026
54 32.958 0.991
55 35.816 0.912
56 36.213 0.902
57 8.301 3.933
58 11.487 2.842
59 71.526 0.456
60 10.361 3.151
61 52.518 0.622
62 33.081 0.987
63 32.493 1.005
64 95.322 0.343
[Plot of Compaq AlphaServer ES40 DEC6600/500 speedup]
Compaq AlphaServer ES40 DEC6600/500
(4 EV6 21264 CPUs, 500 MHz, 4GB RAM)
OSF/1 4.0F
1,000,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 26.470 1.000
2 13.260 1.996
3 8.840 2.994
4 6.650 3.980
5 8.080 3.276
6 6.770 3.910
7 6.850 3.864
8 6.670 3.969
9 7.200 3.676
10 7.130 3.712
11 7.120 3.718
12 6.690 3.957
13 7.180 3.687
14 7.300 3.626
15 7.170 3.692
16 6.710 3.945
[Plot of Compaq AlphaServer ES40 Sierra/667 speedup]
Compaq AlphaServer ES40 Sierra/667
(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM)
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 2.500 1.000
2 1.600 1.562
3 1.300 1.923
4 1.500 1.667
5 2.000 1.250
6 2.000 1.250
7 1.800 1.389
8 1.200 2.083
9 1.500 1.667
10 1.900 1.316
11 1.900 1.316
12 1.900 1.316
13 3.200 0.781
14 2.400 1.042
15 1.900 1.316
16 2.200 1.136
17 1.900 1.316
18 1.800 1.389
19 2.100 1.190
20 1.600 1.562
21 2.600 0.962
22 1.500 1.667
23 1.800 1.389
24 1.600 1.562
25 1.500 1.667
26 2.100 1.190
27 1.800 1.389
28 1.700 1.471
29 2.200 1.136
30 2.400 1.042
31 2.100 1.190
32 2.500 1.000
33 2.500 1.000
34 1.900 1.316
35 1.800 1.389
36 2.500 1.000
37 1.600 1.562
38 1.600 1.562
39 2.200 1.136
40 2.500 1.000
41 2.200 1.136
42 1.500 1.667
43 3.100 0.806
44 2.400 1.042
45 2.500 1.000
46 2.400 1.042
47 2.500 1.000
48 1.600 1.562
49 3.300 0.758
50 2.200 1.136
51 2.600 0.962
52 3.200 0.781
53 2.400 1.042
54 1.800 1.389
55 3.000 0.833
56 4.900 0.510
57 1.800 1.389
58 2.700 0.926
59 3.100 0.806
60 2.700 0.926
61 3.600 0.694
62 3.000 0.833
63 2.300 1.087
64 3.700 0.676
Sun SPARC Enterprise T5240
(two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM)
Solaris 10
10^8 (100,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^8 iterations]
10^9 (1,000,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^9 iterations]
10^10 (10,000,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^10 iterations]
Test machine for benchmarking (vendor withheld)
(4 CPUs, 16 threads/CPU) GNU/Linux
[Plot of test machine speedup]
