
OpenMP: overview and resource guide

Last updates: Tue May 8 19:16:06 2001; Fri Nov 12 15:26:10 2004; Thu Nov 13 18:30:20 2008; Mon Mar 1 16:28:36 2010

OpenMP is a relatively new (1997) development in parallel computing. It is a language-independent specification of multithreading, and implementations are available from several vendors, including Compaq/DEC, IBM, PGI, SGI, and Sun.

OpenMP is implemented as structured comments in Fortran, and as pragmas in C and C++, so that its presence is invisible to compilers lacking OpenMP support. Thus, you can develop code that will run everywhere, and that will run even faster when OpenMP is available.
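In C and C++, for example, the directive is a pragma, which a compiler without OpenMP support silently skips, compiling an ordinary sequential loop instead. The following fragment is a hedged illustration (the array names and sizes are invented, not taken from any of the papers cited here):

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];   /* static arrays are zero-initialized */

        /* Without OpenMP, this pragma is ignored and the loop runs
           sequentially; with OpenMP, the iterations are divided
           among the available threads. */
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        printf("a[0] = %g\n", a[0]);
        return 0;
    }

In Fortran, the corresponding directive is written as a comment (e.g., !$OMP PARALLEL DO), so compilers without OpenMP support ignore it entirely.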

The OpenMP Consortium maintains a very useful Web site at http://www.openmp.org/, with links to vendors and resources.

There is an excellent overview of the advantages of OpenMP over POSIX threads (pthreads) and PVM/MPI in the paper OpenMP: A Proposed Industry Standard API for Shared Memory Processors, also available in HTML and PDF. This is a must-read if you are getting started in parallel programming. It contains two simple examples programmed with OpenMP, pthreads, and MPI.

The paper also gives a very convenient tabular comparison of OpenMP directives with Silicon Graphics parallelization directives.

OpenMP can be used on uniprocessor and multiprocessor systems with shared memory. It can also be used in programs that run in homogeneous or heterogeneous distributed memory environments, which are typically supported by systems like Linda, MPI, and PVM, although the OpenMP part of the code will provide parallelization only on those processors that share memory.

In distributed memory environments, the programmer must manually partition data between processors and make special library calls to move the data back and forth. While that kind of code can also be used on shared memory systems, OpenMP is much simpler to program. Thus, you can start parallelizing an application with OpenMP, and later add MPI or PVM calls: the two forms of parallelization can peacefully coexist in your program, as the sketch below illustrates.
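As a rough sketch of that coexistence (this is not code from any of the cited papers; the array contents and sizes are invented), an MPI layer can combine per-process results while OpenMP threads share the work within each process:

    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 1000000   /* elements owned by each MPI process */

    int main(int argc, char **argv)
    {
        static double a[NLOCAL];
        double local = 0.0, global = 0.0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < NLOCAL; i++)   /* fill this process's slice */
            a[i] = (double)rank + 1.0e-6 * (double)i;

        /* OpenMP parallelizes the local sum within each MPI process... */
    #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < NLOCAL; i++)
            local += a[i];

        /* ...and MPI combines the per-process partial sums. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %g\n", global);

        MPI_Finalize();
        return 0;
    }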

An extensive bibliography on multithreading, including OpenMP, is available at http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading. MPI and PVM are covered in a separate bibliography: http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm

OpenMP benchmark: computation of pi

This simple benchmark for the computation of pi is taken from the paper above. Its read statement has been modified to read from stdin instead of the non-redirectable /dev/tty, and an extra final print statement has been added to show an accurate value of pi.

Follow this link for the source code, a shell script to run the benchmark, a UNIX Makefile, and a small awk program to extract the timing results for inclusion in tables like the ones below.
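The linked source code is in Fortran. For readers who want a self-contained starting point, here is a minimal C sketch of the same computation (a midpoint-rule quadrature of 4/(1+x*x) on [0,1], which converges to pi); only the single OpenMP pragma on the loop is essential, and the variable names are my own:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        long n;                 /* number of iterations, read from stdin */
        double sum = 0.0;

        if (scanf("%ld", &n) != 1 || n < 1)
            return 1;

        double h = 1.0 / (double)n;

        /* Each thread accumulates a private partial sum, and OpenMP
           combines the partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; i++)
        {
            double x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        printf("computed pi = %.15f\n", h * sum);
        printf("accurate pi = %.15f\n", 4.0 * atan(1.0));
        return 0;
    }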

Here is a table of compiler options needed to enable OpenMP directives during compilation:
    Vendor       Compiler   Option(s)
    Compaq/DEC   f90        -omp
    Compaq/DEC   f95        -omp
    IBM          xlf90_r    -qsmp=omp -qfixed
    IBM          xlf95_r    -qsmp=omp -qfixed
    PGI          pgf77      -mp
    PGI          pgf90      -mp
    PGI          pgcc       -mp
    PGI          pgCC       -mp
    SGI          f77        -mp
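For example, on the SGI system the benchmark might be built and run like this (the file names pi.f and pi.in are placeholders, not the actual names in the linked distribution):

    f77 -mp -O pi.f -o pi
    ./pi < pi.in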

Once you have compiled with OpenMP support, the executable may still not run multithreaded unless you first set an environment variable that defines the number of threads to use. On most of the above systems, this variable is called OMP_NUM_THREADS. It has no effect on the IBM systems; I'm still trying to find out what is expected there.
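One way to check what a given system actually does is a tiny test program; this hedged sketch uses the standard omp_get_num_threads() library call to report the team size actually granted:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Inside a parallel region, omp_get_num_threads() reports the
           number of threads in the team, which should normally match
           OMP_NUM_THREADS. */
    #pragma omp parallel
        {
    #pragma omp master
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

Running it as, say, OMP_NUM_THREADS=4 ./a.out (Bourne-family shells) should then report four threads.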

When the Compaq/DEC benchmark below was run, there was one other single-CPU-bound process on the machine, so we should expect to have only 3 available CPUs. As the number of threads increases beyond the number of available CPUs, we should expect a performance drop, unless those threads have idle time, such as from I/O activity; for this simple benchmark, the loop is completely CPU bound. In the tables below, speedup is the one-thread wall-clock time divided by the n-thread wall-clock time. Evidently, 3 threads make almost perfect use of the machine, at a cost of only two simple OpenMP directives added to the original scalar program.

[Plot of Compaq/DEC Alpha 4100-5/466 speedup]
Compaq/DEC Alpha 4100-5/466: Four 466 MHz CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 8.310 1.000
2 4.030 2.062
3 2.780 2.989
4 2.130 3.901
5 3.470 2.395
6 2.930 2.836
7 2.520 3.298
8 2.280 3.645
[Plot of Intel Pentium-III/600 speedup]
Intel Pentium III: Two 600 MHz CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 6.210 1.000
2 3.110 1.997
3 4.000 1.552
4 4.390 1.415
[Plot of SGI Origin 200 speedup]
SGI Origin 200: Four 195 MHz R10000 CPUs
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 28.61 1.000
2 14.33 1.997
3 9.61 2.977
4 7.63 3.750
5 9.79 2.922
6 9.80 2.919
7 9.85 2.905
8 13.15 2.176

The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.

The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread shrinks, and eventually communication and scheduling overhead dominates computation; consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency (speedup divided by thread count), the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.

[Plot of SGI Origin 2000 speedup]
SGI Origin 2000: Sixty-four 195 MHz R10000 CPUs
300,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 32.651 1.000
2 16.348 1.997
3 10.943 2.984
4 8.272 3.947
5 7.178 4.549
6 5.794 5.635
7 4.927 6.627
8 4.446 7.344
9 4.021 8.120
10 3.577 9.128
11 3.409 9.578
12 3.021 10.808
13 2.928 11.151
14 2.645 12.344
15 2.493 13.097
16 2.414 13.526
17 2.208 14.788
18 2.170 15.047
19 2.051 15.920
20 2.051 15.920
21 2.082 15.683
22 1.791 18.231
23 1.824 17.901
24 2.457 13.289
25 2.586 12.626
26 3.134 10.418
27 5.200 6.279
28 5.454 5.987
29 3.431 9.516
30 2.427 13.453
31 3.021 10.808
32 2.418 13.503
33 5.092 6.412
34 7.601 4.296
35 8.790 3.715
36 6.369 5.127
37 6.232 5.239
38 5.588 5.843
39 6.470 5.047
40 7.166 4.556
41 6.218 5.251
42 7.450 4.383
43 6.298 5.184
44 6.475 5.043
45 15.411 2.119
46 7.466 4.373
47 8.293 3.937
48 6.872 4.751
49 8.884 3.675
50 8.006 4.078
51 9.614 3.396
52 25.223 1.294
53 10.789 3.026
54 32.958 0.991
55 35.816 0.912
56 36.213 0.902
57 8.301 3.933
58 11.487 2.842
59 71.526 0.456
60 10.361 3.151
61 52.518 0.622
62 33.081 0.987
63 32.493 1.005
64 95.322 0.343
[Plot of Compaq AlphaServer ES40 DEC6600/500 speedup]
Compaq AlphaServer ES40 DEC6600/500
(4 EV6 21264 CPUs, 500 MHz, 4GB RAM)
OSF/1 4.0F
1,000,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 26.470 1.000
2 13.260 1.996
3 8.840 2.994
4 6.650 3.980
5 8.080 3.276
6 6.770 3.910
7 6.850 3.864
8 6.670 3.969
9 7.200 3.676
10 7.130 3.712
11 7.120 3.718
12 6.690 3.957
13 7.180 3.687
14 7.300 3.626
15 7.170 3.692
16 6.710 3.945
[Plot of Compaq AlphaServer ES40 Sierra/667 speedup]
Compaq AlphaServer ES40 Sierra/667
(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM)
100,000,000 iterations
Number of threads Wallclock Time (sec) Speedup
1 2.500 1.000
2 1.600 1.562
3 1.300 1.923
4 1.500 1.667
5 2.000 1.250
6 2.000 1.250
7 1.800 1.389
8 1.200 2.083
9 1.500 1.667
10 1.900 1.316
11 1.900 1.316
12 1.900 1.316
13 3.200 0.781
14 2.400 1.042
15 1.900 1.316
16 2.200 1.136
17 1.900 1.316
18 1.800 1.389
19 2.100 1.190
20 1.600 1.562
21 2.600 0.962
22 1.500 1.667
23 1.800 1.389
24 1.600 1.562
25 1.500 1.667
26 2.100 1.190
27 1.800 1.389
28 1.700 1.471
29 2.200 1.136
30 2.400 1.042
31 2.100 1.190
32 2.500 1.000
33 2.500 1.000
34 1.900 1.316
35 1.800 1.389
36 2.500 1.000
37 1.600 1.562
38 1.600 1.562
39 2.200 1.136
40 2.500 1.000
41 2.200 1.136
42 1.500 1.667
43 3.100 0.806
44 2.400 1.042
45 2.500 1.000
46 2.400 1.042
47 2.500 1.000
48 1.600 1.562
49 3.300 0.758
50 2.200 1.136
51 2.600 0.962
52 3.200 0.781
53 2.400 1.042
54 1.800 1.389
55 3.000 0.833
56 4.900 0.510
57 1.800 1.389
58 2.700 0.926
59 3.100 0.806
60 2.700 0.926
61 3.600 0.694
62 3.000 0.833
63 2.300 1.087
64 3.700 0.676
Sun SPARC Enterprise T5240
(two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM)
Solaris 10
10^8 (100,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^8 iterations]
10^9 (1,000,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^9 iterations]
10^10 (10,000,000,000) iterations
[Plot of Sun SPARC Enterprise T5240 speedup, 10^10 iterations]
Test machine for benchmarking (vendor withheld)
(4 CPUs, 16 threads/CPU) GNU/Linux
[Plot of test machine speedup]
