This is a difficult question. Debugging in serial can be tricky: errors, uninitialized variables, stack smashing, etc. Debugging in parallel adds multiple different dimensions to this problem: a greater propensity for race conditions, asynchronous events, and the general difficulty of trying to understand N processes simultaneously executing — the problem becomes quite formidable.
This FAQ section does not provide any definite solutions to debugging in parallel. At best, it shows some general techniques and a few specific examples that may be helpful to your situation.
But there are various controls within Open MPI that can help with debugging. These are probably the most valuable entries in this FAQ section.
There are two main categories of tools that can aid in parallel debugging:
- Debuggers: serial debuggers (such as gdb) attached to individual MPI processes, or parallel debuggers that control all the processes of a job at once.
- Memory checkers: tools (such as Valgrind) that detect memory errors in MPI applications.
Both freeware and commercial solutions are available for each kind of tool.
Open MPI has a series of MCA parameters for the MPI layer itself that are designed to help with debugging. These parameters can be set in the usual ways. MPI-level MCA parameters can be displayed by invoking the following command:
# Starting with Open MPI v1.7, you must use "--level 9" to see
# all the MCA parameters (the default is "--level 1"):
shell$ ompi_info --param mpi all --level 9

# Before Open MPI v1.7:
shell$ ompi_info --param mpi all
Here is a summary of the debugging parameters for the MPI layer (consult ompi_info for the authoritative list in your build):
- mpi_param_check: if true, the parameters passed to each MPI function are checked for correctness at run time (e.g., NULL or otherwise illegal values raise an MPI exception). Disabling these checks slightly increases performance.
- mpi_show_handle_leaks: if true, display a list of any MPI handles (e.g., communicators, datatypes, requests) that were not freed before MPI_FINALIZE.
- mpi_no_free_handles: if true, do not actually free MPI objects when their corresponding "free" functions are invoked (e.g., MPI_COMM_FREE). This can help track down applications that accidentally use MPI handles after they have been freed.
- mpi_abort_delay: if nonzero, print the hostname and PID of the process that invoked MPI_ABORT, then delay that many seconds before exiting (a negative value means delay indefinitely), so that a debugger can be attached manually.
- mpi_abort_print_stack: if nonzero, print a stack trace (on supported systems) when MPI_ABORT is invoked.
Note that some of these parameters are only available if Open MPI was configured with debugging enabled (only then will ompi_info report that they exist).
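For example, to enable handle-leak reporting at MPI_FINALIZE (assuming the mpi_show_handle_leaks parameter is available in your build), the parameter can be set on the mpirun command line:

shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 ./my_app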
Must I build Open MPI with compiler/linker debugging flags (such as -g) to be able to debug MPI applications?

No.
If you build Open MPI without compiler/linker debugging flags (such as
-g), you will not be able to step inside MPI functions
when you debug your MPI applications. However, this is likely what
you want — the internals of Open MPI are quite complex and you
probably don't want to start poking around in there.
You'll need to compile your own applications with -g (or whatever
your compiler's equivalent is), but unless you have a need/desire to
be able to step into MPI functions to see the internals of Open MPI,
you do not need to build Open MPI with -g.
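For example, using the mpicc wrapper compiler (the source file name here is illustrative):

shell$ mpicc -g my_mpi_application.c -o my_mpi_application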
Can I use serial debuggers (such as gdb) to debug MPI applications?

Yes; the Open MPI developers do this all the time.
There are two common ways to use serial debuggers: attach to individual MPI processes after they are running, or use mpirun to launch separate instances of serial debuggers (both are described below).

1. Attach to individual MPI processes after they are running.
For example, launch your MPI application as normal with mpirun. Then login to the node(s) where your application is running and use the --pid option to gdb to attach to your application.
An inelegant-but-functional technique commonly used with this method is to insert the following code in your application where you want to attach:
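/* A minimal sketch of such an attach-wait block (requires <stdio.h> and
   <unistd.h>); it matches the behavior described below. The volatile
   qualifier keeps the compiler from optimizing away the loop test on i. */
{
    volatile int i = 0;
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("PID %d on %s ready for attach\n", getpid(), hostname);
    fflush(stdout);
    while (0 == i) {
        sleep(5);
    }
}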
This code outputs a line to stdout showing the name of the host
where the process is running and the PID to attach to. It will then
spin on the sleep() function forever waiting for you to attach with
a debugger. Using sleep() as the inside of the loop means that the
processor won't be pegged at 100% while waiting for you to attach.
Once you attach with a debugger, go up the function stack until you
are in this block of code (you'll likely attach during the sleep())
then set the variable i to a nonzero value. With GDB, the syntax
is:
(gdb) set var i = 7
Then set a breakpoint after your block of code and continue execution until the breakpoint is hit. Now you have control of your live MPI application and use of the full functionality of the debugger.
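For instance, a hypothetical GDB session at this point (the file name and line number are placeholders for your own code):

(gdb) break my_mpi_application.c:42
(gdb) continue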
You can even add conditionals to only allow this "pause" in the application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0, or whatever process is misbehaving).
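A sketch of such a conditional, assuming the rank is obtained with MPI_Comm_rank (the choice of rank 0 is illustrative):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    /* ...the attach-wait block shown above... */
}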
2. Use mpirun to launch separate instances of serial debuggers.
This technique launches a separate window for each MPI process in
MPI_COMM_WORLD, each one running a serial debugger (such as gdb)
that will launch and run your MPI application. Having a separate
window for each MPI process can be quite handy for low process-count
MPI jobs, but requires a bit of setup and configuration that is
outside of Open MPI to work properly. A naive approach would be to
assume that the following would immediately work:
shell$ mpirun -np 4 xterm -e gdb my_mpi_application
If running on a personal computer, this will probably work.
You can also use
tmpi to launch the debuggers in separate tmux
panes instead of separate xterm windows, which has the advantage of
synchronizing keyboard input between all debugger instances.
Unfortunately, the tmpi or xterm approaches likely won't work
on a computing cluster. Several factors must be considered:

- In ssh-based environments, Open MPI defaults to using ssh when it is available, falling back to rsh when ssh cannot be found in the $PATH. But note that Open MPI closes the ssh (or rsh) sessions when the MPI job starts, for scalability reasons. This means that the built-in SSH X forwarding tunnels will be shut down before the xterms can be launched. Although it is possible to force Open MPI to keep its SSH connections active (to keep the X tunneling available), we recommend using non-SSH-tunneled X connections, if possible (see below).
- In other environments (e.g., under a resource manager), the environment of the process that invoked mpirun may be copied to all nodes. In this case, the DISPLAY environment variable may not be suitable.
- There may be firewalls or other network blocks that prevent X traffic from flowing between the hosts where the MPI processes (and xterms) are running and the host connected to the output display.

The easiest way to get remote X applications (such as
xterm) to display on your local screen is to forego the
security of SSH-tunneled X forwarding. In a closed environment such
as an HPC cluster, this may be an acceptable practice (indeed, you may
not even have the option of using SSH X forwarding if SSH logins
to cluster nodes are disabled), but check with your security
administrator to be sure.
If using non-encrypted X11 forwarding is permissible, we recommend the following:

1. For each host where the MPI processes (and xterms) will run, allow X connections from that host to your display with the xhost command. For example:

shell$ cat my_hostfile
inky
blinky
stinky
clyde
shell$ for host in `cat my_hostfile` ; do xhost +$host ; done
2. Use the -x option to mpirun to export an appropriate DISPLAY
variable so that the launched X applications know where to send their
output. An appropriate value is usually (but not always) the
hostname of the display where you want the output, plus a :0
(or :0.0) suffix. For example:

shell$ hostname
arcade.example.com
shell$ mpirun -np 4 --hostfile my_hostfile \
    -x DISPLAY=arcade.example.com:0 xterm -e gdb my_mpi_application
Note that X traffic is fairly "heavy" — if you are operating over a
slow network connection, it may take some time before the xterm
windows appear on your screen.
3. If your xterm supports it, the -hold option may be useful.
-hold tells xterm to stay open even when the application has
completed. This means that if something goes wrong (e.g., gdb fails
to execute, or unexpectedly dies, or ...), the xterm window will
stay open, allowing you to see what happened, instead of closing
immediately and losing whatever error message may have been
output.

4. When you have finished debugging, use xhost again to
disable these permissions:
shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
Note that mpirun will not complete until all the xterms
complete.
There may be many reasons why an MPI application fails or dies without producing any output; the Open MPI Team strongly encourages the use of tools (such as debuggers) whenever possible.
One of the reasons, however, may come from inside Open MPI itself. If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process — which can be unwieldy, especially when running large MPI jobs).
However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process's memory is already corrupted, Open MPI's attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the "print the error" routine. It therefore almost always successfully outputs error messages in real time — but at the expense that you'll potentially see the same error message for each MPI process that encountered the error.
Hence, the error message aggregation is usually a good thing, but
sometimes it can mask a real error. You can disable Open MPI's error
message aggregation with the orte_base_help_aggregate MCA
parameter. For example:
shell$ mpirun --mca orte_base_help_aggregate 0 ...
The memchecker MCA framework allows MPI-semantic checking of your application (as well as of the internals of Open MPI), with the help of memory checking tools such as Memcheck from the Valgrind suite (http://www.valgrind.org/).
The memchecker component is included in Open MPI v1.3 and later.
Memchecker is implemented on the basis of Valgrind's Memcheck tool, so it inherits all of Memcheck's capabilities: it checks all reads and writes of memory, and intercepts calls to malloc/new/free/delete. Most importantly, Memchecker is able to detect user buffer errors in both non-blocking and one-sided communication, e.g., reading or writing the buffer of an active non-blocking receive operation, or writing the buffer of an active non-blocking send operation.
Here are some examples of erroneous code that Memchecker can detect:

Accessing a buffer that is under the control of a non-blocking communication:
int buf;
MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
// The following line will produce a memchecker warning
buf = 4711;
MPI_Wait(&req, &status);
Wrong input parameters, e.g., a wrongly sized send buffer (the following is a minimal sketch of such an error; the sizes are illustrative):
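char *send_buffer;
send_buffer = malloc(5);
memset(send_buffer, 0, 5);
// The following line will produce a memchecker warning:
// 10 bytes are sent from a 5-byte buffer
MPI_Send(send_buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);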
Accessing a window that is under the control of a one-sided communication:
MPI_Get(A, 10, MPI_INT, 1, 0, 1, MPI_INT, win);
// The following line will produce a memchecker warning
A[0] = 4711;
MPI_Win_fence(0, win);
Uninitialized input buffers:
int *buffer;
buffer = malloc(10 * sizeof(int));
// The following line will produce a memchecker warning
MPI_Send(buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
Usage of the uninitialized MPI_ERROR field of the MPI_Status structure (the MPI-1 standard defines the MPI_ERROR field to be undefined for single-completion calls such as MPI_Wait or MPI_Test; see MPI-1 p. 22):
MPI_Wait(&request, &status);
// The following line will produce a memchecker warning
if (status.MPI_ERROR != MPI_SUCCESS)
    return ERROR;
To use Memchecker, you need Open MPI 1.3 or later, and Valgrind 3.2.0 or later.
As this functionality is off by default, you need to enable it with
the configure flag --enable-memchecker. configure will then check
for a recent Valgrind distribution and include the opal/mca/memchecker
component in the build. You can verify that the component was built
by using the ompi_info command. Please note that all of this only
makes sense together with --enable-debug: Valgrind needs the
debugging information to output messages that point directly to the
relevant source code lines. Without debugging info, the messages from
Valgrind are nearly useless.
Here is a configuration example to enable Memchecker:
shell$ ./configure --prefix=/path/to/openmpi --enable-debug \
    --enable-memchecker --with-valgrind=/path/to/valgrind
To check if Memchecker is successfully enabled after installation, simply run this command:
shell$ ompi_info | grep memchecker
You will get an output message like this:
MCA memchecker: valgrind (MCA v1.0, API v1.0, Component v1.3)
Otherwise, you probably didn't configure and install Open MPI correctly.
To run your application with Memchecker, first make sure that Valgrind 3.2.0 or later is installed and that Open MPI was compiled with Memchecker enabled. Then simply run your application under Valgrind, e.g.:
shell$ mpirun -np 2 valgrind ./my_app
If you enabled Memchecker but do not want to check the application at this time, just run it as usual, e.g.:
shell$ mpirun -np 2 ./my_app
The configure option --enable-memchecker (together with --enable-debug) does
cause performance degradation, even when not running under Valgrind.
The following explains the mechanism and may help in deciding
whether to provide a cluster-wide installation with --enable-memchecker.
There are two cases: running with Valgrind and running without it. When the application is not running under Valgrind, the only overhead is the so-called Valgrind ClientRequest machinery: a few extra instructions per check. On x86-64 these consist of a special rotate-instruction preamble and a register exchange (from valgrind.h):
#define __SPECIAL_INSTRUCTION_PREAMBLE \
    "rolq 3,ドル %%rdi; rolq 13,ドル %%rdi\n\t" \
    "rolq 61,ドル %%rdi; rolq 51,ドル %%rdi\n\t"
__asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE            \
                 /* %RDX = client_request ( %RAX ) */      \
                 "xchgq %%rbx,%%rbx"                       \
                 : "=d" (_zzq_result)                      \
                 : "a" (&_zzq_args[0]), "0" (_zzq_default) \
                 : "cc", "memory"                          \
                 );                                        \
The first request checks whether we are running under Valgrind; when we are not, all subsequent checks (aka ClientRequests) are skipped.
Further information and performance data with the NAS Parallel Benchmarks may be found in the paper Memory Debugging of MPI-Parallel Applications in Open MPI.
The question of whether Open MPI is "Valgrind-clean", i.e., whether Valgrind reports false positives from Open MPI itself, has been raised many times on the mailing list.
There are many situations where Open MPI purposefully does not
initialize memory that it subsequently communicates, e.g., via calls to writev.
Furthermore, several cases are known where memory is not properly freed upon
MPI_Finalize.
This certainly does not help in distinguishing real errors from false positives. Valgrind provides functionality to suppress errors and warnings from certain function contexts.
To ease debugging with Valgrind, starting with v1.5 Open MPI provides a so-called Valgrind suppression file that can be passed on the command line:
shell$ mpirun -np 2 valgrind --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./my_app
More information on suppression files and how to generate them can be found in Valgrind's documentation.
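For example, Valgrind's standard --gen-suppressions option prints a ready-made suppression block for each error it reports; these blocks can be pasted into a custom suppression file:

shell$ mpirun -np 2 valgrind --gen-suppressions=all ./my_app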