Run NCCL on Slurm clusters

This page describes how to run NCCL/gIB tests on a Slurm cluster. Choose the steps for your machine type:

A4X and A4 machines

The following test uses Ramble, an open-source, multi-platform experimentation framework written in Python, to coordinate the NCCL test runs. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X machines.

The run scripts used for this test are staged in the /opt/apps/system_benchmarks directory on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble to the /opt/apps/ramble directory.

  1. From the login node, in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the command uses nohup and redirects stdout and stderr to a log file.

    nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &

    This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The timestamp suffix ensures that each run creates a unique folder.

    For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.

  2. Review the results. The nccl.log file contains the logs from setting up and running the test. To view these logs, run the following command:

    tail -f nccl.log

    Press Ctrl+C to stop tailing the output at any time. At the end of nccl.log, the output should resemble the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload n_nodes msg_size busbw
    all-gather 2 1073741824 ###.##
    all-gather 2 2147483648 ###.##
    all-gather 2 4294967296 ###.##
    all-gather 2 8589934592 ###.##
    ...
    all-reduce 2 1073741824 ###.##
    ...
    reduce-scatter 2 1073741824 ###.##
    ...
    -------- Benchmarking Complete -------
    

    All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in the nccl-tests_$(date +%s)/summary.tsv file.

    Removing the nccl-tests_$(date +%s)/ directory removes all of the files generated during these tests.
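
The review step above can be sketched as two small shell helpers. This is a minimal sketch, not part of the staged benchmark scripts: the function names latest_results_dir and summary_for_workload are hypothetical, and the sketch assumes that summary.tsv is tab-separated, has a header row, and lists the workload name in its first column, as in the summary printed at the end of nccl.log.

```shell
# Hypothetical helper: print the newest nccl-tests_<timestamp> directory
# under a base directory (defaults to the current directory).
# Globs expand in sorted order, so the last match has the latest timestamp
# as long as all timestamps have the same number of digits.
latest_results_dir() {
  base="${1:-.}"
  latest=""
  for dir in "$base"/nccl-tests_*/; do
    [ -d "$dir" ] && latest="${dir%/}"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Hypothetical helper: print the header row plus the rows for one workload.
# ASSUMPTION: summary.tsv is tab-separated with the workload in column 1.
summary_for_workload() {
  awk -F'\t' -v w="$1" 'NR == 1 || $1 == w' "$2"
}
```

For example, `summary_for_workload all-reduce "$(latest_results_dir "${HOME}")/summary.tsv"` would print the header and the all-reduce rows from the most recent run, if the file layout matches the assumption.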

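The 16-node example in step 1 suggests that the node counts double from 2 up to the cluster size. The following sketch reproduces that sequence under this assumption. The node_counts function is hypothetical, and the staged script might choose node counts differently, especially for cluster sizes that are not a power of two.

```shell
# Hypothetical helper: list node counts 2, 4, 8, ... up to the cluster size.
# ASSUMPTION: the doubling schedule is inferred from the 16-node example only.
node_counts() {
  total="$1"
  n=2
  counts=""
  while [ "$n" -le "$total" ]; do
    counts="${counts:+$counts }$n"   # append with a separating space
    n=$((n * 2))
  done
  printf '%s\n' "$counts"
}

# node_counts 16 prints: 2 4 8 16
```
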
A3 Ultra machines

  1. Download the script needed to build the NCCL test by running the following command from the shared directory of the login node (this directory is usually located at ${HOME}):

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh

  2. After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:

    sbatch build-nccl-tests.sh

    The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.

  3. Verify that the NCCL test is built. To verify this, run the following command:

    sacct -a

    If successfully completed, the output is similar to the following:

    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1            build-ncc+    a3ultra                   112  COMPLETED      0:0
    

    If the build is successful, the directory where you ran the command also contains a file named nvidia+pytorch+24.09-py3.sqsh and a directory named nccl-tests.

  4. Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.

  5. Download the NCCL test script.

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

    To run any job on an A3 Ultra cluster, you must set several environment variables to enable high-performance networking with RDMA. Because this procedure uses enroot containers to launch workloads, set these variables in the container environment rather than the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.

  6. Run the NCCL test script. The test can take approximately 15 minutes, or longer.

    sbatch run-nccl-tests.sh

  7. Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.

    The output is similar to the following:

    #
    #                                                               out-of-place                        in-place
    #       size         count     type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                              (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       268435456       4194304    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       536870912       8388608    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      1073741824      16777216    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      2147483648      33554432    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      4294967296      67108864    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      8589934592     134217728    float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth : ###.##
    #
    
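The build checks in steps 3 and 4 can be combined into one small shell function. This is a minimal sketch: check_nccl_build is a hypothetical name, and the function only tests for the two artifacts named in those steps (the nvidia+pytorch+24.09-py3.sqsh image and the nccl-tests/build/all_gather_perf binary). It does not confirm that the binary runs.

```shell
# Hypothetical helper: report whether the build artifacts from steps 3 and 4
# exist under a directory (defaults to the current directory).
# Returns 0 only if both files are present.
check_nccl_build() {
  dir="${1:-.}"
  status=0
  for path in \
    "$dir/nvidia+pytorch+24.09-py3.sqsh" \
    "$dir/nccl-tests/build/all_gather_perf"; do
    if [ -f "$path" ]; then
      echo "found:   $path"
    else
      echo "missing: $path"
      status=1
    fi
  done
  return "$status"
}
```

Run `check_nccl_build` from the directory where you submitted build-nccl-tests.sh. A non-zero exit status means at least one artifact is missing.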

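To pull the headline number out of the results file in step 7, you can extract the Avg bus bandwidth line. The following is a minimal sketch: avg_busbw is a hypothetical helper, and it assumes that the line has the form shown in the sample output above, with the value after the colon.

```shell
# Hypothetical helper: print the value from the "Avg bus bandwidth" line
# of an nccl-tests output file such as slurm-XX.out.
# ASSUMPTION: the line looks like "# Avg bus bandwidth : <value>".
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ {
    gsub(/[ \t]/, "", $2)   # strip spaces and tabs around the value
    print $2
  }' "$1"
}
```

For example, `avg_busbw slurm-XX.out` prints only the bandwidth value, where slurm-XX.out stands for the output file that your job produced.
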

Last updated 2025-11-24 UTC.