Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

TL/UCP: Add all-reduce ring alogorithm #1082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
armratner wants to merge 2 commits into openucx:master
base: master
Choose a base branch
Loading
from armratner:all_reduce_ring_new

Conversation

@armratner
Copy link

@armratner armratner commented Feb 25, 2025

What

This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:

  • A new source file allreduce_ring.c implementing the ring-based method.
  • Modifications to the build system (Makefile.am) to include the new file.
  • Updates to the Allreduce interface (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
  • The ring-based algorithm’s logic (init, start, progress, and finalize) in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.

Why ?

A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:

  • Improved scalability for certain message sizes.
  • A straightforward method for ring-style communication patterns common in distributed HPC and AI workloads.

How ?

The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:

  1. File Additions/Modifications:

    • allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
    • Makefile.am: Includes the new file in the build.
    • allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
  2. Implementation Details:

    • Data is divided into num_chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
    • A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
    • The algorithm ensures all chunks complete one round in the ring, then finalizes once the entire data is fully reduced on each rank.
  3. Code Flow:

    • Init: Sets up the ring task, scratch buffer, and references to the team’s executor.
    • Start: Posts initial sends/receives and enqueues the progress function.
    • Progress: Drives the ring of sends/receives chunk by chunk, calling ucc_dt_reduce on each incoming portion.
    • Finalize: Cleans up (frees scratch space and finishes the task).

By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.

Copy link

Can one of the admins verify this patch?

Copy link
Author

Working on Gtest

samnordmann reacted with thumbs up emoji

Copy link
Author

Added the gtest,

  1. Data Type and Operation Coverage
    • INT32 with sum operation
    • FLOAT32 with sum operation
    • INT32 with product operation
    • INT32 with max operation
    • INT32 with min operation
    • FLOAT64 with sum operation
  2. Standard Test (ring)
    - Tests the ring algorithm with standard configurations:
    - Different data sizes (8, 65536, 123567 elements)
    - Both in-place and non-in-place operations
    - Different memory types (HOST, CUDA, CUDA_MANAGED where available)
    - 3 iterations per configuration to ensure stability
  3. Edge Case Test (ring_edge_cases)
    - Tests the ring algorithm with non-power-of-two team sizes (3, 7, 13)
    - Tests edge cases with message sizes:
    - Empty message (0 elements)
    - Single element (1 element)
    - Small odd sizes (3, 17 elements)
  4. Persistent Operation Test (ring_persistent)
    - Tests the algorithm's behavior in persistent mode
    - Uses a consistent buffer size (1024 elements)
    - Runs 5 iterations to verify consistency

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however I do not see the tests passing the CI logs. Can you push a new commit to Trigger the CI? @janjust do we know why the Jenkin CI hasnt been triggered?

I think the algo might run into a segfault because the executor is not initialized

Copy link
Collaborator

@nsarka nsarka left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the alg!

Copy link
Collaborator

janjust commented Apr 15, 2025

@janjust do we know why the Jenkin CI hasnt been triggered?
Not sure - we had issues with jenkins for a bit now, but working with Artem and Andrii to resolve it - should be good now, we can just retrigger.

Copy link
Collaborator

@nsarka nsarka left a comment
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check the gtest, it may not have been running the ring alg during the test

@armratner armratner force-pushed the all_reduce_ring_new branch 2 times, most recently from e120c3f to 5bd39b5 Compare April 23, 2025 15:49
Copy link
Author

| Message Size
Emphasising the Latency Jump at 8 MB

(MB) UCC Ring (μs) UCC Knomial (μs) OMPI Tuned (μs)
1 2 287.61 579.78 3 710.72
2 2 656.15 983.58 7 397.25
4 4 380.01 3 531.90 15 340.96
8 11 228.21 12 728.49 32 010.70
16 26 984.31 31 163.98 70 183.02
32 50 100.87 69 753.03 154 048.94
64 100 628.53 145 987.70 314 291.89
128 200 764.59 295 283.81 638 286.61
256 405 462.40 590 662.48
512 992 232.12 1 182 699.49

Copy link
Author

Scale context – 128 ranks / 4 nodes / 32 PPN

armratner added 2 commits June 12, 2025 10:47
Signed-off-by: Armen Ratner <armeng@nvidia.com>
- Add tests for various data types and reduction operations
- Test edge cases with non-power-of-two team sizes and odd message sizes
- Test persistent operations for stability
- Test with different memory types (HOST, CUDA, CUDA_MANAGED where available)
Signed-off-by: Armen Ratner <armeng@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@janjust janjust Awaiting requested review from janjust

@samnordmann samnordmann Awaiting requested review from samnordmann

@ikryukov ikryukov Awaiting requested review from ikryukov

@nsarka nsarka Awaiting requested review from nsarka

@Sergei-Lebedev Sergei-Lebedev Awaiting requested review from Sergei-Lebedev

At least 2 approving reviews are required to merge this pull request.

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /