Search code, repositories, users, issues, pull requests...

Copy link

@armratner armratner commented Feb 25, 2025

What

This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:

A new source file allreduce_ring.c implementing the ring-based method.
Modifications to the build system (Makefile.am) to include the new file.
Updates to the Allreduce interface (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
The ring-based algorithm’s logic (init, start, progress, and finalize) in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.

Why ?

A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:

Improved scalability for certain message sizes.
A straightforward method for ring-style communication patterns common in distributed HPC and AI workloads.

How ?

The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:

File Additions/Modifications:
- allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
- Makefile.am: Includes the new file in the build.
- allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
Implementation Details:
- Data is divided into num_chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
- A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
- The algorithm ensures all chunks complete one round in the ring, then finalizes once the entire data is fully reduced on each rank.
Code Flow:
- Init: Sets up the ring task, scratch buffer, and references to the team’s executor.
- Start: Posts initial sends/receives and enqueues the progress function.
- Progress: Drives the ring of sends/receives chunk by chunk, calling ucc_dt_reduce on each incoming portion.
- Finalize: Cleans up (frees scratch space and finishes the task).

By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.

@swx-jenkins3

Copy link

swx-jenkins3 commented Feb 25, 2025

Can one of the admins verify this patch?

Copy link

Author

armratner commented Feb 25, 2025

Working on Gtest

@nsarka nsarka requested review from Sergei-Lebedev, janjust and samnordmann

February 27, 2025 16:09

samnordmann

samnordmann reviewed

Mar 5, 2025

src/components/tl/ucp/allreduce/allreduce.c Outdated Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce_ring.c Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce.h Outdated Show resolved Hide resolved

src/components/tl/ucp/tl_ucp_coll.h Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce_ring.c Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce_ring.c Outdated Show resolved Hide resolved

@wfaderhold21 wfaderhold21 added the Ready-for-Review label

Mar 5, 2025

Copy link

Author

armratner commented Mar 6, 2025

Added the gtest,

Data Type and Operation Coverage
- INT32 with sum operation
- FLOAT32 with sum operation
- INT32 with product operation
- INT32 with max operation
- INT32 with min operation
- FLOAT64 with sum operation
Standard Test (ring)
- Tests the ring algorithm with standard configurations:
- Different data sizes (8, 65536, 123567 elements)
- Both in-place and non-in-place operations
- Different memory types (HOST, CUDA, CUDA_MANAGED where available)
- 3 iterations per configuration to ensure stability
Edge Case Test (ring_edge_cases)
- Tests the ring algorithm with non-power-of-two team sizes (3, 7, 13)
- Tests edge cases with message sizes:
- Empty message (0 elements)
- Single element (1 element)
- Small odd sizes (3, 17 elements)
Persistent Operation Test (ring_persistent)
- Tests the algorithm's behavior in persistent mode
- Uses a consistent buffer size (1024 elements)
- Runs 5 iterations to verify consistency

@armratner armratner force-pushed the all_reduce_ring_new branch from 04e5531 to d8396c6 Compare

March 7, 2025 04:03

@janjust janjust force-pushed the all_reduce_ring_new branch from 75c3b77 to 97ddcc2 Compare

March 20, 2025 16:08

@janjust janjust requested a review from nsarka

March 20, 2025 16:08

@janjust janjust requested a review from samnordmann

April 10, 2025 15:25

samnordmann

samnordmann reviewed

Copy link

Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however I do not see the tests passing the CI logs. Can you push a new commit to Trigger the CI? @janjust do we know why the Jenkin CI hasnt been triggered?

I think the algo might run into a segfault because the executor is not initialized

src/components/tl/ucp/allreduce/allreduce_ring.c Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce.c Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce_ring.c Outdated Show resolved Hide resolved

nsarka

nsarka reviewed

Copy link

Collaborator

@nsarka nsarka left a comment •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the alg!

src/components/tl/ucp/allreduce/allreduce_ring.c Show resolved Hide resolved

src/components/tl/ucp/allreduce/allreduce_ring.c Outdated Show resolved Hide resolved

@janjust

Copy link

Collaborator

janjust commented Apr 15, 2025

@janjust do we know why the Jenkin CI hasnt been triggered?
Not sure - we had issues with jenkins for a bit now, but working with Artem and Andrii to resolve it - should be good now, we can just retrigger.

ikryukov

ikryukov reviewed

src/components/tl/ucp/allreduce/allreduce.h Show resolved Hide resolved

@armratner armratner force-pushed the all_reduce_ring_new branch from 4c4d9a8 to 4d0548b Compare

April 15, 2025 20:38

@armratner armratner requested review from ikryukov, nsarka and samnordmann

April 15, 2025 20:44

nsarka

nsarka reviewed

src/components/tl/ucp/allreduce/allreduce_ring.c Outdated Show resolved Hide resolved

test/gtest/coll/test_allreduce.cc Outdated Show resolved Hide resolved

nsarka

nsarka reviewed

Copy link

Collaborator

@nsarka nsarka left a comment •

edited

Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check the gtest, it may not have been running the ring alg during the test

Sergei-Lebedev

Sergei-Lebedev reviewed

Apr 16, 2025

src/components/tl/ucp/allreduce/allreduce_ring.c Outdated Show resolved Hide resolved

@armratner armratner force-pushed the all_reduce_ring_new branch from 4d0548b to 5ffe341 Compare

April 17, 2025 02:38

@armratner armratner requested review from Sergei-Lebedev and nsarka

April 17, 2025 02:39

@armratner armratner force-pushed the all_reduce_ring_new branch 2 times, most recently from e120c3f to 5bd39b5 Compare

April 23, 2025 15:49

Copy link

Author

armratner commented Apr 23, 2025

| Message Size
Emphasising the Latency Jump at 8 MB

(MB)	UCC Ring (μs)	UCC Knomial (μs)	OMPI Tuned (μs)
1	2 287.61	579.78	3 710.72
2	2 656.15	983.58	7 397.25
4	4 380.01	3 531.90	15 340.96
8	11 228.21	12 728.49	32 010.70
16	26 984.31	31 163.98	70 183.02
32	50 100.87	69 753.03	154 048.94
64	100 628.53	145 987.70	314 291.89
128	200 764.59	295 283.81	638 286.61
256	405 462.40	590 662.48	—
512	992 232.12	1 182 699.49	—