-
Notifications
You must be signed in to change notification settings - Fork 124
TL/UCP: Add all-reduce ring alogorithm #1082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
swx-jenkins3
commented
Feb 25, 2025
Can one of the admins verify this patch?
armratner
commented
Feb 25, 2025
Working on Gtest
armratner
commented
Mar 6, 2025
Added the gtest,
- Data Type and Operation Coverage
- INT32 with sum operation
- FLOAT32 with sum operation
- INT32 with product operation
- INT32 with max operation
- INT32 with min operation
- FLOAT64 with sum operation
- Standard Test (ring)
- Tests the ring algorithm with standard configurations:
- Different data sizes (8, 65536, 123567 elements)
- Both in-place and non-in-place operations
- Different memory types (HOST, CUDA, CUDA_MANAGED where available)
- 3 iterations per configuration to ensure stability - Edge Case Test (ring_edge_cases)
- Tests the ring algorithm with non-power-of-two team sizes (3, 7, 13)
- Tests edge cases with message sizes:
- Empty message (0 elements)
- Single element (1 element)
- Small odd sizes (3, 17 elements) - Persistent Operation Test (ring_persistent)
- Tests the algorithm's behavior in persistent mode
- Uses a consistent buffer size (1024 elements)
- Runs 5 iterations to verify consistency
04e5531 to
d8396c6
Compare
75c3b77 to
97ddcc2
Compare
@samnordmann
samnordmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, however I do not see the tests passing the CI logs. Can you push a new commit to Trigger the CI? @janjust do we know why the Jenkin CI hasnt been triggered?
I think the algo might run into a segfault because the executor is not initialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the alg!
janjust
commented
Apr 15, 2025
@janjust do we know why the Jenkin CI hasnt been triggered?
Not sure - we had issues with jenkins for a bit now, but working with Artem and Andrii to resolve it - should be good now, we can just retrigger.
4c4d9a8 to
4d0548b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please double check the gtest, it may not have been running the ring alg during the test
4d0548b to
5ffe341
Compare
e120c3f to
5bd39b5
Compare
armratner
commented
Apr 23, 2025
|
| Message Size
|
armratner
commented
Apr 24, 2025
Scale context – 128 ranks / 4 nodes / 32 PPN
5bd39b5 to
d11bff0
Compare
Signed-off-by: Armen Ratner <armeng@nvidia.com>
- Add tests for various data types and reduction operations - Test edge cases with non-power-of-two team sizes and odd message sizes - Test persistent operations for stability - Test with different memory types (HOST, CUDA, CUDA_MANAGED where available) Signed-off-by: Armen Ratner <armeng@nvidia.com>
d11bff0 to
99a8b15
Compare
What
This PR adds a new ring-based Allreduce algorithm (named
"ring") to the UCP transport layer within UCC. It introduces:allreduce_ring.cimplementing the ring-based method.Makefile.am) to include the new file.allreduce.[ch]), including a new enum valueUCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.allreduce_ring.cthat manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.Why ?
A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:
How ?
The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:
File Additions/Modifications:
allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions viaucc_dt_reduce.Makefile.am: Includes the new file in the build.allreduce.c/allreduce.h: Adds the new"ring"algorithm ID and associated function prototypes.Implementation Details:
num_chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.scratchbuffer is allocated per rank to hold incoming chunk data before reduction.Code Flow:
ucc_dt_reduceon each incoming portion.By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.