[RCCL] Fix ncclCommGrow hang when growing to 8-rank single-node comm #7231
nikitaxgusev merged 1 commit
Skip DDA IPC init during ncclCommGrow. The DDA path was gated only on !job->parent, so on a grow the new joining rank (parent == NULL) entered ncclDdaIpcCommInit while existing ranks (parent != NULL) skipped it. The collective bootstrapAllGather inside DDA init then deadlocked, since only one rank participated, causing the grown communicator to hang and the test to time out. Add !job->isGrow to the gate so all ranks consistently skip DDA during a grow. Fresh init and split behavior are unchanged.
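The participation mismatch described above can be modeled in a few lines. This is only a sketch under stated assumptions: `JobModel`, `entersDdaBefore`, `entersDdaAfter`, `makeGrowJobs`, `countParticipants`, and `isConsistent` are hypothetical names made up for illustration, not RCCL code, and the grow is assumed to consist of 7 existing ranks plus 1 joining rank.

```cpp
#include <vector>

// Hypothetical stand-in for the two job fields the gate reads.
struct JobModel {
  bool hasParent;  // models job->parent != NULL
  bool isGrow;     // models job->isGrow
};

// Gate before the fix: only the parent pointer is consulted.
bool entersDdaBefore(const JobModel& j) { return !j.hasParent; }

// Gate after the fix: a grow skips DDA on every rank.
bool entersDdaAfter(const JobModel& j) { return !j.hasParent && !j.isGrow; }

// Assumed grow scenario: 7 existing ranks (parent set) plus 1 joining rank
// (no parent). All 8 jobs are marked as part of a grow.
std::vector<JobModel> makeGrowJobs() {
  std::vector<JobModel> jobs(7, JobModel{true, true});
  jobs.push_back(JobModel{false, true});
  return jobs;
}

// Count how many ranks would enter DDA init under a given gate.
int countParticipants(const std::vector<JobModel>& jobs,
                      bool (*gate)(const JobModel&)) {
  int n = 0;
  for (const JobModel& j : jobs) {
    if (gate(j)) ++n;
  }
  return n;
}

// A collective over all ranks can only complete if every rank calls it,
// or if no rank calls it at all.
bool isConsistent(int participants, int nRanks) {
  return participants == 0 || participants == nRanks;
}
```

In this model the old gate lets exactly one of the eight ranks into DDA init, which is the inconsistent case that blocks the all-gather, while the new gate lets none in.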
Pull request overview
Fixes a deadlock in RCCL communicator growth when expanding to an 8-rank, single-node communicator by ensuring the DDA IPC init path is not entered during ncclCommGrow, avoiding inconsistent participation in the DDA bootstrap collectives.
Changes:
- Gate `ncclDdaIpcCommInit()` behind `!job->isGrow` so grow operations consistently skip DDA IPC initialization.
- Preserve existing behavior for fresh communicator init and split/shrink paths.
nikitaxgusev commented Jun 15, 2026
It looks like this issue was introduced by the new condition recently added in 3dc4fb1:
if (!job->parent && comm->nNodes == 1 && comm->nRanks == 8) {
My understanding is that CI was not fully working at that time, so the issue was not caught before merge. Once CI was relaunched, this failure showed up.
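For reference, the condition quoted above can be written as a standalone predicate with the extra `isGrow` term from this PR. This is a sketch, not the actual change in src/init.cc: `InitJobSketch`, `CommSketch`, and `ddaGateFixed` are hypothetical names, and the position of the `isGrow` term inside the condition is an assumption.

```cpp
// Hypothetical models of the fields read by the condition.
struct InitJobSketch {
  bool hasParent;  // models job->parent != NULL
  bool isGrow;     // models job->isGrow
};

struct CommSketch {
  int nNodes;  // models comm->nNodes
  int nRanks;  // models comm->nRanks
};

// Condition from 3dc4fb1 plus the !isGrow term described in this PR.
// Where the term sits in the real condition is assumed here.
bool ddaGateFixed(const InitJobSketch& job, const CommSketch& comm) {
  return !job.hasParent && !job.isGrow &&
         comm.nNodes == 1 && comm.nRanks == 8;
}
```

Under this model a fresh 8-rank single-node init still takes the DDA path, a split (parent set) skips it as before, and both the joining and the existing ranks of a grow skip it.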
Motivation
ncclCommGrow hangs and times out when growing a communicator to exactly 8 ranks on a single node, blocking the elastic-grow feature. This PR fixes the deadlock so a grow to a full 8-GPU communicator completes reliably.
Technical Details
DDA IPC init (`ncclDdaIpcCommInit`) was gated only on `!job->parent`, so during a grow the new joining rank (no parent) entered it while existing ranks (with a parent) skipped it, deadlocking the collective `bootstrapAllGather` inside. The fix adds `!job->isGrow` to the gate in `ncclCommInitRankFunc` (src/init.cc) so all ranks consistently skip DDA during a grow, leaving fresh init and split paths unchanged.
JIRA ID
Resolves AICOMRCCL-1262.
Test Plan
Ran the MPI unit test `GrowMPITest.*` with 8 ranks on a single MI300X node, which reproduced the hang prior to the change.
Test Result
The previously hanging test now completes init on all 8 ranks and passes the AllReduce verification, with no 300s timeout.
Submission Checklist