[RCCL] Fix ncclCommGrow hang when growing to 8-rank single-node comm #7231

Merged: nikitaxgusev merged 1 commit into develop from users/atulkulk/fix-grow-dda-ipc-hang on Jun 15, 2026

atulkulk commented Jun 13, 2026

Motivation

ncclCommGrow hangs and times out when growing a communicator to exactly 8 ranks on a single node, blocking the elastic-grow feature. This PR fixes the deadlock so a grow to a full 8-GPU communicator completes reliably.

Technical Details

DDA IPC init (ncclDdaIpcCommInit) was gated only on !job->parent, so during a grow the new joining rank (no parent) entered it while existing ranks (with a parent) skipped it, deadlocking the collective bootstrapAllGather inside. The fix adds !job->isGrow to the gate in ncclCommInitRankFunc (src/init.cc) so all ranks consistently skip DDA during a grow, leaving fresh init and split paths unchanged.
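
The gate change described above can be sketched as a pair of predicates. This is a minimal model, not the code in src/init.cc: `JobView`, `entersDdaBefore`, and `entersDdaAfter` are hypothetical names, and the assumption that existing ranks also carry `isGrow` during a grow is mine. The other conditions on the real gate are left out here.

```cpp
// Hypothetical stand-in for the two job fields the gate reads.
struct JobView {
  bool hasParent;  // models job->parent != NULL
  bool isGrow;     // models job->isGrow
};

// Gate before the fix: only the parent check.
constexpr bool entersDdaBefore(JobView j) { return !j.hasParent; }

// Gate after the fix: the parent check plus !isGrow.
constexpr bool entersDdaAfter(JobView j) { return !j.hasParent && !j.isGrow; }

// Rank roles as described in the PR text (isGrow on existing ranks is assumed).
constexpr JobView kJoiningRank{false, true};   // new rank in a grow, no parent
constexpr JobView kExistingRank{true, true};   // rank already in the comm
constexpr JobView kFreshInit{false, false};    // ordinary ncclCommInitRank

// Before the fix the two kinds of rank disagree, which is the deadlock setup.
static_assert(entersDdaBefore(kJoiningRank) != entersDdaBefore(kExistingRank),
              "old gate: joining and existing ranks take different paths");

// After the fix every rank in a grow skips DDA init.
static_assert(!entersDdaAfter(kJoiningRank) && !entersDdaAfter(kExistingRank),
              "new gate: all ranks skip DDA during a grow");

// Fresh init is unaffected by the extra term.
static_assert(entersDdaAfter(kFreshInit) == entersDdaBefore(kFreshInit),
              "fresh init behaves the same under both gates");
```

The checks run at compile time, so the file only needs to compile for the three claims to hold under this model.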

JIRA ID

Resolves AICOMRCCL-1262.

Test Plan

Ran the MPI unit test GrowMPITest.* with 8 ranks on a single MI300X node, which reproduced the hang prior to the change.

Test Result

The previously hanging test now completes init on all 8 ranks and passes the AllReduce verification, with no 300s timeout.

Submission Checklist

Skip DDA IPC init during ncclCommGrow. The DDA path was gated only on
!job->parent, so on a grow the new joining rank (parent == NULL) entered
ncclDdaIpcCommInit while existing ranks (parent != NULL) skipped it. The
collective bootstrapAllGather inside DDA init then deadlocked, since only
one rank participated, causing the grown communicator to hang and the test
to time out.
Add !job->isGrow to the gate so all ranks consistently skip DDA during a
grow. Fresh init and split behavior are unchanged.

Copilot AI left a comment

Pull request overview

Fixes a deadlock in RCCL communicator growth when expanding to an 8-rank, single-node communicator by ensuring the DDA IPC init path is not entered during ncclCommGrow, avoiding inconsistent participation in the DDA bootstrap collectives.

Changes:

  • Gate ncclDdaIpcCommInit() behind !job->isGrow so grow operations consistently skip DDA IPC initialization.
  • Preserve existing behavior for fresh communicator init and split/shrink paths.

Comment thread projects/rccl/src/init.cc

It looks like this issue was introduced by the new condition recently added in 3dc4fb1:

if (!job->parent && comm->nNodes == 1 && comm->nRanks == 8) {

My understanding is that CI was not fully working at that time, so the issue was not caught before merge. Once CI was relaunched, this failure showed up.
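
To illustrate why the quoted condition only misbehaves for a grow to exactly 8 ranks on one node, here is a small model that counts how many ranks would take the DDA path. It is a sketch under assumptions: `RankView`, `growTo`, and `ranksDisagree` are hypothetical helpers, and the grow is modelled as one joining rank (no parent) plus existing ranks (with a parent), all seeing the same target size.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-rank inputs to the condition; not RCCL's real structures.
struct RankView {
  bool hasParent;  // models job->parent != NULL
  int nNodes;      // models comm->nNodes
  int nRanks;      // models comm->nRanks
};

// The condition quoted above, restated over the model.
bool entersDda(const RankView& r) {
  return !r.hasParent && r.nNodes == 1 && r.nRanks == 8;
}

// Single-node grow to `target` ranks; the last rank is the joiner (assumed).
std::vector<RankView> growTo(int target) {
  std::vector<RankView> ranks(target - 1, RankView{true, 1, target});
  ranks.push_back(RankView{false, 1, target});
  return ranks;
}

// A collective inside DDA init can only finish if all ranks enter it,
// so "some but not all" is the hazardous case.
bool ranksDisagree(const std::vector<RankView>& ranks) {
  int entering = 0;
  for (const RankView& r : ranks) entering += entersDda(r) ? 1 : 0;
  return entering != 0 && entering != static_cast<int>(ranks.size());
}

// Quick self-check of the model.
void checkModel() {
  assert(ranksDisagree(growTo(8)));   // only the joiner enters: mismatch
  assert(!ranksDisagree(growTo(4)));  // nobody enters: consistent
}
```

Under this model, a grow to any size other than 8 leaves every rank outside the DDA path, which is consistent with the hang being reported only for the 8-rank single-node case.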

nikitaxgusev merged commit 088002b into develop on Jun 15, 2026
15 checks passed
nikitaxgusev deleted the users/atulkulk/fix-grow-dda-ipc-hang branch on June 15, 2026 08:06
yalmusaf pushed a commit that referenced this pull request Jun 18, 2026
...7231)
Reviewers

Copilot left review comments
nikitaxgusev approved these changes
thomas-huber approved these changes
