rocm: fix distributed inference on unified-memory APUs (strix halo / gfx1151) #407

Open

kyuz0 wants to merge 3 commits into antirez:main from kyuz0:rocm-multi-node

kyuz0 commented Jun 13, 2026


Fixes the ROCm backend's ds4_gpu_set_model_map_spans() and cuda_model_copy_chunked() to correctly handle split-model loading for distributed inference on unified-memory APUs (tested on Strix Halo / gfx1151).

Why the existing code didn't work

The ROCm backend copies model tensors into device memory via cuda_model_copy_chunked(). Unlike the CUDA backend, which uses cudaHostRegister to give the GPU direct access to the host mmap, the ROCm backend explicitly allocates and copies memory on Strix Halo.

Two issues:

1. cuda_model_copy_chunked() allocated model_size bytes regardless of which layers were assigned. A distributed worker loading layers 22–42 (~75 GiB) would try to allocate the full model (~160 GiB), OOMing on a 128 GiB machine.
2. ds4_gpu_set_model_map_spans() computed a bounding box over all spans and copied it as one contiguous range. A coordinator with layers 0:21 plus the output head at EOF has a bounding box covering nearly the full model file, even though the actual data is much smaller.
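To make the second issue concrete, here is a minimal host-only sketch of how a bounding box can dwarf the data it encloses. The `Span` type, the function names, and the sizes used in the usage note are hypothetical illustrations, not code or measurements from the ROCm backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: a byte range [offset, offset + size) in the model file.
struct Span {
    std::uint64_t offset;
    std::uint64_t size;
};

// Bytes actually covered by the spans (assumes the spans do not overlap).
inline std::uint64_t span_bytes(const std::vector<Span>& spans) {
    std::uint64_t total = 0;
    for (const Span& s : spans) total += s.size;
    return total;
}

// Size of the smallest single contiguous range that contains every span.
inline std::uint64_t bounding_box_bytes(const std::vector<Span>& spans) {
    if (spans.empty()) return 0;
    std::uint64_t lo = UINT64_MAX;
    std::uint64_t hi = 0;
    for (const Span& s : spans) {
        lo = std::min(lo, s.offset);
        hi = std::max(hi, s.offset + s.size);
    }
    return hi - lo;
}

// Fraction of the bounding box that holds no span data.
inline double waste_fraction(const std::vector<Span>& spans) {
    const std::uint64_t box = bounding_box_bytes(spans);
    assert(box > 0 && "need at least one non-empty span");
    return 1.0 - static_cast<double>(span_bytes(spans)) /
                 static_cast<double>(box);
}
```

With made-up sizes, a coordinator holding one 70 GiB span at offset 0 plus a 3 GiB span at offset 150 GiB has 73 GiB of data but a 153 GiB bounding box, so more than half of a single bulk copy would be wasted. A worker with one contiguous span has no waste.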

Changes

One file: rocm/ds4_rocm_runtime.cuh.

- Add device_offset to cuda_model_image to track where each device buffer maps to in the file.
- cuda_model_copy_chunked() now allocates and copies only map_size bytes starting from map_offset, not the full model.
- cuda_model_image_ptr() searches all images and subtracts device_offset when indexing, so existing tensor lookups work with partial images.
- ds4_gpu_set_model_map_spans() detects when the bounding box has large gaps (>10% waste). When it does, it sorts spans, merges adjacent ones (within 64 KiB), and issues a separate cuda_model_copy_chunked() per contiguous group. When spans are tight, the single-copy path is preserved.
- The arena allocator retries with smaller chunks when the preferred 1792 MiB allocation fails, handling setups with limited headroom.
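The span handling above can be sketched in host-only C++ roughly as follows. This is an illustration under stated assumptions, not the code in rocm/ds4_rocm_runtime.cuh: `Span`, `merge_spans`, and `bounding_box_is_wasteful` are hypothetical names, "within 64 KiB" is read as a gap of at most 64 KiB, and "waste" is taken to be the share of the bounding box not covered by any merged group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: a byte range [offset, offset + size) in the model file.
struct Span {
    std::uint64_t offset;
    std::uint64_t size;
};

constexpr std::uint64_t kMergeGap = 64 * 1024;  // 64 KiB, as in the description

// Sort spans by offset, then fold each span into the previous group when the
// gap between them is at most max_gap. Returns sorted, disjoint groups.
inline std::vector<Span> merge_spans(std::vector<Span> spans,
                                     std::uint64_t max_gap = kMergeGap) {
    std::sort(spans.begin(), spans.end(),
              [](const Span& a, const Span& b) { return a.offset < b.offset; });
    std::vector<Span> groups;
    for (const Span& s : spans) {
        if (s.size == 0) continue;  // nothing to copy
        if (!groups.empty()) {
            Span& g = groups.back();
            const std::uint64_t g_end = g.offset + g.size;
            if (s.offset <= g_end + max_gap) {
                const std::uint64_t s_end = s.offset + s.size;
                if (s_end > g_end) g.size = s_end - g.offset;  // extend group
                continue;
            }
        }
        groups.push_back(s);  // gap too large: start a new group
    }
    return groups;
}

// True when more than 10% of the bounding box lies outside every group.
// Expects the sorted, disjoint output of merge_spans().
inline bool bounding_box_is_wasteful(const std::vector<Span>& groups) {
    if (groups.empty()) return false;
    const std::uint64_t box =
        groups.back().offset + groups.back().size - groups.front().offset;
    std::uint64_t used = 0;
    for (const Span& g : groups) used += g.size;
    assert(used <= box);
    return (box - used) * 10 > box;
}
```

In this sketch a caller would issue one copy per returned group when `bounding_box_is_wasteful()` is true, and otherwise keep a single copy over the whole bounding box.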

Testing & Benchmarking

Tested on two Strix Halo nodes (128 GiB each, gfx1151), running the Q4 imatrix model (DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix, ~153 GiB) with coordinator on layers 0:21 and worker on layers 22:output.

Stability: ran SWE-bench Verified (mini) end-to-end, with over 24 hours of continuous inference and no crashes. Results: https://pi-local-coding-bench.dev/


Performance: Prefill and decode benchmarks across context sizes: https://kyuz0.github.io/strix-halo-ds4-toolbox/


Donato Capitella added 3 commits on June 10, 2026, 18:17


 rocm: fix split-model memory allocation for distributed inference

96abadf

cuda_model_copy_chunked() allocated and copied the full model_size
regardless of the map_offset/map_size parameters. For distributed
workers using --layers, this tried to allocate the entire model
(e.g. ~160 GiB) when only the assigned span range was needed
(e.g. ~75 GiB), causing an out-of-memory failure on unified memory
APUs like Strix Halo.

The fix makes the device image track its file offset via a new
device_offset field in cuda_model_image. cuda_model_copy_chunked()
now allocates only map_size bytes and copies from map_offset,
and cuda_model_image_ptr() subtracts device_offset when indexing
into the device buffer so existing tensor lookups remain correct.
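The offset translation can be sketched as below. `ModelImage` is a hypothetical stand-in for cuda_model_image: only the device_offset field is named in the PR, and the other fields, the function name, and the bounds check are assumptions made for illustration. The loop over several images reflects the PR description's note that cuda_model_image_ptr() searches all images.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cuda_model_image.
struct ModelImage {
    const void* model_map;        // host mapping this image was copied from
    unsigned char* device_ptr;    // start of the device buffer
    std::uint64_t device_offset;  // file offset of the buffer's first byte
    std::uint64_t size;           // bytes held in the buffer
};

// Translate a file offset into a device pointer. Returns nullptr when no
// image for this mapping fully covers [offset, offset + bytes).
inline unsigned char* image_ptr(const std::vector<ModelImage>& images,
                                const void* model_map,
                                std::uint64_t offset,
                                std::uint64_t bytes) {
    assert(model_map != nullptr);
    for (const ModelImage& img : images) {
        if (img.model_map != model_map) continue;
        if (offset < img.device_offset) continue;  // starts before this image
        const std::uint64_t rel = offset - img.device_offset;
        if (rel <= img.size && bytes <= img.size - rel) {
            return img.device_ptr + rel;  // file offset -> buffer offset
        }
    }
    return nullptr;
}
```

Because callers keep passing file offsets, tensor lookups need no changes when the device buffer holds only part of the file.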


 rocm: support non-contiguous span groups for distributed inference

9fb79b4

When a distributed node loads non-contiguous spans (e.g. a
coordinator with layers 0:21 plus the output head at EOF), the
bounding box covers nearly the entire model file. A single bulk
copy of that range would allocate the full model and OOM.

Sort the incoming spans, merge adjacent ones (within a 64 KiB gap),
and issue a separate cuda_model_copy_chunked() for each contiguous
group. Each group gets its own device image entry.

cuda_model_image_ptr() now searches all images for the one covering
the requested offset, and cuda_model_copy_chunked() no longer
short-circuits when an image already exists for the model. This
allows multiple disjoint images to coexist for the same model_map.

The arena allocator also retries with smaller chunks when the
preferred 1792 MiB allocation fails, handling memory-tight setups
where only a few hundred MiB of headroom remain.

For contiguous spans (e.g. a worker with layers 22:output where the
bounding box is tight), the existing single-allocation path is
preserved.
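The arena retry behaviour could look something like the host-only sketch below. The commit does not say how the chunk size shrinks, so halving down to the smallest size that still fits the request is an assumption, as are the names `Chunk` and `alloc_chunk_with_fallback`. The allocator is passed in as a callback so the logic can run without a GPU; in the real backend it would presumably wrap the device allocation call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

constexpr std::size_t kMiB = 1024 * 1024;
constexpr std::size_t kPreferredChunk = 1792 * kMiB;  // preferred arena chunk

// Hypothetical result type: ptr is nullptr when every attempt failed.
struct Chunk {
    void* ptr;
    std::size_t size;
};

// Try the preferred chunk size first; on failure, halve the size and retry,
// never going below min_bytes (the allocation that must fit in the chunk).
// try_alloc returns nullptr on failure.
inline Chunk alloc_chunk_with_fallback(
        const std::function<void*(std::size_t)>& try_alloc,
        std::size_t min_bytes,
        std::size_t preferred = kPreferredChunk) {
    assert(min_bytes > 0);
    std::size_t size = std::max(preferred, min_bytes);
    for (;;) {
        if (void* p = try_alloc(size)) return {p, size};
        if (size == min_bytes) return {nullptr, 0};  // smallest size failed too
        size = std::max(size / 2, min_bytes);
    }
}
```

With an allocator that can only satisfy requests up to 512 MiB, this sketch tries 1792 MiB, then 896 MiB, and succeeds at 448 MiB.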


 rocm: add comments for non-obvious distributed inference decisions

00e64ea
