rocm: fix distributed inference on unified-memory APUs (strix halo / gfx1151) #407
Open
kyuz0 wants to merge 3 commits into
cuda_model_copy_chunked() allocated and copied the full model_size regardless of the map_offset/map_size parameters. For distributed workers using --layers, this tried to allocate the entire model (e.g. ~160 GiB) when only the assigned span range was needed (e.g. ~75 GiB), causing an out-of-memory failure on unified memory APUs like Strix Halo. The fix makes the device image track its file offset via a new device_offset field in cuda_model_image. cuda_model_copy_chunked() now allocates only map_size bytes and copies from map_offset, and cuda_model_image_ptr() subtracts device_offset when indexing into the device buffer so existing tensor lookups remain correct.
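The offset translation described above can be sketched on the host as follows. This is only an illustration under stated assumptions: `ModelImage` and `image_ptr` are hypothetical stand-ins, not the actual `cuda_model_image` layout or the real `cuda_model_image_ptr()` signature, and a plain host buffer takes the place of device memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cuda_model_image: only the fields this sketch needs.
struct ModelImage {
    unsigned char *device_ptr;    // base of the buffer (host memory in this sketch)
    std::uint64_t device_offset;  // file offset that device_ptr[0] corresponds to
    std::uint64_t size;           // bytes held in the buffer (map_size, not model_size)
};

// Translate a file offset into a pointer inside a partial image.
// Returns nullptr when [offset, offset + len) is not fully covered by the image.
inline unsigned char *image_ptr(const ModelImage &img,
                                std::uint64_t offset,
                                std::uint64_t len) {
    assert(img.device_ptr != nullptr);
    if (offset < img.device_offset) return nullptr;      // starts before the image
    const std::uint64_t rel = offset - img.device_offset;
    if (rel > img.size || len > img.size - rel) return nullptr;  // runs past the end
    return img.device_ptr + rel;                          // index relative to the image base
}
```

Because callers keep passing file offsets, tensor lookups written against a full-model image would not need to change; only the subtraction of `device_offset` is new.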
When a distributed node loads non-contiguous spans (e.g. a coordinator with layers 0:21 plus the output head at EOF), the bounding box covers nearly the entire model file. A single bulk copy of that range would allocate the full model and OOM. Sort the incoming spans, merge adjacent ones (within a 64 KiB gap), and issue a separate cuda_model_copy_chunked() for each contiguous group. Each group gets its own device image entry. cuda_model_image_ptr() now searches all images for the one covering the requested offset, and cuda_model_copy_chunked() no longer short-circuits when an image already exists for the model. This allows multiple disjoint images to coexist for the same model_map. The arena allocator also retries with smaller chunks when the preferred 1792 MiB allocation fails, handling memory-tight setups where only a few hundred MiB of headroom remain. For contiguous spans (e.g. a worker with layers 22:output where the bounding box is tight), the existing single-allocation path is preserved.
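A minimal host-side sketch of the span grouping described above, under stated assumptions: `Span`, `merge_spans`, and `plan_copies` are hypothetical names, the 64 KiB gap is treated as inclusive, and the "more than 10% waste" threshold is taken from the PR description rather than from the code itself. Each returned range would correspond to one cuda_model_copy_chunked() call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical span type: a byte range in the model file.
struct Span {
    std::uint64_t offset;
    std::uint64_t size;
};

constexpr std::uint64_t kMergeGap = 64 * 1024;  // 64 KiB, as stated in the commit message

// Sort spans by file offset and merge neighbours that overlap or sit within max_gap bytes.
std::vector<Span> merge_spans(std::vector<Span> spans, std::uint64_t max_gap = kMergeGap) {
    std::sort(spans.begin(), spans.end(),
              [](const Span &a, const Span &b) { return a.offset < b.offset; });
    std::vector<Span> groups;
    for (const Span &s : spans) {
        if (s.size == 0) continue;  // nothing to copy for an empty span
        if (!groups.empty()) {
            Span &g = groups.back();
            const std::uint64_t g_end = g.offset + g.size;
            if (s.offset <= g_end || s.offset - g_end <= max_gap) {
                const std::uint64_t s_end = s.offset + s.size;
                g.size = std::max(g_end, s_end) - g.offset;  // extend the current group
                continue;
            }
        }
        groups.push_back(s);  // gap too large: start a new group
    }
    return groups;
}

// Decide between one bounding-box copy and one copy per contiguous group.
std::vector<Span> plan_copies(const std::vector<Span> &spans) {
    std::vector<Span> groups = merge_spans(spans);
    if (groups.size() <= 1) return groups;
    const std::uint64_t lo = groups.front().offset;
    const std::uint64_t hi = groups.back().offset + groups.back().size;
    std::uint64_t used = 0;
    for (const Span &g : groups) used += g.size;
    const std::uint64_t box = hi - lo;
    assert(used <= box);  // merged groups are disjoint and sorted
    const std::uint64_t waste = box - used;
    if (waste * 10 > box) return groups;          // more than 10% waste: copy per group
    return std::vector<Span>{Span{lo, box}};      // tight: keep the single-copy path
}
```

With a coordinator-style input (a large range at the start of the file plus a small one near EOF), `plan_copies` returns two ranges; with two ranges separated by a small gap it returns the single bounding range.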
Fixes the ROCm backend's ds4_gpu_set_model_map_spans() and cuda_model_copy_chunked() to correctly handle split-model loading for distributed inference on unified-memory APUs (tested on Strix Halo / gfx1151).

Why the existing code didn't work
The ROCm backend copies model tensors into device memory via cuda_model_copy_chunked(). Unlike the CUDA backend, which uses cudaHostRegister to give the GPU direct access to the host mmap, the ROCm backend explicitly allocates and copies memory on Strix Halo.

Two issues:
- cuda_model_copy_chunked() allocated model_size bytes regardless of which layers were assigned. A distributed worker loading layers 22–42 (~75 GiB) would try to allocate the full model (~160 GiB), OOMing on a 128 GiB machine.
- ds4_gpu_set_model_map_spans() computed a bounding box over all spans and copied it as one contiguous range. A coordinator with layers 0:21 plus the output head at EOF has a bounding box covering nearly the full model file, even though the actual data is much smaller.

Changes
One file: rocm/ds4_rocm_runtime.cuh.
- Adds device_offset to cuda_model_image to track where each device buffer maps to in the file.
- cuda_model_copy_chunked() now allocates and copies only map_size bytes starting from map_offset, not the full model.
- cuda_model_image_ptr() searches all images and subtracts device_offset when indexing, so existing tensor lookups work with partial images.
- ds4_gpu_set_model_map_spans() detects when the bounding box has large gaps (>10% waste). When it does, it sorts spans, merges adjacent ones (within 64 KiB), and issues a separate cuda_model_copy_chunked() per contiguous group. When spans are tight, the single-copy path is preserved.

Testing & Benchmarking
Tested on two Strix Halo nodes (128 GiB each, gfx1151), running the Q4 imatrix model (DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix, ~153 GiB) with coordinator on layers 0:21 and worker on layers 22:output.
- Stability: Ran SWE-bench Verified (mini) end-to-end. Over 24 hours of continuous inference without crashes. Results: https://pi-local-coding-bench.dev/
- Performance: Prefill and decode benchmarks across context sizes: https://kyuz0.github.io/strix-halo-ds4-toolbox/