Summary
This fixes the surviving HIP FlashAttention routing hole on the v0.3.2 two-GPU tensor-split + draft-mtp serving path for Gemma 4.
The original version of this PR fixed one D=256 route hole, but repeated-request serving was still not stable. With prompt reuse enabled, the server could still abort on later turns after slot reuse and checkpoint restore.
The updated change keeps the failing path serving-safe by widening the HIP D=256 fallback and by logging the full FA route inputs at the abort site.
Reproduction
Command under test:
GGML_CUDA_ALLREDUCE=nccl HIP_VISIBLE_DEVICES=0,1 build-rocm-rccl/bin/llama-server \
-hf "unsloth/gemma-4-31b-it-GGUF:UD-Q4_K_XL" \
--spec-type draft-mtp \
--spec-draft-n-max 10 \
--host 0.0.0.0 \
--port 8080 \
-fa on \
--reasoning on \
--reasoning-loop-min-tokens 16384 \
-ngl 999 \
-fit off \
--temp 1.0 --top-p 0.95 --top-k 64 \
--ctx-size 32768 \
-np 1 \
--threads 8 \
--mmap \
--no-mmproj \
--cache-ram 0 \
-sm tensor \
-ts 1,1 \
-ctk f16 \
-ctv f16 \
-b 2048 -ub 512 \
--metrics \
--log-timestamps
RCCL is active in the tested build:
librccl.so.1 is linked into libggml-hip.so
GGML_CUDA_NCCL:BOOL=ON
GGML_HIP_RCCL:BOOL=ON
Observed failing pattern before the update:
- first request succeeds
- second request may succeed
- a later request after web-UI-style chat reuse can abort with:
No CUDA FA kernel selected: K=f16 V=f16 D=256
The important part is not the browser UI specifically, but the request shape it drives:
- same slot reused by high LCP similarity
- restored context checkpoint
- MTP draft decode over reused conversation state
Root Cause
The crash is in the MTP draft decode path:
common_speculative_impl_draft_mtp::draft
llama_decode
ggml_cuda_flash_attn_ext
This was not an RCCL failure and not a generic tensor-split failure. It was a HIP FlashAttention route-selection hole that only shows up on reused-context draft graphs.
For the failing path, the effective FA inputs are still a legal f16/f16 D=256 attention shape, but the planner could still return BEST_FATTN_KERNEL_NONE after prompt reuse / checkpoint restore. That fed the unconditional abort in ggml_cuda_flash_attn_ext().
Change
- Broaden the HIP tile fallback in ggml_cuda_fattn_make_route_plan().
Previously the fallback was too narrow. It now covers HIP f16/f16 D=256 reused-context shapes as long as the K sequence length is stride-compatible, instead of rejecting them because of an over-strict mask gate.
- Add targeted route diagnostics.
When route debug is enabled, the planner now logs:
- Q/K/V shapes
- mask shape and strides
- raw and effective K/V types
- selected kernel
- none_reason when selection falls through
If the selector ever still reaches BEST_FATTN_KERNEL_NONE, the abort log now prints the full failing node shape instead of only K, V, and D.
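As a rough illustration of the two changes above, the following is a minimal standalone sketch, not the code in fattn.cu. Every name in it (Kernel, FaInputs, RoutePlan, plan_route, describe, K_STRIDE, MASK_PAD) is hypothetical, and the concrete conditions are assumptions made for the example: "stride-compatible" is modelled as the K length being a multiple of 256, and the "over-strict mask gate" is modelled as a requirement that the mask row count be a multiple of 16. The PR text does not state the real conditions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative stand-ins only; these are NOT the types used in fattn.cu.
enum class Kernel { None, Tile };
enum class DType  { F16, Other };

struct FaInputs {
    DType   k_type;
    DType   v_type;
    int64_t head_dim;   // D
    int64_t k_len;      // K sequence length
    int64_t mask_rows;  // rows in the attention mask, 0 if there is no mask
};

struct RoutePlan {
    Kernel      kernel;
    std::string none_reason;  // empty when a kernel was selected
};

// Assumed granularities, chosen for the example rather than taken from the PR.
constexpr int64_t K_STRIDE = 256;
constexpr int64_t MASK_PAD = 16;

// strict_mask_gate = true models the old, narrower fallback;
// strict_mask_gate = false models the widened one.
RoutePlan plan_route(const FaInputs & in, bool strict_mask_gate) {
    assert(in.head_dim > 0 && in.k_len > 0);
    if (in.k_type != DType::F16 || in.v_type != DType::F16) {
        return {Kernel::None, "K/V not f16"};
    }
    if (in.head_dim != 256) {
        return {Kernel::None, "D != 256"};
    }
    if (in.k_len % K_STRIDE != 0) {
        return {Kernel::None, "K length not stride-compatible"};
    }
    if (strict_mask_gate && in.mask_rows % MASK_PAD != 0) {
        return {Kernel::None, "mask rows not padded"};
    }
    return {Kernel::Tile, ""};
}

// Builds the kind of self-describing message an abort site could print.
std::string describe(const FaInputs & in, const RoutePlan & plan) {
    std::string s = "D=" + std::to_string(in.head_dim)
                  + " k_len=" + std::to_string(in.k_len)
                  + " mask_rows=" + std::to_string(in.mask_rows)
                  + " kernel=" + (plan.kernel == Kernel::Tile ? "tile" : "none");
    if (plan.kernel == Kernel::None) {
        s += " none_reason=\"" + plan.none_reason + "\"";
    }
    return s;
}
```

In this sketch, an f16/f16 D=256 input with k_len = 2048 and a single mask row is rejected when strict_mask_gate is true and routed to the tile kernel when it is false. Any input that still falls through carries a reason string that the caller can log before aborting.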
Validation
Built the HIP backend with:
cmake --build build-rocm-rccl --config RelWithDebInfo --target ggml-hip -j 8
Validated with a persistent single-slot chat conversation against the same server config, including:
- LCP-based slot reuse
- restored context checkpoints
- repeated short turns after a long code-generation turn
- repeated long + short mixed turns in one server process
Observed post-fix behavior on the test host:
- restored checkpoint at pos_min = 271, pos_max = 1806, size = 800.013 MiB
- repeated high-LCP reuse (sim_best up to 0.995)
- no FA abort through 8 sequential same-thread requests
- graphs reused continued to increase (324 -> 1001 in the captured run)
Representative timings from the reused-context run:
- initial long prompt: 175.87 tok/s prompt, 77.91 tok/s eval, acceptance 0.56006
- later reused short turns: prompt around 81-85 tok/s, eval around 40-46 tok/s
- later reused long turn: 517.23 tok/s prompt, 41.90 tok/s eval after restored checkpoint
Scope
This PR changes only ggml/src/ggml-cuda/fattn.cu.
It does not change:
- tensor split policy
- MTP scheduling
- RCCL setup
- server checkpoint logic
- sampler placement logic
There is a separate local sampler fallback fix outside this PR; it is intentionally not included here.
Impact
This makes the HIP two-GPU tensor-split + draft-mtp serving path more robust under real repeated-request chat reuse. The remaining f16/f16 D=256 route hole now takes a safe tile fallback, and any future selector failure is self-describing.