Qwen3.5-4B DeltaNet layers: FP32 dequant bottleneck causes 0.7 tok/s #70
Description
Qwen3.5-4B loads and generates coherent output, but inference is extremely slow at ~0.7 tok/s on Apple M3. The apparent bottleneck is that the DeltaNet attention layers are dequantized to FP32 at load time, so inference runs on FP32 weights instead of quantized ones.
Benchmark
| Model | Params | Vocab | tok/s | Notes |
|---|---|---|---|---|
| Phi-3.5-mini (Q8) | 3.8B | 32K | ~8 | Fast |
| SmolLM2-1.7B (Q8) | 1.7B | 49K | ~12.5 | Fastest |
| Qwen3.5-4B (Q4) | 4B | 248K | ~0.7 | ~11x slower than Phi-3.5, ~18x slower than SmolLM2 |
Root Cause
Server log shows all 24 DeltaNet layers being dequantized to FP32:
tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
...
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
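To check how many tensors are affected, the log lines above can be tallied with a small script. This is a minimal sketch that assumes the log format is exactly as shown; `parse_dequant_lines` is a hypothetical helper, not part of quant.h.

```python
import re

# Pattern derived from the log lines shown in this issue (assumed stable).
DEQUANT_RE = re.compile(
    r"tq_load_gguf: layer (\d+) (\S+) dequant to FP32 \(was type (\d+)\)"
)

def parse_dequant_lines(log_text):
    """Return (layer, tensor_name, source_type) for each dequant log line."""
    found = []
    for line in log_text.splitlines():
        match = DEQUANT_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return found

if __name__ == "__main__":
    import sys
    # Usage: python count_dequant.py < server.log
    entries = parse_dequant_lines(sys.stdin.read())
    print(f"{len(entries)} tensors dequantized to FP32")
    print("source types seen:", sorted({t for _, _, t in entries}))
```

Running this over the full server log should confirm the layer count and show whether every affected tensor has the same source type.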
Two bottlenecks:
- DeltaNet FP32 dequant: the full QKV tensors of 24 layers are converted to FP32 at load time, which consumes far more memory and removes the speed benefit of quantization.
- 248K vocab output projection: every token requires a 2560 × 248K matmul for logit computation. That is about 6.5x as many weights as Phi-3.5's projection (3072 × 32K); the vocabulary alone is about 7.7x larger.
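The sizes behind both bottlenecks can be checked with a few lines of arithmetic. This is a rough sketch: it reads "248K" and "32K" as 248,000 and 32,000, and it assumes that source type 13 is a 5.5-bit K-quant (176 bytes per 256-weight block), which should be verified against the ggml type enum.

```python
# Back-of-envelope sizes for the two bottlenecks.

def projection_weights(hidden_dim, vocab_size):
    """Number of weights in a hidden_dim x vocab_size output projection."""
    return hidden_dim * vocab_size

qwen_weights = projection_weights(2560, 248_000)
phi_weights = projection_weights(3072, 32_000)

weight_ratio = qwen_weights / phi_weights  # per-token logit matmul size
vocab_ratio = 248_000 / 32_000             # vocabulary size only

# Bits per weight: FP32 is 32. The quantized figure is an assumption
# (type 13 taken to be a 5.5-bit K-quant); adjust it if that is wrong.
FP32_BITS = 32.0
ASSUMED_QUANT_BITS = 5.5
expansion = FP32_BITS / ASSUMED_QUANT_BITS  # memory growth after dequant

print(f"projection weight ratio: {weight_ratio:.2f}x")
print(f"vocabulary ratio:        {vocab_ratio:.2f}x")
print(f"FP32 vs quantized bytes: {expansion:.2f}x")
```

Under these assumptions the output projection is about 6.5x larger than Phi-3.5's by weight count (the vocabulary alone is 7.75x larger), and each dequantized tensor occupies nearly 6x the memory of its quantized form.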
Impact
At 0.7 tok/s, generating 80 tokens takes ~103 seconds — unusable for interactive chat. Despite Qwen3.5-4B having the best quality among tested models, the speed makes it impractical.
Suggested Optimizations
- Keep DeltaNet layers in quantized format: run the Q4/Q8 matmul directly on the quantized weights instead of dequantizing to FP32.
- Optimize the vocab projection: for large-vocab models, consider top-k logit computation or speculative sampling.
- Add a DeltaNet-specific kernel: linear attention doesn't need a full KV cache, which a dedicated kernel could exploit for speed.
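The first suggestion can be illustrated with a simplified block-quantized dot product. This is a sketch under stated assumptions: it uses a Q8_0-style layout (one scale per block of 32 int8 weights), not the K-quant layout in the GGUF file, and `quantize_row`, `dot_quantized`, and `matvec_quantized` are hypothetical names rather than quant.h functions. The point is that the weights are scaled once per block during the dot product, so no FP32 copy of the tensor is ever stored.

```python
BLOCK = 32  # weights per block (Q8_0-style layout, assumed for illustration)

def quantize_row(row):
    """Quantize a row of floats into blocks of (scale, list of int8 values)."""
    blocks = []
    for start in range(0, len(row), BLOCK):
        chunk = row[start:start + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [int(round(w / scale)) for w in chunk]))
    return blocks

def dot_quantized(blocks, x):
    """Dot product of a quantized row with float activations x.

    The integer weights are multiplied directly; the scale is applied
    once per block, so the row is never expanded to FP32 in memory.
    """
    total = 0.0
    for i, (scale, q) in enumerate(blocks):
        base = i * BLOCK
        acc = 0.0
        for j, qw in enumerate(q):
            acc += qw * x[base + j]
        total += scale * acc
    return total

def matvec_quantized(qrows, x):
    """Matrix-vector product where each matrix row is block-quantized."""
    return [dot_quantized(blocks, x) for blocks in qrows]
```

A real kernel would also quantize the activations per block and use integer SIMD dot products, but the memory behavior is the same: weights stay in their compact form and only a per-block scale is applied on the fly.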
Environment
- Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
- Hardware: Apple M3, 8-core, 16GB
- Build: quant.h single-header
Reported by ClawTeam Claw-4 (Optimizer)