Qwen3.5-4B DeltaNet layers: FP32 dequant bottleneck causes 0.7 tok/s #70

Status: Open
Labels: enhancement
Opened by unamedkr on Apr 12, 2026

Description

Qwen3.5-4B loads and generates coherent output, but inference is extremely slow at ~0.7 tok/s on Apple M3. The bottleneck is the FP32 dequantization of the DeltaNet attention layers at load time: once dequantized, these layers no longer benefit from quantized weights during decoding.

Benchmark

Model	Params	Vocab	tok/s	Notes
Phi-3.5-mini (Q8)	3.8B	32K	~8	Fast
SmolLM2-1.7B (Q8)	1.7B	49K	~12.5	Fastest
Qwen3.5-4B (Q4)	4B	248K	~0.7	~11x slower than Phi-3.5, ~18x slower than SmolLM2
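
The relative slowdowns can be recomputed directly from the tok/s column. A minimal sketch; the dictionary simply restates the table, and the helper name `slowdown` is made up for illustration:

```python
# Throughput figures copied from the benchmark table (tokens per second).
tok_per_s = {
    "Phi-3.5-mini (Q8)": 8.0,
    "SmolLM2-1.7B (Q8)": 12.5,
    "Qwen3.5-4B (Q4)": 0.7,
}

QWEN = "Qwen3.5-4B (Q4)"


def slowdown(baseline: str, target: str = QWEN) -> float:
    """How many times slower `target` is than `baseline`."""
    return tok_per_s[baseline] / tok_per_s[target]


for name in tok_per_s:
    if name != QWEN:
        print(f"{QWEN} is {slowdown(name):.1f}x slower than {name}")
```

Note that these ratios compare different model sizes and quantization levels, so they only indicate the scale of the gap, not a like-for-like regression.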

Root Cause

Server log shows all 24 DeltaNet layers being dequantized to FP32:

tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
...
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
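
To confirm which layers are affected without reading the log by hand, the dequant messages can be counted with a small script. This is a sketch that assumes the message format is exactly as in the excerpt above; `summarize_dequant` is a hypothetical helper, not part of quant.h:

```python
import re

# Matches the loader message format shown in the log excerpt.
DEQUANT_RE = re.compile(
    r"tq_load_gguf: layer (\d+) (\S+) dequant to FP32 \(was type (\d+)\)"
)


def summarize_dequant(log_text: str):
    """Return (sorted layer indices, set of source type ids) for dequantized tensors."""
    layers, types = set(), set()
    for line in log_text.splitlines():
        match = DEQUANT_RE.search(line)
        if match:
            layers.add(int(match.group(1)))
            types.add(int(match.group(3)))
    return sorted(layers), types


# Only the three lines quoted in this issue; the elided lines are not reconstructed.
sample = """\
tq_load_gguf: layer 0 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 1 attn_qkv dequant to FP32 (was type 13)
tq_load_gguf: layer 30 attn_qkv dequant to FP32 (was type 13)
"""

layers, types = summarize_dequant(sample)
print(f"{len(layers)} layers dequantized: {layers}, source types: {sorted(types)}")
```

Run against the full server log, the layer count should come out to 24 if the statement above holds.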

Two bottlenecks:

DeltaNet FP32 dequant: 24 layers × full QKV tensors are converted to FP32 at load time, which inflates memory use and removes the speed benefit of quantization.
248K vocab output projection: every token requires a 2560 × 248K matmul for logit computation. The vocabulary is ~7.7x larger than Phi-3.5's, and the projection matrix is ~6.5x larger (2560 × 248K vs. 3072 × 32K).
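
A rough back-of-the-envelope for both bottlenecks. This is a sketch under two assumptions: that "type 13" follows the GGML type enum (where 13 is Q5_K), and that the block sizes below match the GGML K-quant layouts (144 bytes per 256 weights for Q4_K, 176 bytes for Q5_K). The vocab sizes are the rounded figures from the table, not exact values:

```python
# Bytes per 256-weight block for GGML K-quant formats (assumed layouts).
BYTES_PER_BLOCK = {"Q4_K": 144, "Q5_K": 176}
BLOCK_WEIGHTS = 256
FP32_BITS = 32


def bits_per_weight(fmt: str) -> float:
    """Average storage cost of one weight in the given block format."""
    return BYTES_PER_BLOCK[fmt] * 8 / BLOCK_WEIGHTS


def fp32_inflation(fmt: str) -> float:
    """How many times more bytes a tensor takes after dequantizing to FP32."""
    return FP32_BITS / bits_per_weight(fmt)


# Output projection sizes (hidden_dim x vocab), using the rounded vocab sizes.
qwen_proj = 2560 * 248_000
phi_proj = 3072 * 32_000
proj_ratio = qwen_proj / phi_proj
vocab_ratio = 248_000 / 32_000

print(f"Q5_K -> FP32: {fp32_inflation('Q5_K'):.1f}x more bytes per weight")
print(f"Q4_K -> FP32: {fp32_inflation('Q4_K'):.1f}x more bytes per weight")
print(f"projection weights: {proj_ratio:.2f}x, vocab: {vocab_ratio:.2f}x")
```

Since token-by-token decoding is typically limited by how many weight bytes must be read per token, a roughly 6x to 7x increase in bytes for the affected tensors would be expected to show up directly in tok/s.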

Impact

At ~0.7 tok/s, generating 80 tokens takes well over 100 seconds (80 / 0.7 ≈ 114 s), which is unusable for interactive chat. Although Qwen3.5-4B has the best output quality among the tested models, this speed makes it impractical.

Suggested Optimizations

Keep DeltaNet layers in quantized format — use Q4/Q8 matmul directly instead of FP32 dequant
Optimize vocab projection — for large-vocab models, consider top-k logit computation or speculative sampling
DeltaNet-specific kernel — linear attention doesn't need full KV cache, leverage this for speed
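
The first suggestion can be illustrated with a block-quantized dot product that never expands a weight row into a full FP32 copy. This is a pure-Python sketch of the idea only, using a simplified Q8_0-style layout (32 int8 weights plus one scale per block); the function names are hypothetical, and it is not the quant.h kernel or the Q4_K/Q5_K format:

```python
import math

BLOCK = 32  # weights per block, as in Q8_0-style formats


def quantize_q8_blocks(weights):
    """Quantize a float list into (scale, int8 values) blocks of BLOCK weights."""
    assert len(weights) % BLOCK == 0
    blocks = []
    for start in range(0, len(weights), BLOCK):
        chunk = weights[start:start + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        q = [max(-127, min(127, round(w / scale))) for w in chunk]
        blocks.append((scale, q))
    return blocks


def dot_quantized(blocks, x):
    """Dot product of a block-quantized row with activations x.

    Each block accumulates sum(q_i * x_i) and applies its scale once,
    so only one block is ever "live" and no FP32 copy of the row exists.
    """
    total = 0.0
    for b, (scale, q) in enumerate(blocks):
        base = b * BLOCK
        acc = 0.0
        for j, qj in enumerate(q):
            acc += qj * x[base + j]
        total += scale * acc
    return total


# Compare against the exact FP dot product on synthetic data.
w = [math.sin(0.1 * i) for i in range(64)]
x = [math.cos(0.05 * i) for i in range(64)]
exact = sum(a * b for a, b in zip(w, x))
approx = dot_quantized(quantize_q8_blocks(w), x)
print(f"exact={exact:.4f} approx={approx:.4f} abs_err={abs(exact - approx):.4f}")
```

Production kernels usually go further by quantizing the activations as well and using integer SIMD dot products, but the key point is the same: weights stay in their compact format in memory and are decoded block by block inside the matmul.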

Environment

Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
Hardware: Apple M3, 8-core, 16GB
Build: quant.h single-header

Reported by ClawTeam Claw-4 (Optimizer)

