The Numbers
Per-user generation speed:
- V4-Flash: 60–85% faster than MTP-1 baseline
- V4-Pro: 57–78% faster at matched throughput
Acceptance length improvements:
- vs. Eagle3: 26.7–30.9% longer accepted sequences
- vs. DFlash: 16.3–18.4% improvement
Domain-specific confidence pruning:
- Chat acceptance: 45.7% → 95.7%
- Math reasoning: 76.9% → 92.5%
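As a quick sanity check on what "X% faster" means in wall-clock terms, the sketch below converts a speedup percentage into the fraction of baseline generation time. It assumes "faster" refers to tokens per second per user and that response length is unchanged; the function name is illustrative and not taken from DeepSeek's materials.

```python
def time_fraction(pct_faster: float) -> float:
    """Fraction of baseline wall-clock time needed after a throughput gain.

    Assumes pct_faster means tokens/second rose by that percentage,
    so generation time scales by 1 / (1 + pct_faster / 100).
    """
    if pct_faster <= -100:
        raise ValueError("pct_faster must be greater than -100")
    return 1.0 / (1.0 + pct_faster / 100.0)


if __name__ == "__main__":
    # Reported per-user generation speed ranges from the list above.
    for label, low, high in [("V4-Flash", 60, 85), ("V4-Pro", 57, 78)]:
        print(f"{label}: {time_fraction(high):.1%} to "
              f"{time_fraction(low):.1%} of baseline time")
```

Under these assumptions, an 85% speedup cuts a fixed-length response to roughly 54% of its original generation time, not to 15%.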
The Open-Source Play: DeepSpec
DSpark is not just an API upgrade. DeepSeek open-sourced DeepSpec, an MIT-licensed codebase for training and evaluating speculative decoding draft models. It supports DSpark, DFlash, and Eagle3 algorithms with configs for Qwen3 and Gemma4 targets.
The production checkpoints reuse existing V4 weights with an attached draft module — no target model retraining required.
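To see why a draft module can be attached without retraining the target, here is a minimal sketch of generic greedy speculative decoding. It is not DSpark's actual algorithm, and the toy models and function names are hypothetical. The point it illustrates: the target verifies every drafted token, so the output matches what the target alone would produce, and the draft only affects how many tokens are accepted per step.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """One draft-and-verify step; returns the tokens to append to prefix."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal. A real system would score all k
    #    positions in one batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal was accepted: the target adds one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prefix: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens after prefix using repeated speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prefix) + n_tokens]


# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees except right after a 5, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
```

A better draft raises the number of tokens accepted per step (the acceptance length), which is the quantity the Eagle3 and DFlash comparisons measure.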
Hardware vs. Algorithms
The hardware route: GPT-5.6 Sol on Cerebras at 750 tok/s. Requires a partnership, government access, deep pockets.
The algorithm route: DSpark on commodity GPUs. Up to 85% faster per-user generation, open-sourced, and trainable for non-DeepSeek targets such as Qwen3 and Gemma4.
DeepSeek V4 Flash scores 79.0% on SWE-bench Verified at $0.14/$0.28 per million tokens, 150x cheaper than GPT-5.5 with input caching. Add DSpark's speed improvement on top and the gap widens further.
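To make the pricing concrete, here is a small cost estimate. It assumes the two quoted figures are the input and output prices per million tokens (the usual convention, though the text above does not say so), and the workload is made up for illustration.

```python
from decimal import Decimal

# Assumption: 0.14 is the input price and 0.28 the output price,
# both in dollars per million tokens.
INPUT_PRICE_PER_M = Decimal("0.14")
OUTPUT_PRICE_PER_M = Decimal("0.28")
MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one request at the assumed per-million-token prices."""
    return (Decimal(input_tokens) * INPUT_PRICE_PER_M
            + Decimal(output_tokens) * OUTPUT_PRICE_PER_M) / MILLION


if __name__ == "__main__":
    # Hypothetical workload: 10,000 agent runs, each 50k tokens in, 5k out.
    per_run = request_cost(50_000, 5_000)
    print(f"per run: ${per_run}, per 10,000 runs: ${per_run * 10_000}")
```

Under these assumptions a single run costs $0.0084, or $84 for 10,000 runs. Cache discounts and any per-request fees are not modeled.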
What This Means for Operators
- Running DeepSeek V4? Attach the DSpark module. No retraining needed.
- Running other open models? DeepSpec provides the training framework for Qwen3 and Gemma4.
- Evaluating open vs. closed? The latency gap, the one area where custom silicon had a clear edge, is under direct attack.
You don't need a Cerebras contract or a government preview slot for fast inference. You need a good algorithm and the willingness to let anyone use it.
💡 DSpark checkpoints are live on Hugging Face. DeepSpec is MIT-licensed on GitHub.
Originally published at ComputeLeap