Seqev/dcr-attention_v3

DCR-Attention v3.1

Sparse-attention KV-cache work on Llama-3.2-1B (RTX 4060 Ti). This repo documents both what works and what does not — including findings we retracted before publication.

Result

At the hero configuration (N=32K, B=4, c=0.15), M-class kernel work moved decode latency from sub-parity to above parity vs SDPA:

| Path | e2e (ms) | vs SDPA | vs M4 |
|---|---|---|---|
| SDPA | 198.64 | 1.00× | - |
| M4 (v3.0) | 228.43 | 0.87× | 1.00× |
| M6 + M5-mixed (v3.1) | 187.29 | 1.06× | 1.22× |
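
The ratio columns follow directly from the e2e column. The sketch below assumes each ratio is defined as baseline latency divided by path latency, so values above 1 mean the path is faster than the baseline. That definition is an assumption, chosen because it matches the "sub-parity to above parity" wording; the `speedup` helper and the `E2E_MS` dictionary are illustrative names, not part of the repository.

```python
# e2e decode latencies in ms, copied from the table above.
E2E_MS = {
    "SDPA": 198.64,
    "M4 (v3.0)": 228.43,
    "M6 + M5-mixed (v3.1)": 187.29,
}


def speedup(baseline_ms: float, path_ms: float) -> float:
    """Speedup of a path over a baseline; > 1 means the path is faster.

    Assumption: ratio = baseline latency / path latency.
    """
    return baseline_ms / path_ms


for name, ms in E2E_MS.items():
    vs_sdpa = speedup(E2E_MS["SDPA"], ms)
    vs_m4 = speedup(E2E_MS["M4 (v3.0)"], ms)
    print(f"{name:<22} {ms:>7.2f} ms  {vs_sdpa:.2f}x vs SDPA  {vs_m4:.2f}x vs M4")
```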

Clean theoretical ceiling (0-cost attention kernel): […]×. Production captures ~26% of the M4-parity → ceiling gap.

Numbers are canonical: 50 warmup iterations, 30 timed iterations, 3 randomized-order sessions. Variance at the hero configuration is 0.098%.
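
A minimal sketch of what such a protocol can look like, for readers who want to apply it to their own paths. This is not the repository's harness: `time_path`, `run_sessions`, and `variance_pct` are hypothetical names, and reading the variance figure as a coefficient of variation across session means is an assumption. It uses CPU wall-clock timing; for CUDA kernels a device synchronization (for example `torch.cuda.synchronize()`) would be needed before each clock read.

```python
import random
import statistics
import time
from typing import Callable, Dict, List


def time_path(fn: Callable[[], None], warmup: int, timed: int) -> float:
    """Mean wall-clock ms per call, after discarding warmup iterations."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(timed):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.fmean(samples)


def run_sessions(
    paths: Dict[str, Callable[[], None]],
    sessions: int = 3,
    warmup: int = 50,
    timed: int = 30,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Measure every path once per session, in a freshly shuffled order."""
    rng = random.Random(seed)
    results: Dict[str, List[float]] = {name: [] for name in paths}
    for _ in range(sessions):
        order = list(paths)
        rng.shuffle(order)  # no path is systematically measured last
        for name in order:
            results[name].append(time_path(paths[name], warmup, timed))
    return results


def variance_pct(session_means: List[float]) -> float:
    """Spread across sessions as a percentage of the mean (assumed definition)."""
    return 100.0 * statistics.pstdev(session_means) / statistics.fmean(session_means)


if __name__ == "__main__":
    demo_paths = {
        "path_a": lambda: sum(range(2000)),
        "path_b": lambda: sum(x * x for x in range(2000)),
    }
    for name, means in run_sessions(demo_paths).items():
        print(name, [round(m, 4) for m in means], f"{variance_pct(means):.2f}%")
```

Shuffling the order inside every session is the part that addresses order-dependent bias: each path gets measured in different positions, so cache, clock, and thermal effects do not consistently favor one of them.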

What the work actually contributes

The value is in the science, not the speedup. Eight characterized negative results:

| # | Statement |
|---|---|
| 1 | Synthetic-data quality validation does not transfer to real LLMs. |
| 2 | Mean-K statistics are dominated by attention sinks; static projection axes are a dead end. |
| 3 | Index-set overlap is a false quality metric; only output cosine similarity is a valid gate. |
| 4 | Top-K trajectory drift saturates (~47% by step 50); no warm-start amortization; per-Q-step is structurally necessary. |
| 5 | Dispatch/non-attention overhead is orthogonal to the algorithmic ratio. |
| 6 | cuBLAS degrades at small GEMM dims (M=4 GQA): no Tensor-Core engagement, launch overhead dominates. |
| 7 | INT4 KV is infeasible on 1B-class models; KIVI's <0.1% PPL claim does not scale down (outlier-driven, not sink-driven). |
| 8 | Sequential benchmarking with low warmup produces ~9% optimistic bias for later-measured paths. Rigorous comparison needs randomized-order, high-warmup, multi-session protocols. |
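
Statement 3 contrasts two ways of scoring a sparse selection. The pure-Python sketch below computes both for a single query: the overlap between an approximate top-K index set and the exact one, and the cosine similarity between the sparse attention output and the dense output. It is an illustration on random data, not the repository's kernel or evaluation code. All function names, the sizes `n`, `d`, `k`, and the noisy-query stand-in for an approximate selector are assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, indices):
    """Softmax attention of q restricted to the given key positions."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]


def top_k_indices(q, keys, k):
    """Positions of the k keys with the largest dot product against q."""
    scores = [dot(q, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def jaccard(a, b):
    """Index-set overlap: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, k = 256, 16, 38  # hypothetical sizes for the demo
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    q = [rng.gauss(0, 1) for _ in range(d)]

    dense = attend(q, keys, values, range(n))
    exact = top_k_indices(q, keys, k)
    # Stand-in for an approximate selector: rank keys with a perturbed query.
    noisy_q = [x + rng.gauss(0, 0.3) for x in q]
    approx = top_k_indices(noisy_q, keys, k)

    print("index overlap vs exact top-K:", jaccard(exact, approx))
    print("output cosine, exact top-K  :", cosine(dense, attend(q, keys, values, exact)))
    print("output cosine, approx top-K :", cosine(dense, attend(q, keys, values, approx)))
```

The two numbers answer different questions. Overlap treats every missed index equally, while the output cosine weights each miss by how much attention mass and value contribution it carried, which is what a quality gate on the model output actually depends on.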

Postulate 8 is the centerpiece: an earlier intermediate claim of […]× hero was corrected to […]× by canonical re-measurement. We caught our own optimistic bias pre-publication because the project was built to catch it.

Structure

| Path | Contents |
|---|---|
| docs/paper_rewrite_scope_memo.md | Scope memo for the v3.1 paper (incl. retraction ledger) |
| results/ | Canonical measurements + key falsification artifacts |
| REPRODUCIBILITY.md | Env, seeds, protocol |

Status

Work-in-progress. This drop is the scope memo + measurement artifacts; the full v3.1 manuscript is a separate forthcoming rewrite. The retraction ledger (scope memo §5) is kept public deliberately as a discipline record.

Environment

Llama-3.2-1B · RTX 4060 Ti · torch 2.5.1+cu121 · triton 3.1.0 · seed 0

About

Top-K sparse attention for long-context decode on Llama-3.2-1B (v3 release)
