-
Notifications
You must be signed in to change notification settings - Fork 565
Missing ANN regime: 2–4-bit scalar-quantized FastScan search index #520
Description
Missing ANN regime: 2–4-bit scalar-quantized FastScan search index
Summary
ruvector has strong ANN coverage but is missing the 2–4-bit scalar-quantized search regime that every mainstream production vector DB lives in (FAISS IndexPQFastScan, Qdrant, Milvus IVF-SQ). This issue makes the case for adding it as a small, self-contained crate (ruvector-turbovec, proposed in ADR-194). A working scalar-reference milestone (M1) is implemented and validated — see the companion PR #521 .
The gap
| Crate | Index family | Quantization | Suitable for "search 10M vectors for nearest 10"? |
|---|---|---|---|
ruvector-core |
HNSW (graph) | none (f32) | yes, but full f32 memory |
ruvector-diskann |
DiskANN (graph) | none / PQ-ish | yes |
ruvector-rairs |
IVF (ADR-193) | optional | yes |
ruvector-rabitq |
Flat + rotation | 1-bit only | yes, but 1-bit caps recall |
ruvllm turbo_quant.rs |
— (value codec) | 2.5–4 bit | no — KV-cache compressor, not an index |
Two things sit near this regime but neither closes it:
ruvector-rabitqis 1-bit. It's excellent when memory dominates, but on 1536–3072-dim production embeddings (OpenAItext-embedding-3, Cohere, Voyage) you need an exact-rerank pass over the raw f32 originals to hit production R@1 — which re-inflates memory back toward the f32 footprint, partly defeating the point.ruvllm/src/quantize/turbo_quant.rsis a TurboQuant value codec for transformer KV-cache / embedding compression. It has no inverted lists, no top-k heap, no FastScan LUT kernel, no filtered/allowlist search, no stable IDs/deletion. Wrong abstraction for search.
What's missing, concretely
- 2–4 bits per dimension (not 1, not 32).
- Codebook-free scalar quantization — no k-means training, online ingest.
- A FastScan-style nibble-LUT SIMD kernel that scores 16/32 candidates per vector instruction without ever materializing f32.
- Competitive recall without a mandatory f32 rerank pass.
Why it matters
- Memory/recall trade-off that's actually new. The per-vector length-renormalization scalar (
c_x) makes the inner-product estimator unbiased, so 2/4-bit recall holds without the rerank tax that 1-bit RaBitQ pays. The memory win is real, not borrowed-back at query time. - Parity with the field. It puts ruvector in the same regime as FAISS
IndexPQFastScan/ Milvus IVF-SQ, which is what most teams reach for when they outgrow flat f32 but can't afford 1-bit's recall hit. - Reuse, not sprawl. It implements the existing
ruvector_rabitq::AnnIndextrait and reusesRandomRotation— so it drops into the same dispatcher/consumers with no new plumbing, and adds exactly one workspace crate. - Roadmap fit. The same block-SoA codes can later back an IVF posting list (IVF-SQ-FastScan) — a natural composition with
ruvector-rairs(ADR-193). It's also adjacent to the in-flight SymphonyQG research (research(nightly): SymphonyQG — graph-coupled 4-bit FastScan neighbor scoring #443 , research(nightly): symphonyqg — co-designed 1-bit graph quantization (SIGMOD 2025) #428 : graph-coupled 4-bit FastScan scoring), and complements the merged TurboQuant KV-cache work (feat(ruvllm): TurboQuant KV cache & vector compression #297 ) by covering the search-index side those didn't.
Evidence it works (M1, measured)
n=5,000 uniform-random vectors (worst case for ANN), dim=256, k=10, no f32 rerank, vs exact brute-force L2:
| Width | recall@10 | compression | mean cosine bias |
|---|---|---|---|
| 1-bit | 0.308 | ×ばつ | +0.0005 |
| 2-bit | 0.561 | ×ばつ | +0.0001 |
| 4-bit | 0.879 | ×ばつ | −0.0000 |
Recall rises monotonically with bit-width; near-zero bias confirms the unbiased estimator. 16 unit + 1 doc-test pass; clippy clean. Reproduce with cargo run --release -p ruvector-turbovec.
Proposed path
Tracked by ADR-194. Incremental milestones:
- M1 — scalar reference (this PR): rotation reuse + Lloyd–Max 2/4-bit SQ + TQ+ calibration + unbiased scoring +
IdMapIndex(O(1) delete, filtered search). ✅ implemented & validated. - M2 — FastScan nibble-LUT SIMD kernel (AVX2 + NEON), fuzzed bit-identical to the scalar oracle.
- M3 —
.tvpersistence. - M4 — AVX-512BW kernel +
ruvector-rulakedispatcher registration.
The scalar M1 scorer is deliberately the determinism oracle the SIMD kernels must match bit-for-bit, so the perf work lands without risking correctness.
Ask
Adopt ADR-194 and land M1 (#521) as the foundation, then iterate on the SIMD milestones. Feedback welcome on the reuse boundary (rabitq trait/rotation), the ADR-193 IVF composition, and whether the Lloyd–Max math should be hoisted into a shared ruvector-math crate rather than living in the codec.