tridec

Badge honesty: CI is CPU-only (ubuntu + macos arm64; the macos lane binds the strict exact-count receipt gates). There are no GPU runners — the CUDA/ROCm kernel paths are validated by the carried H200/MI300X receipts in bench/receipts/, and the experimental Metal tier runs on a local machine.

An open, vendor-portable GPU decoder library for quantum LDPC codes — Triton min-sum BP and Relay-BP decoders that consume any stim DetectorErrorModel or raw parity-check matrices, with CPU reference implementations, validated against the standard CPU references (ldpc, relay-bp), running on NVIDIA (CUDA) and AMD (ROCm) GPUs.

The same Triton kernels run unmodified on both vendors: the Relay-BP kernel reproduces its logical-error-rate validation numbers identically on an NVIDIA H200 (CUDA 12.4, triton 3.0) and an AMD MI300X (ROCm 7.0, triton 3.4) — see docs/benchmark.md and the raw receipts in bench/receipts/. Validated scope is NVIDIA + AMD; Apple silicon runs the same kernels through triton-metal as an experimental backend (see below).

v0.2: the megakernel backend (opt-in)

A single-launch persistent megakernel — the entire Relay-BP decode (every BP iteration, every relay leg, in-kernel syndrome convergence + nconv stop + lowest-weight selection) in one kernel launch per decode_batch, with per-shot early exit, instead of the v0.1 host loop's thousands of launches. Validated on all three platforms against the v0.1 two-kernel path and the relay-bp Rust oracle (14/14 gates at BLOCK 128/256 on CUDA and ROCm; barriers verified honored on both):

| Relay-BP megakernel vs v0.1 two-kernel | Speedup |
|---|---|
| Apple M4 Max (Metal, triton-metal) | ≈197×: 30.0 s → 0.152 s / 2000 shots (relay BLOCK=256, num_warps=8) |
| NVIDIA H200 (CUDA) | …× fp32 (fp64 to …× mid-batch); batch-1 62.5 → 3.44 ms (≈18×); 34.6 μs/syn @8192 |
| AMD MI300X (ROCm) | …×; batch-1 8.48 ms; 46.0 μs/syn @8192 |

(Speedups are relative to each platform's own v0.1 two-kernel path and vary with batch size; the figures are the min–max across batch 1–16384. Absolute cross-vendor performance, where H200 leads, is in the limits below.) Receipt stacks: H200 (CUDA 12.4 / triton 3.0), MI300X (ROCm 6.2 / torch 2.5.1 / triton 3.1, gfx942), M4 Max (triton-metal, CODEGEN_VERSION 2026-06-13); raw receipts are in bench/receipts/megakernel_*, and the Metal block-lift re-measure is in megakernel_metal_lift.{md,json}. The Metal 30.0 s baseline and the 31 s in the v0.1 Apple section below are independent measurement runs of the same two-kernel relay (run-to-run jitter), not a discrepancy.

Auto-dispatch (v0.2.1): tridec.from_dem(..., algorithm="relay") / RelayBpDecoder now use the megakernel by default on GPU, because relay wins decisively there; pass megakernel=False for the v0.1 two-kernel host loop. The path is GPU-gated by construction (RelayBpDecoder only accepts the triton/metal backends, never CPU). BP keeps the two-kernel default: BpMegaTriton stays opt-in via tridec.backends.megakernel, because the plain-BP megakernel is a single-shot latency tool that loses at batch throughput (#5). Receipts: bench/receipts/megakernel_{h200,mi300x,metal}*.
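The defaults described above can be summarised as a small decision function. This is only a model of the documented behaviour, not tridec's code: `plan_dispatch`, `Plan` and `GPU_BACKENDS` are hypothetical names, and the `megakernel` flag on the BP branch is a stand-in for the real opt-in, which is the `BpMegaTriton` class.

```python
from dataclasses import dataclass

# Backends the README says RelayBpDecoder accepts (never CPU).
GPU_BACKENDS = {"triton", "metal"}


@dataclass(frozen=True)
class Plan:
    algorithm: str
    backend: str
    megakernel: bool


def plan_dispatch(algorithm, backend, megakernel=None):
    """Hypothetical model of the documented defaults; not tridec's implementation."""
    if algorithm == "relay":
        if backend not in GPU_BACKENDS:
            raise ValueError("Relay-BP is GPU-only (triton/metal backends)")
        # Relay-BP: megakernel unless the caller opts out with megakernel=False.
        use_mega = True if megakernel is None else bool(megakernel)
    elif algorithm == "bp":
        # Plain BP: two-kernel path unless the caller explicitly opts in.
        use_mega = False if megakernel is None else bool(megakernel)
        if use_mega and backend not in GPU_BACKENDS:
            raise ValueError("the BP megakernel needs a GPU backend")
    else:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return Plan(algorithm, backend, use_mega)


print(plan_dispatch("relay", "triton"))
print(plan_dispatch("relay", "metal", megakernel=False))
print(plan_dispatch("bp", "torch"))
```

The asymmetry is the point: relay defaults to the megakernel, plain BP defaults away from it.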

Megakernel: honest limits + tuning

- Plain-BP megakernel is a single-shot latency tool, not a throughput tool. At batch-1 it is ≈1.7× faster than the two-kernel BP path (H200 0.61 vs 1.06 ms); at large batch it loses (plain BP has no early-exit lever), so the two-kernel BP path stays the throughput default. Use BpMegaTriton for low-latency bare BP, RelayBpMegaTriton for the accurate latency path.
- Real-time / single-shot: H200 leads MI300X by ≈2.5× at batch-1 (3.44 vs 8.48 ms). That is wider than v0.1's ~9% two-kernel gap, because the single-CTA-per-shot design amplifies per-SM and codegen differences at batch-1. Batched, the gap is ≈1.3× at batch 8192 (34.6 vs 46.0 μs/syn); correctness is identical across vendors. (The pitch is vendor-portable + performant on both, never parity.)
- Per-arch autotuning. v0.2 ships autotuned BLOCK/num_warps configs for H200, MI300X and M4 Max, pinned in _CUDA_TUNED keyed by gcnArchName/device name. AMD (wavefront-64) wants the opposite shape from NVIDIA warps: low warps for BP, max BLOCK+warps for relay.
- Metal: fully lifted off the old BLOCK=32 pin, with both kernels at BLOCK=256: BP (256) (20 → 12 ms, ≈1.7×) and relay (256, num_warps=8) (441 → 152 ms, ≈2.9×, which yields the ≈197× headline above), relay bit-identical to BLOCK=128. The relay num_warps=8 is load-bearing: it sets num_threads = num_warps × 32 = 256 = BLOCK so each thread handles exactly one element (n = BLOCK/num_threads = 1); at n>1 triton-metal's base path under-covers a BLOCK-wide store and now loudly refuses (MetalNonRecoverableError, never silent-wrong), so the footgun can't bite. Requires triton-metal with the in-loop-reduction + n=1-store fixes (older triton-metal loudly refuses relay@256). fp32-only on Metal (no fp64), same as the two-kernel path, and the fp32 near-tie-flip caveat below applies to the megakernel unchanged. Receipt: bench/receipts/megakernel_metal_lift.md.
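As a quick sanity check, the ratios implied by the timings quoted in this section, and the threads-per-element arithmetic behind the relay num_warps=8 choice, can be recomputed directly. The helper names are illustrative, and the 32 threads per warp is taken from the num_warps × 32 expression in the Metal note.

```python
def speedup(before, after):
    """Ratio of two timings expressed in the same unit."""
    return before / after


# Timings quoted in this README (same unit within each pair).
metal_relay_headline = speedup(30.0, 0.152)  # s / 2000 shots, two-kernel vs megakernel
metal_relay_lift = speedup(441, 152)         # ms, relay after the BLOCK lift
h200_bp_batch1 = speedup(1.06, 0.61)         # ms, two-kernel BP vs BP megakernel
h200_vs_mi300x = speedup(8.48, 3.44)         # ms, batch-1 relay megakernel


def elements_per_thread(block, num_warps, threads_per_warp=32):
    """n = BLOCK / num_threads, with num_threads = num_warps * threads_per_warp."""
    return block / (num_warps * threads_per_warp)


print(f"Metal relay headline: {metal_relay_headline:.0f}x")
print(f"Metal relay block lift: {metal_relay_lift:.1f}x")
print(f"H200 BP batch-1: {h200_bp_batch1:.1f}x")
print(f"H200 vs MI300X batch-1: {h200_vs_mi300x:.1f}x")
print("relay n at BLOCK=256, num_warps=8:", elements_per_thread(256, 8))
```

With num_warps=4 the same BLOCK gives n = 2, which is the n>1 case the Metal note says triton-metal refuses.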

Install

Most users want pip install "tridec[torch,decoders]" (CPU+GPU torch backend plus the reference adapters). The bare install is the numpy CPU reference only — correct but slow.

```bash
pip install tridec               # numpy CPU reference only
pip install "tridec[torch]"      # + batched torch backend (CPU/GPU)
pip install "tridec[gpu]"        # + Triton GPU kernels (CUDA or ROCm)
pip install "tridec[decoders]"   # + ldpc / relay-bp reference adapters
pip install "tridec[sinter]"     # + sinter.collect integration
```

Quickstart

```python
import stim
import tridec

circuit = stim.Circuit.from_file("memory.stim")
dem = circuit.detector_error_model(decompose_errors=False)
decoder = tridec.from_dem(dem, backend="auto")  # triton > torch > numpy

dets, obs = circuit.compile_detector_sampler(seed=0).sample(
    100_000, separate_observables=True)
pred = decoder.decode_batch(dets)  # (shots, n_obs) bool
print("logical error rate:", (pred != obs).any(axis=1).mean())
```

Raw matrices work too: tridec.from_matrices(H, priors, observables=Lo). Relay-BP: tridec.from_dem(dem, algorithm="relay") (Triton kernels only).
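To make the raw-matrix inputs concrete, here is a minimal pure-Python sketch of syndrome-based min-sum BP on a parity-check matrix H with per-error priors. It is a textbook version for illustration, not tridec's kernel or its numpy reference: the list-of-lists layout, the `min_sum_bp` name, the flooding schedule and the `LLR_CLIP` bound are all assumptions, and observables are left out.

```python
import math

LLR_CLIP = 50.0  # assumed bound on message magnitude; keeps every message finite


def min_sum_bp(H, priors, syndrome, max_iter=20):
    """Syndrome-based min-sum BP on a dense 0/1 matrix. Returns (guess, converged)."""
    m, n = len(H), len(H[0])
    checks = [[v for v in range(n) if H[c][v]] for c in range(m)]
    prior_llr = [math.log((1 - p) / p) for p in priors]
    # Variable-to-check messages start at the prior LLRs.
    v2c = {(c, v): prior_llr[v] for c in range(m) for v in checks[c]}
    c2v = {}
    guess = [0] * n
    for _ in range(max_iter):
        # Check update: sign product and minimum magnitude over the other edges.
        for c in range(m):
            for v in checks[c]:
                sign = -1.0 if syndrome[c] else 1.0
                mag = LLR_CLIP
                for u in checks[c]:
                    if u == v:
                        continue
                    msg = v2c[(c, u)]
                    if msg < 0:
                        sign = -sign
                    mag = min(mag, abs(msg))
                c2v[(c, v)] = sign * mag
        # Posterior LLR per variable, then hard decision (negative means "error").
        post = list(prior_llr)
        for (c, v), msg in c2v.items():
            post[v] += msg
        guess = [1 if x < 0 else 0 for x in post]
        # Stop as soon as the guess reproduces the syndrome.
        if all(sum(guess[v] for v in checks[c]) % 2 == syndrome[c] for c in range(m)):
            return guess, True
        # Variable update: posterior minus the message that arrived on this edge.
        for key in c2v:
            v2c[key] = post[key[1]] - c2v[key]
    return guess, False


H = [[1, 1, 0],
     [0, 1, 1]]
print(min_sum_bp(H, [0.1, 0.1, 0.1], [1, 0]))
```

The example decodes a single flipped first bit on a three-bit chain; a real decode would additionally map the estimated error through the observables matrix.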

With sinter (the [sinter] extra):

```python
import sinter
from tridec.sinter import sinter_decoders

# `tasks` is your own list of sinter.Task objects.
stats = sinter.collect(
    num_workers=4, tasks=tasks,
    decoders=["tridec_bp", "pymatching"],
    custom_decoders=sinter_decoders(),
    max_shots=1_000_000)
```

Backend × algorithm matrix (honest availability)

| Algorithm | `numpy` | `torch` | `triton` | `metal` (experimental) |
|---|---|---|---|---|
| min-sum BP | yes (CPU reference) | yes (CPU + CUDA/ROCm) | yes (CUDA + ROCm) | yes (fp32) |
| Relay-BP | no | no | yes (CUDA + ROCm) | yes (fp32, slow; see below) |

There is no in-package CPU Relay-BP; its CPU reference is IBM's relay-bp Rust decoder, wrapped in tridec.adapters and used as the validation oracle for the Triton path.

What's validated where

| Environment | Status |
|---|---|
| CPU (any) | numpy BP reference; torch BP bit-identical to numpy at fp64 (one iteration), LER-identical full decode |
| NVIDIA H200, CUDA 12.4, torch 2.4.1, triton 3.0.0 | Triton BP: ≥99.5% hard-decision agreement vs fp64 references, LER-identical (156 = 156 = 156 fails / 2000 shots vs numpy/torch). Triton Relay-BP: LER-matches the `relay-bp` Rust oracle (31 vs 38 fails / 2000, overlapping Wilson CIs); carried source-repo receipts |
| AMD MI300X, ROCm 7.0.0, torch 2.9, triton 3.4.0 | Same kernels, unmodified: identical primitive-identity numbers (pre-leg posterior max-diff 1.8e-15) and the same oracle-vs-Triton LER identity (carried receipts). Also validated through the installed package for v0.1.0 (`bench/receipts/mi300x_packaged.json`): full suite 88 passed / 10 skipped on gfx942 (GPU tiers bind, darwin-only strict tiers skip), packaged-API BP 166 = numpy 166 fails / 2000, Relay-BP fp32 34 vs Rust oracle 31 (overlapping CIs), throughput within ±2.2% of the carried receipt |
| Apple silicon (M4 Max), triton-metal | Experimental, spike-validated only (`bench/receipts/metal_spike.md`): both kernels pass the same correctness gates at fp32; see the section below |

This table covers the two-kernel BP/Relay-BP path (v0.1). The v0.2 megakernel's own per-platform validation (14/14 gates on CUDA + …× on Metal at the lifted BLOCK=256 blocks: BP (256), relay (256, 8)) is in the megakernel section above, with raw receipts in bench/receipts/megakernel_*.
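The "overlapping Wilson CIs" criterion used in the table can be illustrated with the standard Wilson score interval. This is a sketch under assumptions: the 95% level (z = 1.96) and the simple overlap rule are illustrative choices, and the actual gates in tridec.validation may differ. The counts are the 31 and 38 fails / 2000 quoted for the H200 Relay-BP row.

```python
import math


def wilson_interval(fails, shots, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = fails / shots
    denom = 1 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots + z * z / (4 * shots * shots)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


run_a = wilson_interval(31, 2000)
run_b = wilson_interval(38, 2000)
print("run A:", run_a)
print("run B:", run_b)
print("overlap:", intervals_overlap(run_a, run_b))
```

Because the interval width shrinks as the shot count grows, the same absolute gap in fail counts can pass at 2000 shots and fail at a much larger sample.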

Experimental: Apple silicon (Metal)

The same Triton kernels run on Apple-silicon GPUs through triton-metal, with zero changes to the kernel source. This is experimental: validated at spike level on one machine (M4 Max), fp32 only (Metal has no fp64), and not part of the official receipt set.

```bash
# triton-metal + a triton >= 3.6 build + torch must be importable, then:
pip install tridec
python -c "import tridec; print(tridec.available_backends())"  # ['metal', ...]
```

backend="auto" detects the triton-metal environment (darwin, triton + triton_metal importable, no CUDA/ROCm device) and selects "metal"; backend="triton" resolves to "metal" there too, and backend="metal" asserts the environment is present. The execution pattern is triton-metal's documented one — CPU torch tensors (zero-copy via unified memory; not mps) — so no device arguments are needed.
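The resolution rule described above can be sketched with the standard library's import probing. This is a hypothetical model, not tridec's detection code: `pick_backend` and its parameters are invented for illustration, and whether a CUDA/ROCm device is present is passed in as `has_gpu` rather than detected (in practice it could come from a check such as torch.cuda.is_available()).

```python
import importlib.util
import sys


def importable(name):
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(has_module=importable, platform=sys.platform, has_gpu=False):
    """Hypothetical "auto" order: metal or triton, then torch, then numpy."""
    if has_module("triton"):
        # Metal needs darwin, triton_metal importable, and no CUDA/ROCm device.
        if platform == "darwin" and has_module("triton_metal") and not has_gpu:
            return "metal"
        if has_gpu:
            return "triton"
    if has_module("torch"):
        return "torch"
    return "numpy"


print(pick_backend())
```

Injecting `has_module` and `platform` keeps the rule testable on machines that have none of the GPU stacks installed.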

What the spike measured (2000 canonical shots, seed 0, M4 Max — re-validated through this API path in tests/test_metal.py):

- min-sum BP (fp32): all correctness gates pass. One-iteration hard agreement is 1.000 vs the fp64 numpy reference on both the surface-code and BB-code fixtures; LER 76 = 76 / 2000 (surface) and 167 vs 168 / 2000 (BB). Batched decode of 2000 shots takes 28 ms (surface) / 167 ms (BB), …× the per-shot numpy baseline on the same machine.
- Relay-BP (fp32): correct but slow. LER matches the relay-bp Rust oracle (31 vs 39 fails / 2000, per-shot agreement 99.3%), but decode_batch(2000) takes 31 s vs 1.26 s for the Rust CPU oracle: relay's per-iteration host loop (~7k small kernel launches) is launch-overhead dominated on Metal. Use it for validation, not production.
- Relay-BP on metal enforces fp32: dtype="float64" raises with a clear error; the default resolves to float32.

No claims beyond the spike: no official LER receipts, no cross-machine validation, no performance tuning. CUDA/ROCm remain the supported GPU paths.

Compatibility floors are in pyproject.toml; known-good pins: stim 1.15.0, ldpc 2.4.1, relay-bp 0.2.2, torch 2.4.1 / 2.9, triton 3.0 / 3.4.

Validation discipline

tridec.validation ships the matched-protocol harness the numbers were produced with: dem_hash (sha256 of the DEM's canonical bytes), run_matched (one shared DEM, one shot set, fail-fast DEM-identity and tie-break gates), Wilson/TOST statistics and a paired per-shot gap-to-MLE bootstrap. The test suite pins the extraction byte-for-byte: 8 canonical BB-code fixture circuits must hash to the exact DEM sha256s recorded in the carried zoo_grid.json receipt, and a full 16,667-shot cell must reproduce the recorded logical-failure counts of the ldpc reference adapters exactly.
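The DEM-identity gate can be pictured as hashing a canonical serialisation and failing fast on a mismatch. The sketch below is illustrative only: the README does not say what "canonical bytes" means, so whitespace-normalised UTF-8 text is an assumption here, and `canonical_bytes`, `text_sha256` and `check_identity` are hypothetical names rather than the tridec.validation API.

```python
import hashlib


def canonical_bytes(dem_text):
    """Assumed canonical form: whitespace-normalised, non-empty lines, UTF-8."""
    lines = (" ".join(line.split()) for line in dem_text.splitlines())
    return "\n".join(line for line in lines if line).encode("utf-8")


def text_sha256(dem_text):
    """Hex sha256 of the assumed canonical bytes."""
    return hashlib.sha256(canonical_bytes(dem_text)).hexdigest()


def check_identity(dem_text, expected_hex):
    """Fail fast if the model text does not hash to the recorded value."""
    actual = text_sha256(dem_text)
    if actual != expected_hex:
        raise ValueError(f"DEM hash mismatch: {actual} != {expected_hex}")


recorded = text_sha256("error(0.01) D0 D1\nerror(0.01) D1 L0\n")
# Same model, different whitespace: passes the gate.
check_identity("  error(0.01)   D0 D1\n\nerror(0.01) D1 L0", recorded)
print(recorded)
```

Pinning the hash before comparing decoders ensures every decoder in a matched run sees exactly the same error model.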

For v0.1.0 the WHOLE grid was re-decoded in the receipt environment (bench/full_grid_noregression.py): 31 of 32 (cell, decoder) failure counts reproduce exactly — all 24 BP / BP-OSD-0 / BP-OSD-10 counts, and 7 of 8 BPLSD counts. The single deviation (BPLSD, p=0.002/X: 879 vs 880, one shot in 200,000) is attributed by a same-environment repeat experiment to run-to-run nondeterminism inside ldpc's BpLsdDecoder itself (identical shots, fresh instances, 5 repeats: 880/880/879/880/880 — the same single shot flips) — documented in bench/receipts/full_grid_noregression.json.

Status

0.2.1: Relay-BP auto-dispatch. from_dem(..., algorithm="relay") / RelayBpDecoder use the megakernel by default on GPU (megakernel=False opts back to the two-kernel host loop); GPU-gated by construction. Validated on all three platforms: Metal (M4 Max), NVIDIA H200 (CUDA), AMD MI300X (ROCm/gfx942). Also: the statistical-tier validation gates now use a sample-size-aware Wilson-CI overlap test (#1). 0.2.0 added the megakernel backend (tri-platform validated; see above). v0.1.0 shipped the two-kernel BP/Relay-BP path + the validation discipline. The kernels and their receipts are stable; the public API surface is young and may still move before 1.0: minor 0.x releases may rename or remove public API, and 1.0 will lock the surface. GPU paths require triton and a CUDA/ROCm GPU (or the experimental triton-metal environment); the GPU/metal test tiers skip cleanly where unavailable.

License

Apache-2.0.
