| Operation |
Tool |
Time |
Size |
| Encode |
STRIDE |
0.173s |
96MB |
| Encode |
zstd -1 |
0.240s |
39MB |
| Encode |
zstd -9 |
2.146s |
31MB |
| Decode |
STRIDE |
0.089s |
100MB |
| Decode |
zstd -d |
0.125s |
100MB |
STRIDE encode: 28% faster than zstd -1
STRIDE decode: 40% faster than zstd -d
Trade-off: STRIDE does not compress. Use zstd for compression. Use STRIDE for deterministic container I/O.
Proof with SHA256 verification: proof/enwik8_benchmark.txt
V1 benchmark proof: proof/v1_benchmark.txt
Before → After
Before (glyph-v8, 3 months abandoned):
• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew
After (STRIDE v0):
• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
• GitHub Actions CI — tests pass on every push
Architecture
RAW CORPUS
↓
STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]
↓
Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch
↓
Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding
↓
STRIDE v1 ✅: container-write (575 MB/s) + container-decode (1,053 MB/s)
container-compare --fast → HeaderSketch similarity in 7s (vs 150s full mode)
What Makes STRIDE Different
| grep |
zstd |
Elasticsearch |
STRIDE |
| Field-aware |
❌ |
❌ |
❌ |
✅ |
| Per-field entropy model |
❌ |
❌ |
❌ |
✅ |
| Deterministic output |
✅ |
✅ |
❌ |
✅ |
| Schema-aware analysis |
❌ |
❌ |
partial |
✅ |
| SHA256-verified proof |
❌ |
❌ |
❌ |
✅ |
Honest Benchmark Status
STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.
STRIDE v1 shipped. Encoder: 575 MB/s. Decoder: 1,053 MB/s. Round-trip MD5-verified on enwik8 100MB.
Entropy Heatmap
Red = high entropy (hard to compress) | Yellow = moderate | Each cell = 64KB chunk of enwik8
Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.
This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.
How GitHub Copilot Helped
The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:
• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission
Without Copilot the gap between "abandoned prototype" and "installable system with proof" would have taken weeks. It took days.
Project Family
STRIDE is the third primitive in a deterministic systems family:
ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.
GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.
STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.
Same philosophy across all three: deterministic, exact, measurable.
What’s Next
• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI
Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.