@tac0turtle
Goal
Expand the Evolve simulator into a comprehensive testing and validation tool that runs in CI on every PR. The finished state: CI catches non-determinism, fuzzes critical paths, and tracks performance regressions automatically.
Current State
- Simulator (`evolve_simulator`): Seed-based determinism, fault injection, time simulation, basic metrics/reporting. Used in testapp integration tests.
- Fuzzing: Limited to tx encoding (`fuzz_decode`, `fuzz_roundtrip`, `fuzz_structured`) in `crates/app/tx/fuzz/`. Requires `cargo +nightly fuzz` — not integrated into CI.
- CI (`rust.yml`): Runs `cargo test --workspace` on every PR. Long simulation tests exist but are manual-only (workflow_dispatch).
- No non-determinism detection: No automated check that the same seed produces identical state across runs.
Scope
1. Non-Determinism Detection
- Dual-execution oracle: Run the same simulation seed twice and assert identical state hashes at every block. This is the core invariant — if it ever fails, consensus is broken.
- Cross-platform determinism check: Ensure state hashes match between macOS and Linux (CI runs Linux, devs run macOS). May require pinning float ops or auditing platform-dependent behavior.
- Iteration order audit: Automated check that no `HashMap`/`HashSet` usage leaks into STF execution paths (beyond the existing clippy lint — runtime verification).
- Time source audit: Verify no `SystemTime`/`Instant` usage reaches STF execution. The simulator's `SimulatedTime` should be the only time source during execution.
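The dual-execution oracle above can be sketched in a few lines. This is a minimal, self-contained illustration, not the evolve_simulator API: `Simulation`, its xorshift-based `apply_block`, and `dual_execution_check` are all hypothetical stand-ins for the real simulator and its per-block state hashes.

```rust
// Minimal sketch of a dual-execution oracle. All types here are hypothetical
// stand-ins for the real evolve_simulator API.

/// Toy deterministic simulation: state evolves from a seed via xorshift64.
struct Simulation {
    state: u64,
}

impl Simulation {
    fn new(seed: u64) -> Self {
        Simulation { state: seed.max(1) } // xorshift must not start at 0
    }

    /// Apply one "block" and return the resulting state hash.
    fn apply_block(&mut self) -> u64 {
        // xorshift64 step stands in for real block execution
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        self.state // in the real simulator: a state commitment/hash
    }
}

/// Run the same seed twice and assert identical hashes at every block,
/// reporting the first divergent height on failure.
fn dual_execution_check(seed: u64, blocks: usize) -> Result<(), usize> {
    let (mut a, mut b) = (Simulation::new(seed), Simulation::new(seed));
    for height in 0..blocks {
        if a.apply_block() != b.apply_block() {
            return Err(height);
        }
    }
    Ok(())
}

fn main() {
    match dual_execution_check(42, 100) {
        Ok(()) => println!("deterministic over 100 blocks"),
        Err(h) => println!("non-determinism detected at block {h}"),
    }
}
```

Reporting the first divergent height (rather than a bare pass/fail) is what makes the failure actionable: it points directly at the block whose execution introduced the divergence.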
2. Expanded Fuzzing
- STF fuzzing: Fuzz the full `apply_block` path with randomly generated blocks (random tx ordering, random payloads, malformed inputs). Assert no panics, no state corruption.
- Storage layer fuzzing: Fuzz the storage backend with random key/value operations. Verify state hash consistency after commit.
- Account execution fuzzing: Generate random sequences of exec/query calls against accounts. Verify error handling (no panics, proper error codes).
- Mempool fuzzing: Fuzz transaction insertion, eviction, and ordering under concurrent load.
- Corpus management: Maintain and expand fuzz corpus with interesting inputs found during runs. Store corpus in CI cache.
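The core "no panics on arbitrary input" invariant can be sketched with a std-only seeded loop; real runs would go through cargo-fuzz or bolero, which also handle corpus management. `decode_tx` is a hypothetical stand-in for the decoder under test, and the xorshift generator exists only to make every input reproducible from its seed.

```rust
// Std-only sketch of a seeded fuzz loop for a decode path. `decode_tx` is
// hypothetical; real fuzzing would use cargo-fuzz or bolero instead.

use std::panic;

/// Hypothetical transaction decoder: must return an error, never panic.
fn decode_tx(bytes: &[u8]) -> Result<(), &'static str> {
    if bytes.len() < 4 {
        return Err("too short");
    }
    // ... real decoding would happen here ...
    Ok(())
}

/// Tiny xorshift generator so every fuzz input is reproducible from its seed.
fn random_bytes(seed: &mut u64, len: usize) -> Vec<u8> {
    (0..len)
        .map(|_| {
            *seed ^= *seed << 13;
            *seed ^= *seed >> 7;
            *seed ^= *seed << 17;
            (*seed & 0xff) as u8
        })
        .collect()
}

/// Returns the number of inputs that caused a panic (should be zero).
fn fuzz_decode(seed: u64, iterations: usize) -> usize {
    let mut s = seed.max(1);
    let mut panics = 0;
    for i in 0..iterations {
        let input = random_bytes(&mut s, i % 64);
        let result = panic::catch_unwind(|| {
            let _ = decode_tx(&input); // errors are fine; panics are bugs
        });
        if result.is_err() {
            panics += 1; // reproduce with: seed + iteration i
        }
    }
    panics
}

fn main() {
    println!("panics: {}", fuzz_decode(0xdeadbeef, 1000));
}
```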
3. Performance Testing & Regression Detection
- Benchmark baseline: Establish criterion benchmarks for key paths (block execution, tx processing, storage read/write, state hashing).
- Simulator performance report: Extend `PerformanceReport` with p50/p95/p99 latencies, throughput (tx/s, blocks/s), and memory high-water mark.
- Regression detection: Compare benchmark results against a baseline (stored as an artifact or in-repo). Fail CI or post a warning comment if performance degrades beyond a threshold (e.g., >10%).
- Stress test profile: Standardized stress test config (high block count, high tx volume, fault injection) that runs on a schedule (nightly or weekly).
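The percentile/throughput computation a `PerformanceReport` extension might use can be sketched over collected `Duration`s. The nearest-rank percentile method here is an assumption, not an established project choice, and `percentile` is a hypothetical helper.

```rust
// Sketch of p50/p95/p99 and throughput computation over per-tx latencies.
// Nearest-rank percentiles are an assumption, not the project's method.

use std::time::Duration;

/// Nearest-rank percentile over a sorted slice (p in 0..=100).
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    assert!(!sorted.is_empty());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // Synthetic latencies: 1ms..=100ms, one per tx.
    let mut latencies: Vec<Duration> =
        (1..=100).map(Duration::from_millis).collect();
    latencies.sort();

    println!("p50 = {:?}", percentile(&latencies, 50.0));
    println!("p95 = {:?}", percentile(&latencies, 95.0));
    println!("p99 = {:?}", percentile(&latencies, 99.0));

    // Throughput: txs per second over the run's total execution time.
    let total: Duration = latencies.iter().sum();
    let tps = latencies.len() as f64 / total.as_secs_f64();
    println!("throughput = {tps:.1} tx/s");
}
```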
4. CI Integration
- Simulation tests on every PR: Run a short simulation suite (e.g., 100 blocks, 3 seeds) as part of the standard test job. Must complete in <5 min.
- Non-determinism check on every PR: Dual-execution with at least 2 seeds. Fast — just comparing hashes.
- Nightly fuzzing job: Run cargo-fuzz (or bolero/proptest long runs) for extended duration (30–60 min). Report findings as issues.
- Nightly performance run: Run benchmarks + long simulation, store results as artifacts, compare against baseline.
- Seed rotation: Each CI run uses a mix of fixed seeds (regression) and random seeds (exploration). Failed random seeds are logged for reproduction.
- Failure reproduction: On any simulation failure, CI output includes the exact `just sim-seed <seed>` command to reproduce locally.
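The seed-rotation and failure-reproduction behavior above can be sketched as a small CI driver. `run_simulation` is a hypothetical stand-in for the real entry point, and the fixed seed list is illustrative.

```rust
// Sketch of seed rotation: fixed regression seeds plus fresh random seeds,
// printing the exact reproduction command on failure. `run_simulation` and
// the fixed seed values are hypothetical.

use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical simulation run: returns Err on an invariant violation.
fn run_simulation(seed: u64) -> Result<(), String> {
    let _ = seed;
    Ok(()) // the real simulator would execute blocks here
}

/// Fixed seeds catch regressions; random seeds explore new behavior.
fn ci_seeds(random_count: usize) -> Vec<u64> {
    let mut seeds = vec![1, 42, 1337];
    let mut entropy = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64;
    for _ in 0..random_count {
        entropy ^= entropy << 13;
        entropy ^= entropy >> 7;
        entropy ^= entropy << 17;
        seeds.push(entropy);
    }
    seeds
}

fn main() {
    let mut failed = false;
    for seed in ci_seeds(2) {
        if let Err(e) = run_simulation(seed) {
            // The reproduction command is the contract: copy-paste locally.
            eprintln!("simulation failed ({e}); reproduce with: just sim-seed {seed}");
            failed = true;
        }
    }
    if !failed {
        println!("all seeds passed");
    }
}
```

Logging the failing random seed alongside a copy-pasteable command is what turns exploration failures into fixed regression seeds for later runs.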
5. Simulator Enhancements
- Transaction generators: Configurable random transaction generators for simulation (valid txs, invalid txs, edge-case txs). Currently manual — should be built into the simulator.
- Scenario DSL or config: Define simulation scenarios (e.g., "normal load for 50 blocks, then spike to 10x, then fault injection") as config rather than code.
- Shrinking on failure: When a simulation fails, automatically try to find the minimal reproducing seed/block sequence (inspired by proptest shrinking).
- Coverage tracking: Integrate with coverage tools to measure what % of STF/account code paths are exercised by simulation.
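Shrinking on failure can be sketched as a binary search over the block-count prefix, assuming the failure is monotone (once a prefix fails, longer runs keep failing). `fails` is a hypothetical oracle with a planted bug at block 73; real shrinking would also shrink the tx sequence, not just the run length.

```rust
// Sketch of failure shrinking: binary-search the smallest block count that
// still reproduces a failure. `fails` is a hypothetical oracle.

/// Hypothetical failure oracle: does running `blocks` blocks of this seed
/// fail? Monotone for this sketch: once failing, longer runs keep failing.
fn fails(seed: u64, blocks: usize) -> bool {
    let _ = seed;
    blocks >= 73 // pretend the bug first triggers at block 73
}

/// Smallest block count in 1..=max that still fails, assuming monotonicity.
fn shrink(seed: u64, max: usize) -> Option<usize> {
    if !fails(seed, max) {
        return None; // nothing to shrink: the full run passes
    }
    let (mut lo, mut hi) = (1, max); // invariant: fails(seed, hi) is true
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if fails(seed, mid) {
            hi = mid; // failure reproduces with a shorter prefix
        } else {
            lo = mid + 1; // prefix too short to trigger the bug
        }
    }
    Some(hi)
}

fn main() {
    match shrink(42, 1000) {
        Some(n) => println!("minimal failing run: {n} blocks"),
        None => println!("no failure up to 1000 blocks"),
    }
}
```

Binary search only works when failures are monotone in run length; non-monotone failures (e.g., fault-injection timing) would need the greedy delta-debugging style used by proptest shrinking.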
Success Criteria
- Every PR runs simulation tests (short suite, <5 min) + non-determinism dual-execution check.
- Nightly CI job fuzzes STF, storage, and mempool for 30+ min and files issues on findings.
- Performance benchmarks run nightly with regression detection — degradations >10% are flagged.
- A failing simulation always prints its reproduction command.
- Zero known non-determinism sources in the STF execution path.
Implementation Notes
- Start with non-determinism detection (highest value, lowest effort) — dual-execution is just "run twice, compare hashes."
- For fuzzing, consider
boleroas it works with both libfuzzer and proptest backends, avoiding the nightly-onlycargo-fuzzlimitation. - Performance baselines can use GitHub Actions artifacts or
git notesfor storage. - Keep CI wall time in check — simulation and fuzzing are useless if they make PRs slow. Short suite on PR, long suite on nightly.