Makeph/honest-backtest

Au2fut — CME Micro Futures Prop Edge Harness

Honest backtest — prove an edge out-of-sample or kill it

Research only. No live trading, no execution code. This phase answers one question honestly: does a tradeable edge survive real futures costs and a prop firm's drawdown rules? Built on the discipline of the au2 / Au2qwen edge investigation (see docs/METHODOLOGY.md).

Why

au2 proved on BTC that a prop challenge is −EV by construction when no edge clears costs. Au2fut re-asks the question on CME micro futures (MES/MNQ/MGC) and futures prop firms (Topstep/Apex), where costs are a few $ per contract and equity indices carry real trend — but it refuses to build any trading infra until the edge is proven net of costs, out-of-sample and forward.

Install

pip install -r requirements.txt
cp .env.example .env # research-only; no live keys

The pipeline

data/fetch.py bars — pluggable source (yahoo|databento|ibkr|csv), cached
 └─ core/instruments.py exact tick value + $ cost model (env-overridable)
 └─ diagnostics/edge_scan.py breakout net $/contract verdict
 └─ diagnostics/mr_session.py session mean-reversion (pre-registered, OOS)
 └─ diagnostics/oos_validate.py train/test + walk-forward OOS
 └─ core/prop_rules.py Topstep/Apex trailing-DD machine
 └─ diagnostics/prop_mc.py P(pass), EV (--strategy mr|breakout)
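As a rough illustration of what an "exact tick value + $ cost model" involves, the sketch below converts a price move into net dollars per trade. The tick sizes and tick values are the published CME micro contract specs; the function name, the per-side slippage convention, and the default commission are assumptions for illustration, not the contents of core/instruments.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroSpec:
    tick_size: float   # minimum price increment, in price units
    tick_value: float  # dollars per tick, per contract


# CME micro contract specs: MES $1.25/tick, MNQ $0.50/tick, MGC $1.00/tick.
SPECS = {
    "MES": MicroSpec(0.25, 1.25),
    "MNQ": MicroSpec(0.25, 0.50),
    "MGC": MicroSpec(0.10, 1.00),
}


def net_pnl(symbol, entry, exit_price, side, contracts=1,
            slip_ticks=1.0, rt_commission=None):
    """Net dollars for one round trip. side is +1 for long, -1 for short.

    slip_ticks is charged once on entry and once on exit (an assumed
    convention). rt_commission is the round-trip fee per contract; when
    omitted it is read from FUT_RT_COMMISSION, with a placeholder default.
    """
    spec = SPECS[symbol]
    if rt_commission is None:
        # The 1.50 fallback is a hypothetical placeholder, not a real fee.
        rt_commission = float(os.environ.get("FUT_RT_COMMISSION", "1.50"))
    ticks = (exit_price - entry) / spec.tick_size * side
    gross = ticks * spec.tick_value
    slippage = 2 * slip_ticks * spec.tick_value
    return contracts * (gross - slippage - rt_commission)
```

For example, a 4-point long on one MES contract is 16 ticks, or $20 gross; with 1 tick of slippage per side and a $1.50 round-trip fee, that leaves $16 net.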

Data sources (set FUT_DATA_SOURCE)

| Source | Depth | Setup |
| --- | --- | --- |
| yahoo (default) | intraday ~60d (tiny OOS) | none |
| databento | CME Globex minute/tick, multi-year | pip install databento, DATABENTO_API_KEY |
| ibkr | IBKR historical (pacing-limited) | pip install ib_async, TWS/Gateway running |
| csv | whatever you export | `data/csv/<SYM>_<interval>.csv` or FUT_CSV_PATH |

Sub-hourly bars (5m/15m/30m) for databento/ibkr are aggregated from 1m, aligned to midnight UTC so RTH filtering stays correct. All diagnostics are source-agnostic — switching FUT_DATA_SOURCE changes nothing in the strategy code. The whole point: re-run the MR OOS verdict on years of multi-regime data the moment you plug in Databento/IBKR.
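The midnight-UTC alignment can be sketched with integer arithmetic on Unix timestamps: the epoch starts at 00:00 UTC, so flooring a timestamp to a multiple of the bar width lands on a midnight-aligned boundary whenever the width divides a day evenly. The function name and the tuple layout below are hypothetical; this is not the code in data/fetch.py.

```python
from datetime import datetime, timezone


def aggregate_bars(bars_1m, minutes):
    """Aggregate 1-minute bars into `minutes`-wide bars aligned to 00:00 UTC.

    bars_1m: iterable of (ts, open, high, low, close, volume) tuples, where
    ts is a timezone-aware datetime and the bars are sorted ascending.
    Returns a list of tuples in the same layout, stamped with the bar open.
    """
    if 1440 % minutes != 0:
        raise ValueError("bar width must divide a 24h day to stay aligned")
    width = minutes * 60
    out = []
    current_key = None
    for ts, o, h, l, c, v in bars_1m:
        # Floor the epoch second to the start of its bucket.
        key = int(ts.timestamp()) // width * width
        if key != current_key:
            # First 1m bar of a new bucket opens the aggregated bar.
            out.append([datetime.fromtimestamp(key, tz=timezone.utc), o, h, l, c, v])
            current_key = key
        else:
            bar = out[-1]
            bar[2] = max(bar[2], h)  # running high
            bar[3] = min(bar[3], l)  # running low
            bar[4] = c               # last close seen so far
            bar[5] += v              # summed volume
    return [tuple(bar) for bar in out]
```

Because every bucket boundary is a fixed offset from midnight UTC, a 5m bar never straddles the 13:30 or 14:30 UTC cash open, which is what keeps an RTH filter applied after aggregation consistent with one applied to the 1m bars.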

Run it

One command, any strategy, honest out-of-sample verdict (never an in-sample number):

python validate.py edge MES 5m 1y # Donchian breakout
python validate.py mr MES 5m 1y # session mean-reversion
python validate.py spread MES MNQ 5m 1y # cointegrated spread MR

For deep multi-regime data (the only way to trust the verdict), set the source:

# PowerShell: $env:FUT_DATA_SOURCE="databento"; $env:DATABENTO_API_KEY="db-..."
FUT_DATA_SOURCE=databento DATABENTO_API_KEY=db-... python validate.py mr MNQ 5m 1y

Lower-level tools the CLI wraps:

python -m data.fetch MES 1h 60d # sanity-check data
python -m diagnostics.edge_scan MES 1h 60d # in-sample sweep (context)
python -m diagnostics.prop_mc MES 5m 1y topstep_50k --strategy mr --contracts 2
python -m pytest tests/ -q # trust the rule engine

First read — and why it did NOT survive honest OOS

In-sample (edge_scan, whole window) looked encouraging:

| Metric | In-sample result |
| --- | --- |
| MES 1h Donchian | positive across ~22/24 configs, best ~$48/trade, PF 1.54 |
| MNQ 1h Donchian | strongly positive (~$21k/yr/contract best) |
| Topstep 50K P(pass) @ 2 ct (prop_mc) | ~59% |

Then oos_validate.py (select params on train, trade held-out test) deflated it:

| Test | OOS result | Verdict |
| --- | --- | --- |
| MES 1h walk-forward | net_mean −$0.19, PF 1.00 (n=20) | edge gone; in-sample was a fit |
| MNQ 1h walk-forward | net_mean +$75, PF 1.39 (n=18) | weakly positive, too thin |
| MES/MNQ/MGC 1d 2y | n=4–12 per slice, signs flip | noise; inconclusive |
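The "select params on train, trade held-out test" procedure can be sketched as follows. This is a minimal anchored walk-forward (the training window always starts at bar 0 and grows); whether oos_validate.py splits the data exactly this way is an assumption, and `backtest` is a hypothetical callable that returns a list of net dollars per trade for a slice of bars and a parameter set.

```python
def anchored_walk_forward(n_bars, n_folds, min_train):
    """Yield (train_range, test_range) pairs; train always starts at bar 0."""
    test_len = (n_bars - min_train) // n_folds
    if test_len <= 0:
        raise ValueError("not enough bars for the requested folds")
    for k in range(n_folds):
        split = min_train + k * test_len
        yield range(0, split), range(split, split + test_len)


def _mean(xs):
    # Parameter sets that produce no trades rank last during selection.
    return sum(xs) / len(xs) if xs else float("-inf")


def walk_forward_oos(bars, param_grid, backtest, n_folds=4, min_train=None):
    """Pick the best params on each train window, then trade the next test window.

    Only test-window trades are returned, so the result never contains a
    trade that influenced parameter selection.
    """
    if min_train is None:
        min_train = len(bars) // (n_folds + 1)
    oos_trades = []
    for train, test in anchored_walk_forward(len(bars), n_folds, min_train):
        train_bars = bars[train.start:train.stop]
        test_bars = bars[test.start:test.stop]
        best = max(param_grid, key=lambda p: _mean(backtest(train_bars, p)))
        oos_trades.extend(backtest(test_bars, best))
    return oos_trades
```

The design point is that the reported number comes only from `oos_trades`. The train-window score exists solely to choose parameters and is discarded afterwards.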

Pushing Yahoo to its limit (5m/60d ≈ 13k bars) gave a statistically real OOS sample — and it was conclusive:

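| Instrument | TF | OOS n | net_mean | PF | win% |
| --- | --- | --- | --- | --- | --- |
| MES | 5m | 162 | −$7.88 | 0.73 | 31% |
| MNQ | 5m | 160 | −$15.38 | 0.80 | 41% |
| MES | 30m | 33 | −$6.61 | 0.93 | 30% |
| MNQ | 30m | 37 | −$59.51 | 0.73 | 30% |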
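For reference, the columns in these tables can be computed from a list of per-trade net dollar results as in the sketch below. It uses the conventional definition of profit factor (sum of winning trades divided by the absolute sum of losing trades); that the repo computes PF on net-of-cost PnL in exactly this way is an assumption.

```python
def trade_stats(net_pnls):
    """Summarise per-trade net $ results: count, mean, profit factor, win rate."""
    n = len(net_pnls)
    if n == 0:
        raise ValueError("no trades to summarise")
    gross_win = sum(p for p in net_pnls if p > 0)
    gross_loss = -sum(p for p in net_pnls if p < 0)
    wins = sum(1 for p in net_pnls if p > 0)
    return {
        "n": n,
        "net_mean": sum(net_pnls) / n,
        # PF is undefined with no losing trades; report infinity in that case.
        "pf": gross_win / gross_loss if gross_loss > 0 else float("inf"),
        "win_pct": 100.0 * wins / n,
    }
```

A PF below 1.00 and a negative net_mean always go together, since both say that losses outweigh wins in dollar terms. A low win% alone does not imply a loss, which is why the tables report all three.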

Verdict (Donchian breakout, MES/MNQ intraday, this data): no edge. Every TF collapses OOS — TRAIN PF 1.4–19, TEST negative, the textbook overfit signature. The n=162 5m sample is large enough to trust. The cheap futures cost structure did NOT rescue it because the signal itself is non-predictive intraday — the same thing au2 found for the seconds-scale BTC signal. The in-sample 59% prop pass-rate was a complete mirage.

Scope of this verdict: it kills the breakout hypothesis on these instruments on this data — not "no edge of any kind exists." Testing more strategy families is possible but must be done OOS-first / pre-registered to avoid data-mining a false winner (test enough strategies and one looks great in-sample by luck). Deeper minute data (Databento/IBKR) would also let mean-reversion / session strategies be judged on hundreds of OOS trades — see docs/METHODOLOGY.md.

Session mean-reversion — the first hypothesis to SURVIVE OOS

diagnostics/mr_session.py (pre-registered: fade Bollinger extremes inside RTH, flat at close — the inverse of breakout). Anchored walk-forward, OOS:

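| Instrument | TF | OOS n | net_mean | PF | survives 2-tick slip? | 3-tick? |
| --- | --- | --- | --- | --- | --- | --- |
| MES | 5m | 68 | +$5.67 → +$4.42 | 1.22 | yes (+$4.42) | no (−$2.14) |
| MNQ | 5m | 90 | +$3.13 → +$2.63 | 1.05 | yes (thin) | |
| MES | 15m | 29 | +$16.76 | 1.86 | | |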

On the shallow 60-71d Yahoo window this looked like the first thing across BTC and futures to survive a pre-registered OOS test — but deep data killed it.
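To make the pre-registered rule concrete, here is a minimal sketch of a "fade Bollinger extremes inside RTH" entry signal. The window length and band width are illustrative assumptions (the source does not state the values used), the function name is hypothetical, and the flat-at-close exit is left out; this is not the code in diagnostics/mr_session.py.

```python
from statistics import fmean, pstdev


def bollinger_fade_signals(closes, in_rth, window=20, k=2.0):
    """Return one entry signal per bar: -1 short, +1 long, 0 no trade.

    closes: list of bar closes.
    in_rth: list of booleans, True when the bar falls inside regular hours.
    The bands use only the `window` bars before bar i, so the signal at
    bar i never looks at its own close when computing the band.
    """
    signals = [0] * len(closes)
    for i in range(window, len(closes)):
        if not in_rth[i]:
            continue
        history = closes[i - window:i]
        mid = fmean(history)
        sd = pstdev(history)
        if sd == 0:
            continue  # flat history gives zero-width bands; skip the bar
        if closes[i] > mid + k * sd:
            signals[i] = -1  # close above the upper band: fade by selling
        elif closes[i] < mid - k * sd:
            signals[i] = +1  # close below the lower band: fade by buying
    return signals
```

This is the mirror image of a breakout entry: the same band break that a Donchian-style system would buy is sold here, which is why the two hypotheses were tested as separate, pre-registered families.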

Deep-data verdict (Databento, 1 full year, 5m) — MR EDGE IS DEAD

| Instrument | OOS n | net_mean | PF | verdict |
| --- | --- | --- | --- | --- |
| MES 5m 1y | 184 | −$5.76 | 0.84 | rejected |
| MNQ 5m 1y | 279 | −$2.68 | 0.96 | rejected |

Both are negative even in-sample (MES −$4.89, MNQ −$2.41 over 441/427 trades). The summer-2026 positive was a regime artifact that disappeared on a multi-regime year. The few euros of Databento bought certainty and saved a real-capital deposit. This is the same verdict au2 reached on BTC: no edge clears costs. The line below about "thin but real" applied only to the shallow window and no longer holds.

(Historical note — the shallow-data reading that did NOT survive:) the edge was ~3-4 ticks gross and died at 3-tick slippage; fading the open is where slippage is worst.

Does it pass a prop challenge? No, not reliably (prop_mc --strategy mr, MES 5m, Topstep 50K, 2-tick slip): P(pass) ~0% at 1 ct, ~5% at 2 ct, ~28% at 4 ct — but 4 ct carries a 64% blow-up rate. The thin edge can't both hit a fixed-$ target and survive a fixed-$ trailing drawdown at the size required. Same structural wall au2's crypto prop_mc found.
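The structural wall described above can be reproduced with a small bootstrap Monte Carlo. The sketch below resamples per-trade net dollars and runs each path through a profit target and a trailing drawdown. The $3,000 target and $2,000 drawdown are illustrative placeholders rather than a verified firm rulebook, the drawdown trails on every trade (real plans typically trail end-of-day and add daily-loss rules), and the function names are hypothetical; this is not the core/prop_rules.py engine.

```python
import random


def run_challenge(trade_pnls, target, trailing_dd, rng, max_trades=500):
    """Simulate one bootstrapped path. Returns 'pass', 'bust' or 'timeout'."""
    equity = 0.0
    peak = 0.0
    for _ in range(max_trades):
        equity += rng.choice(trade_pnls)  # resample one trade with replacement
        peak = max(peak, equity)
        if equity <= peak - trailing_dd:
            return "bust"   # fell the full drawdown below the high-water mark
        if equity >= target:
            return "pass"
    return "timeout"


def prop_mc(trade_pnls, contracts=1, target=3000.0, trailing_dd=2000.0,
            n_paths=2000, seed=0):
    """Estimate the fraction of paths that pass, bust, or run out of trades."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    scaled = [p * contracts for p in trade_pnls]
    outcomes = [run_challenge(scaled, target, trailing_dd, rng)
                for _ in range(n_paths)]
    return {k: outcomes.count(k) / n_paths for k in ("pass", "bust", "timeout")}
```

Scaling `contracts` multiplies every trade by the same factor, so it shortens the distance to the target and to the drawdown limit in equal proportion. With a thin edge, that raises the pass rate and the blow-up rate together, which matches the pattern reported at 4 contracts.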

Where it could matter: a personal account (no deadline, no trailing DD, small size, slow compounding) — not "fast cash," and only if real execution slippage stays ≤ 2 ticks. Mandatory next steps before any money: deeper minute data across multiple regimes, and forward live-paper measuring actual fill slippage on the open fade.

Layout

Au2fut/
├── core/
│ ├── instruments.py CME micro specs + $ cost model
│ └── prop_rules.py Topstep/Apex trailing-DD / daily-loss / target engine
├── data/
│ └── fetch.py Yahoo bar fetcher (cached); swap for Databento/IBKR later
├── diagnostics/
│ ├── edge_scan.py vol-gated Donchian backtest, net $ verdict
│ └── prop_mc.py Monte Carlo P(pass)/EV through the prop rules
├── tests/ prop-rules engine tests (9, all green)
└── docs/METHODOLOGY.md the discipline, ported from au2

Status / next steps

  • Honest $ cost model + prop-rules engine (tested)
  • Edge-scan + prop Monte Carlo, runnable on real data
  • OOS validation (oos_validate.py) — in-sample edge did NOT survive on free data
  • Deeper data: minute bars w/ years of history (Databento/IBKR) → swap data/fetch.py
  • Re-run OOS for a statistically meaningful sample (target ≥ 100 OOS trades)
  • Calibrate FUT_RT_COMMISSION to a real prop plan fee schedule
  • Forward live-paper to confirm backtest net $/trade
  • Only then: executor + prop-risk guard (port from au2 prop_guard.py)

About

Prove a trading edge survives real costs out-of-sample, or kill it. An honest edge-validation harness for CME micro futures & prop challenges.
