Benchmarking Causal Thinking of LLM Agents in Games · ICML 2026 (Oral)
Project page & leaderboard · Paper (arXiv coming soon)
CausalGame evaluates the causal thinking of LLM agents through interactive games. The agent acts as a drone designer who must uncover the hidden causal mechanism behind drone survival within a limited budget of experiments, working from observations that are censored by survivorship, confounded by hidden variables, and corrupted by noise. Every scenario is governed by an underlying structural causal model (SCM), so the agent can only win by recovering the true causal mechanism rather than fitting surface-level correlations.
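To make the gap between correlation and mechanism concrete, here is a minimal toy SCM in plain Python. It is not one of the benchmark scenarios: the variables (`hostile`, `armor`) and all probabilities are invented for illustration. A hidden confounder makes armor look harmful in observational data, while intervening on armor reveals a positive effect.

```python
import random


def sample_drone(rng, armor=None):
    """Draw one drone from a toy SCM; pass armor=0/1 to intervene (do-operator)."""
    hostile = rng.random() < 0.5  # hidden confounder, never shown to the agent
    if armor is None:
        # Observational regime: armor is mostly fitted in hostile zones.
        armor = int(rng.random() < (0.8 if hostile else 0.2))
    # True mechanism: armor always adds +0.1 to the survival probability.
    p_survive = (0.3 if hostile else 0.9) + 0.1 * armor
    survived = rng.random() < p_survive
    return armor, survived


def survival_gap(samples):
    """Return P(survive | armor=1) - P(survive | armor=0) over (armor, survived) pairs."""
    rates = []
    for value in (0, 1):
        group = [survived for armor, survived in samples if armor == value]
        rates.append(sum(group) / len(group))
    return rates[1] - rates[0]


rng = random.Random(0)
observed = [sample_drone(rng) for _ in range(50_000)]
intervened = [sample_drone(rng, armor=a) for a in (0, 1) for _ in range(25_000)]

obs_gap = survival_gap(observed)   # negative: armor correlates with hostile zones
do_gap = survival_gap(intervened)  # positive: close to the true +0.1 effect
print(f"observational gap:  {obs_gap:+.3f}")
print(f"interventional gap: {do_gap:+.3f}")
```

An agent that fits the observational gap would remove the armor; one that runs the experiment would keep it. The benchmark scenarios layer survivorship censoring and noise on top of this kind of structure.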
- 14 game scenarios across three families (Antenna Trap, Deployment Zone Trap, Weather), instantiating selection bias, hidden confounders, noisy measurements, local optima, and environment shift
- Two-stage protocol: exploration (200 drones, ≤10 deployments), followed by a one-shot evaluation on a 1,000-drone fleet against a scenario-specific win threshold
- Multiple execution modes: Agentic (ReAct tool calling), Prompting (single-turn code execution), and an OpenCode coding-agent harness
- 29 frontier LLMs evaluated in the paper; see the leaderboard

Install the dependencies and configure API keys:

    git clone https://github.com/viewsetting/CausalGame.git
    cd CausalGame
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt
    cp .env.example .env   # then fill in the API keys you need

Start the simulation server:

    uvicorn api.app:app --port 8000
    # or with Docker:
    # docker build -t causalgame . && docker run -p 8000:8000 causalgame

The default scenario is `antenna_trap`; switch via the `CAUSALGAME_EXPERIMENT` environment variable or the admin API:

    curl -X POST "http://localhost:8000/api/admin/experiment/switch?experiment_name=weather_noise"

Run an agent against the server:

    python run_agent.py --list-models   # see the 29 configured models

    # Prompting mode (single-context, agent writes Python against `client`)
    python run_agent.py --model gpt-5-mini --experiment antenna_trap --mode legacy

    # Agentic mode (multi-turn ReAct tool calling)
    python run_agent.py --model gpt-5-mini --experiment antenna_trap --mode hybrid

Mode naming: `--mode hybrid` is the paper's Agentic (ReAct) mode and `--mode legacy` is the paper's Prompting mode.
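For sweeps over several models and scenarios, a small wrapper can translate the paper's mode names into the CLI values and build one command per pair. The sketch below is a hypothetical helper, not part of the repository; it uses only the `--model`, `--experiment`, and `--mode` flags shown above and prints the commands as a dry run.

```python
import itertools
import shlex
import sys

# Paper name -> CLI value, as stated in the mode-naming note.
PAPER_MODE_TO_FLAG = {"agentic": "hybrid", "prompting": "legacy"}


def build_run_commands(models, experiments, paper_mode="agentic"):
    """Return one argv list per (model, experiment) pair for run_agent.py."""
    flag = PAPER_MODE_TO_FLAG[paper_mode.lower()]
    return [
        [sys.executable, "run_agent.py",
         "--model", model, "--experiment", experiment, "--mode", flag]
        for model, experiment in itertools.product(models, experiments)
    ]


commands = build_run_commands(["gpt-5-mini"], ["antenna_trap", "weather_noise"])
for argv in commands:
    # Dry run: print the command instead of executing it.
    print(shlex.join(argv[1:]))
    # To launch for real: subprocess.run(argv, check=True)
```

Building argv lists rather than shell strings avoids quoting problems if a model identifier contains characters such as `/`.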
The coding-agent harness used for the paper's OpenCode comparison runs the game backend and an OpenCode agent in Docker:

    ./run_opencode.sh -e antenna_trap -m openrouter/moonshotai/kimi-k2.5 --run

| Scenario | Selection bias | Hidden confounder | Threshold |
|---|---|---|---|
| `antenna_trap` | β | β | 75% |
| `antenna_trap_high_def` | β | β | 75% |
| `antenna_trap_local_optima` | β | β | 75% |
| `antenna_trap_no_history` | β | β | 75% |
| `antenna_trap_no_selection_bias` | β | β | 75% |
| `antenna_trap_simpsons_paradox` | β | β | 75% |
| `deployment_zone_trap_categorical` | β | β | 75% |
| `deployment_zone_trap_categorical_high_def` | β | β | 75% |
| `deployment_zone_trap_categorical_local_optima` | β | β | 75% |
| `deployment_zone_trap_categorical_no_history` | β | β | 75% |
| `deployment_zone_trap_categorical_no_selection_bias` | β | β | 75% |
| `deployment_zone_trap_categorical_simpsons_paradox` | β | β | 75% |
| `deployment_zone_trap_env_shift` | β | β | 75% |
| `weather_noise` | β | β | 55% |
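The thresholds in the table can be turned into a simple win check for the evaluation stage. The sketch below assumes that the score is the fraction of the 1,000-drone fleet that survives and that meeting the threshold exactly counts as a win; neither detail is confirmed here, and `is_win` is a hypothetical helper rather than the repository's judge logic.

```python
# Scenario names and thresholds copied from the table above.
SCENARIOS_AT_75 = [
    "antenna_trap",
    "antenna_trap_high_def",
    "antenna_trap_local_optima",
    "antenna_trap_no_history",
    "antenna_trap_no_selection_bias",
    "antenna_trap_simpsons_paradox",
    "deployment_zone_trap_categorical",
    "deployment_zone_trap_categorical_high_def",
    "deployment_zone_trap_categorical_local_optima",
    "deployment_zone_trap_categorical_no_history",
    "deployment_zone_trap_categorical_no_selection_bias",
    "deployment_zone_trap_categorical_simpsons_paradox",
    "deployment_zone_trap_env_shift",
]
WIN_THRESHOLDS = {name: 0.75 for name in SCENARIOS_AT_75}
WIN_THRESHOLDS["weather_noise"] = 0.55


def is_win(survivors, scenario, fleet_size=1000):
    """Assumed rule: win if the surviving fraction reaches the scenario threshold."""
    if not 0 <= survivors <= fleet_size:
        raise ValueError("survivors must be between 0 and fleet_size")
    # An unknown scenario raises KeyError instead of silently using a default.
    return survivors / fleet_size >= WIN_THRESHOLDS[scenario]


print(is_win(760, "antenna_trap"))   # 76% against a 75% threshold
print(is_win(560, "weather_noise"))  # 56% against a 55% threshold
```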
Scenario configs live in `experiments/<name>/` (`game.json`, `action_space.json`, `environment_variables.json`, prompts). Fully specified structural equations for all scenarios are given in the paper appendix.

Repository layout:

    CausalGame/
    ├── api/              # FastAPI simulation engine
    │   ├── middleware/   # DroneSheet (single source of truth), visibility control
    │   └── modules/      # SCM environments, game logic (combat/damage/judge), agent API
    ├── agent/            # LLM agent harness (providers, tools, sandbox, orchestrator)
    ├── experiments/      # the 14 benchmark scenario configs
    ├── config/           # agent.json (29 paper models), server.json
    ├── docs/             # architecture, API reference, adding new SCM scenarios
    ├── run_agent.py      # single-run entry point (one model × one scenario)
    └── run_opencode.sh   # OpenCode coding-agent harness (Docker)

Scenarios are pluggable SCMs registered with the `@register_scm()` decorator; see `docs/adding-new-scm.md`.
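For readers unfamiliar with decorator-based registries, the general pattern looks roughly like the sketch below. This is a generic illustration only: `register_toy_scm`, `SCM_REGISTRY`, and `CoinFlipSurvival` are hypothetical names, and the arguments and base classes expected by the real `@register_scm()` may differ, so treat `docs/adding-new-scm.md` as the authority.

```python
import random
from typing import Callable, Dict

# Hypothetical registry mapping scenario names to SCM classes.
SCM_REGISTRY: Dict[str, type] = {}


def register_toy_scm(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: type) -> type:
        if name in SCM_REGISTRY:
            raise ValueError(f"SCM {name!r} is already registered")
        SCM_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@register_toy_scm("coin_flip_survival")
class CoinFlipSurvival:
    """Toy environment: shielding raises the survival probability."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def sample(self, shielded: bool) -> bool:
        return self.rng.random() < (0.9 if shielded else 0.4)


# Look the environment up by name, as a scenario loader might.
env = SCM_REGISTRY["coin_flip_survival"](seed=1)
outcome = env.sample(shielded=True)
print(sorted(SCM_REGISTRY), outcome)
```

The appeal of this design is that adding a scenario requires no edits to a central list: importing the module that defines the class is enough to make it discoverable by name.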

Run the test suite:

    python -m unittest discover -s tests


If you use CausalGame, please cite:

    @inproceedings{chen2026causalgame,
      title     = {CausalGame: Benchmarking Causal Thinking of LLM Agents in Games},
      author    = {Chen, Zhenhao and Chen, Yongqiang and Liu, Chenxi and Yu, Junchi and Song, Xiangchen and Li, Zijian and Li, Jialin and Torr, Philip and Han, Bo and Zhang, Kun},
      booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
      year      = {2026},
      note      = {Oral presentation}
    }