viewsetting/CausalGame

CausalGame

Benchmarking Causal Thinking of LLM Agents in Games, ICML 2026 (Oral)

🌐 Project page & leaderboard · 📄 Paper (arXiv coming soon)

CausalGame evaluates the causal thinking of LLM agents through interactive games. The agent acts as a drone designer who must uncover the hidden causal mechanism behind drone survival using a limited budget of experiments, under observations that are censored by survivorship, confounded by hidden variables, and corrupted by noise. Every scenario is governed by an underlying structural causal model (SCM), so the agent can only win by recovering the true causal mechanism rather than fitting surface-level correlations.

  • 14 game scenarios across three families (Antenna Trap, Deployment Zone Trap, Weather), instantiating selection bias, hidden confounders, noisy measurements, local optima, and environment shift
  • Two-stage protocol: exploration (200 drones, ≤10 deployments) → one-shot evaluation on a 1,000-drone fleet against a scenario-specific win threshold
  • Multiple execution modes: Agentic (ReAct tool calling), Prompting (single-turn code execution), and an OpenCode coding-agent harness
  • 29 frontier LLMs evaluated in the paper; see the leaderboard
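To make "censored by survivorship" concrete, here is a minimal toy sketch in Python. It is not one of the benchmark's SCMs: the two design features and the survival rule are hypothetical, chosen only to show how conditioning on survivors can create a correlation that does not exist in the full fleet.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, seed=0):
    """Toy fleet: two independent binary features (hypothetical).

    Assumed survival rule: a drone survives if it has at least one feature.
    Returns the feature correlation in the full fleet and among survivors.
    """
    rng = random.Random(seed)
    fleet = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)]
    survivors = [(a, b) for a, b in fleet if a or b]
    corr_all = pearson(*zip(*fleet))
    corr_survivors = pearson(*zip(*survivors))
    return corr_all, corr_survivors


corr_all, corr_survivors = simulate()
print(f"full fleet: {corr_all:+.3f}, survivors only: {corr_survivors:+.3f}")
```

In the full fleet the features are independent, so their correlation is close to zero. Among survivors it is clearly negative, even though neither feature influences the other. An agent that only sees surviving drones and fits correlations would infer a trade-off that is not in the mechanism.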

Quick start

1. Install

git clone https://github.com/viewsetting/CausalGame.git
cd CausalGame
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # then fill in the API keys you need

2. Start the simulation backend

uvicorn api.app:app --port 8000
# or with Docker:
# docker build -t causalgame . && docker run -p 8000:8000 causalgame

The default scenario is antenna_trap; switch via the CAUSALGAME_EXPERIMENT environment variable or the admin API:

curl -X POST "http://localhost:8000/api/admin/experiment/switch?experiment_name=weather_noise"
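The same switch can be issued from Python. The sketch below uses only the standard library and the endpoint shown in the curl command above. The helper names are illustrative, and the response format is not documented here, so the body is returned as raw text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the backend is running locally on port 8000, as in step 2.
BASE_URL = "http://localhost:8000"


def build_switch_url(experiment_name, base_url=BASE_URL):
    """Build the admin URL for switching the active scenario."""
    query = urlencode({"experiment_name": experiment_name})
    return f"{base_url}/api/admin/experiment/switch?{query}"


def switch_experiment(experiment_name, base_url=BASE_URL, timeout=10.0):
    """POST to the admin endpoint and return the raw response body."""
    request = Request(build_switch_url(experiment_name, base_url), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Prints the URL only; call switch_experiment("weather_noise") once the
# backend is up to actually perform the switch.
print(build_switch_url("weather_noise"))
```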

3. Run an agent

python run_agent.py --list-models # see the 29 configured models
# Prompting mode (single-context, agent writes Python against `client`)
python run_agent.py --model gpt-5-mini --experiment antenna_trap --mode legacy
# Agentic mode (multi-turn ReAct tool calling)
python run_agent.py --model gpt-5-mini --experiment antenna_trap --mode hybrid

Mode naming: --mode hybrid is the paper's Agentic (ReAct) mode and --mode legacy is the paper's Prompting mode.
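For scripted runs, the naming mismatch can be hidden behind a small wrapper. The sketch below maps the paper's mode names to the CLI values listed above and builds the run_agent.py command line. The helper names are illustrative and not part of the repository.

```python
import subprocess
import sys

# Paper terminology -> value accepted by `--mode` (taken from the note above).
PAPER_TO_CLI_MODE = {"Agentic": "hybrid", "Prompting": "legacy"}


def build_command(model, experiment, paper_mode):
    """Return the argv list for one run_agent.py invocation."""
    cli_mode = PAPER_TO_CLI_MODE[paper_mode]  # KeyError on unknown mode names
    return [
        sys.executable, "run_agent.py",
        "--model", model,
        "--experiment", experiment,
        "--mode", cli_mode,
    ]


def launch(model, experiment, paper_mode):
    """Run the agent; assumes the current directory is the repository root."""
    return subprocess.run(build_command(model, experiment, paper_mode), check=True)


print(" ".join(build_command("gpt-5-mini", "antenna_trap", "Agentic")[1:]))
```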

OpenCode mode (optional)

The coding-agent harness used for the paper's OpenCode comparison runs the game backend and an OpenCode agent in Docker:

./run_opencode.sh -e antenna_trap -m openrouter/moonshotai/kimi-k2.5 --run

The 14 scenarios

| Scenario | Selection bias | Hidden confounder | Threshold |
|---|---|---|---|
| antenna_trap | ✓ | - | 75% |
| antenna_trap_high_def | ✓ | ✓ | 75% |
| antenna_trap_local_optima | ✓ | - | 75% |
| antenna_trap_no_history | ✓ | - | 75% |
| antenna_trap_no_selection_bias | - | - | 75% |
| antenna_trap_simpsons_paradox | ✓ | ✓ | 75% |
| deployment_zone_trap_categorical | ✓ | - | 75% |
| deployment_zone_trap_categorical_high_def | ✓ | ✓ | 75% |
| deployment_zone_trap_categorical_local_optima | ✓ | - | 75% |
| deployment_zone_trap_categorical_no_history | ✓ | - | 75% |
| deployment_zone_trap_categorical_no_selection_bias | - | - | 75% |
| deployment_zone_trap_categorical_simpsons_paradox | ✓ | ✓ | 75% |
| deployment_zone_trap_env_shift | ✓ | - | 75% |
| weather_noise | ✓ | - | 55% |
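As a rough illustration of how a threshold might be applied in the one-shot evaluation, the sketch below checks a fleet outcome against a scenario's threshold. It makes two assumptions that the text above does not confirm: that the threshold is compared against the fraction of surviving drones, and that meeting the threshold exactly counts as a win. The actual judge logic lives in the backend.

```python
# Thresholds copied from the table above (two scenarios shown for brevity).
WIN_THRESHOLDS = {"antenna_trap": 0.75, "weather_noise": 0.55}


def is_win(survivors, fleet_size, threshold):
    """Assumed rule: win when the surviving fraction reaches the threshold."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    return survivors / fleet_size >= threshold


# Hypothetical outcome: 800 of the 1,000 evaluation drones survive.
print(is_win(800, 1000, WIN_THRESHOLDS["antenna_trap"]))
```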

Scenario configs live in experiments/<name>/ (game.json, action_space.json, environment_variables.json, prompts). Fully specified structural equations for all scenarios are given in the paper appendix.
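A minimal sketch for inspecting a scenario's configuration from Python, assuming the three JSON files named above sit directly in experiments/<name>/. The schema of each file is not described here, so the loader returns the parsed JSON unchanged. The function name is illustrative.

```python
import json
from pathlib import Path

# File names taken from the paragraph above; prompts are not loaded here.
CONFIG_FILES = ("game.json", "action_space.json", "environment_variables.json")


def load_scenario_config(name, root="experiments"):
    """Load a scenario's JSON configs into a dict keyed by file stem."""
    scenario_dir = Path(root) / name
    config = {}
    for filename in CONFIG_FILES:
        path = scenario_dir / filename
        with path.open(encoding="utf-8") as handle:
            config[path.stem] = json.load(handle)
    return config
```

From the repository root, `load_scenario_config("antenna_trap")` would return a dict with the keys `game`, `action_space`, and `environment_variables`.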

Project structure

CausalGame/
├── api/                # FastAPI simulation engine
│   ├── middleware/     # DroneSheet (single source of truth), visibility control
│   └── modules/        # SCM environments, game logic (combat/damage/judge), agent API
├── agent/              # LLM agent harness (providers, tools, sandbox, orchestrator)
├── experiments/        # the 14 benchmark scenario configs
├── config/             # agent.json (29 paper models), server.json
├── docs/               # architecture, API reference, adding new SCM scenarios
├── run_agent.py        # single-run entry point (one model × one scenario)
└── run_opencode.sh     # OpenCode coding-agent harness (Docker)

Adding a new scenario

Scenarios are pluggable SCMs registered with the @register_scm() decorator; see docs/adding-new-scm.md.

Tests

python -m unittest discover -s tests

Citation

@inproceedings{chen2026causalgame,
 title = {CausalGame: Benchmarking Causal Thinking of LLM Agents in Games},
 author = {Chen, Zhenhao and Chen, Yongqiang and Liu, Chenxi and Yu, Junchi
 and Song, Xiangchen and Li, Zijian and Li, Jialin
 and Torr, Philip and Han, Bo and Zhang, Kun},
 booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
 year = {2026},
 note = {Oral presentation}
}

License

MIT
