Name	Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows	.github/workflows
agent_harness	agent_harness
assets	assets
benchmarks	benchmarks
plugins	plugins
scripts	scripts
workflows	workflows
.env.example	.env.example
.gitignore	.gitignore
LICENSE	LICENSE
README.md	README.md
pyproject.toml	pyproject.toml
uv.lock	uv.lock

AgentHarness

Evaluation harness for Apodex-1.0 on public deep-research benchmarks.

AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0 in a standard ReAct setup. Apodex-1.0 is a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.

Apodex-1.0 results across deep-research benchmarks

📊 Performance

Open-source Apodex-1.0 variants on the four-benchmark deep-research suite:

Model	BrowseComp	BrowseComp-ZH	HLE-Text	DeepSearchQA
Apodex-1.0-mini	71.5	80.6	46.8	82.2
Apodex-1.0-4B-SFT	48.8	63.5	32.9	69.9
Apodex-1.0-2B-SFT	27.9	35.0	18.2	49.9
Apodex-1.0-0.8B-SFT	13.9	10.7	11.2	25.8

⚡ Quick Start on the harness

1. Install dependencies

uv sync --python 3.12

2. Serve the model (SGLang)

python3 -m sglang.launch_server \
 --model-path apodex/Apodex-1.0-35B-A3B \
 --tp 8 \
 --host 0.0.0.0 \
 --port 1234 \
 --context-length 262144 \
 --tool-call-parser qwen3_coder \
 --reasoning-parser qwen3 \

For smaller variants, other serving options, see the Hugging Face model card.

3. Configure environment variables

cp .env.example .env

Fill in the required keys in .env — OPENAI_BASE_URL / OPENAI_API_KEY / OPENAI_MODEL point at the agent model (your SGLang endpoint from step 2 or any OpenAI-compatible service); SERPER_API_KEY / JINA_API_KEY / E2B_API_KEY enable web search, web fetch, and the code sandbox respectively.

4. Download the benchmark datasets

wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip

The single quotes around the password are required — it contains shell-meta characters (*, (, )).

HLE is not included. Its license forbids redistributing the answers. To run hle_text, accept the license on cais/hle and place the JSONL at benchmarks/datasets/HLE-text/standardized_data.jsonl.

5. Run a smoke test

uv run python -m benchmarks.runner.run_subprocess \
 --benchmark browsecomp \
 --pipeline react_base \
 --profile default \
 --limit 1 \
 --concurrency 1 \
 --out ./tmp/smoke

6. Run a full benchmark

uv run python -m benchmarks.runner.run_subprocess \
 --benchmark browsecomp \
 --pipeline react_base \
 --profile default \
 --runs 5 \
 --concurrency 30 \
 --out ./bc-runs

7. Check progress and aggregate accuracy

uv run python -m benchmarks.runner.check_progress ./bc-runs

Each question runs in its own subprocess, which makes runs easier to reproduce and debug:

isolated execution per question
no asyncio saturation
individual hangs can be SIGKILL'd
failed samples can be rerun independently

✅ Supported Benchmarks

BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch

See benchmarks/README.md for dataset layout, judge configuration, and how to add a new benchmark.

⭐ Star History

📚 Citation

@techreport{apodex2026,
 title = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
 author = {Apodex Team},
 year = {2026}
}

📄 License

Apache 2.0 — see LICENSE.

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ApodexAI/AgentHarness

Folders and files

Latest commit

History

Repository files navigation

AgentHarness

📊 Performance

⚡ Quick Start on the harness

1. Install dependencies

2. Serve the model (SGLang)

3. Configure environment variables

4. Download the benchmark datasets

5. Run a smoke test

6. Run a full benchmark

7. Check progress and aggregate accuracy

✅ Supported Benchmarks

⭐ Star History

📚 Citation

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AgentHarness

📊 Performance

⚡ Quick Start on the harness

1. Install dependencies

2. Serve the model (SGLang)

3. Configure environment variables

4. Download the benchmark datasets

5. Run a smoke test

6. Run a full benchmark

7. Check progress and aggregate accuracy

✅ Supported Benchmarks

⭐ Star History

📚 Citation

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages