ApodexAI/AgentHarness


Apodex-1.0

📰 Tech Blog | 📄 Tech Report


AgentHarness

Evaluation harness for Apodex-1.0 on public deep-research benchmarks.

AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0, a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.

Apodex-1.0 results across deep-research benchmarks


📊 Performance

Open-source Apodex-1.0 variants on the four-benchmark deep-research suite:

| Model | BrowseComp | BrowseComp-ZH | HLE-Text | DeepSearchQA |
|---|---|---|---|---|
| Apodex-1.0-mini | 71.5 | 80.6 | 46.8 | 82.2 |
| Apodex-1.0-4B-SFT | 48.8 | 63.5 | 32.9 | 69.9 |
| Apodex-1.0-2B-SFT | 27.9 | 35.0 | 18.2 | 49.9 |
| Apodex-1.0-0.8B-SFT | 13.9 | 10.7 | 11.2 | 25.8 |

⚡ Quick Start on the harness

1. Install dependencies

uv sync --python 3.12

2. Serve the model (SGLang)

python3 -m sglang.launch_server \
 --model-path apodex/Apodex-1.0-35B-A3B \
 --tp 8 \
 --host 0.0.0.0 \
 --port 1234 \
 --context-length 262144 \
 --tool-call-parser qwen3_coder \
 --reasoning-parser qwen3

For smaller variants and other serving options, see the Hugging Face model card.
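Before moving on, it can help to confirm that the server from step 2 is answering. The sketch below assumes the server exposes an OpenAI-compatible `/v1/models` route on the port chosen above; the helper names `models_url` and `list_served_models` are illustrative and are not part of the harness.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Ask the server which model IDs it serves (GET <base_url>/models).

    Assumption: the response follows the OpenAI shape {"data": [{"id": ...}]}.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

With the server running locally, `list_served_models("http://localhost:1234/v1")` should return a list of model IDs; a connection error usually means the server has not finished loading yet.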

3. Configure environment variables

cp .env.example .env

Fill in the required keys in .env: OPENAI_BASE_URL, OPENAI_API_KEY, and OPENAI_MODEL point at the agent model (your SGLang endpoint from step 2 or any OpenAI-compatible service); SERPER_API_KEY, JINA_API_KEY, and E2B_API_KEY enable web search, web fetch, and the code sandbox, respectively.
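A quick way to catch an incomplete .env before launching a run is to check it for empty values. The following is a minimal sketch, not the harness's own loader: the key names come from the step above, while `parse_env_text` and `missing_keys` are hypothetical helpers that assume a plain `KEY=VALUE` file.

```python
from pathlib import Path

# Key names as listed in step 3.
REQUIRED_KEYS = (
    "OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL",
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key, "")]


def check_env_file(path: str = ".env") -> list[str]:
    """Read the file at `path` and report which required keys are unset."""
    return missing_keys(parse_env_text(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file()` from the repository root returns an empty list when every key has a value.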

4. Download the benchmark datasets

wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip

The single quotes around the password are required because it contains shell metacharacters (*, (, )).
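If the download step is scripted from Python, the same quoting can be produced programmatically. The snippet below is only an illustration of the quoting rule, using the standard-library `shlex.quote`; the password and archive name are the ones from step 4.

```python
import shlex

password = "apodex*()_2026"
archive = "deep_research_benchmarks_260607.zip"

# shlex.quote wraps the password in single quotes because it contains
# characters the shell would otherwise interpret (*, parentheses).
command = f"unzip -P {shlex.quote(password)} {archive}"
print(command)
# prints: unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
```

Passing the arguments as a list to `subprocess.run` avoids the shell entirely, in which case no quoting is needed.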

HLE is not included. Its license forbids redistributing the answers. To run hle_text, accept the license on cais/hle and place the JSONL at benchmarks/datasets/HLE-text/standardized_data.jsonl.
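After placing the HLE file, a small sanity check can confirm that it is readable line-delimited JSON. This sketch assumes nothing about the record fields, which are described in benchmarks/README.md rather than here; `count_jsonl_records` is a hypothetical helper, and only the path comes from the note above.

```python
import json
from collections.abc import Iterable

# Path taken from the note above.
HLE_PATH = "benchmarks/datasets/HLE-text/standardized_data.jsonl"


def count_jsonl_records(lines: Iterable[str]) -> int:
    """Count non-empty lines, raising ValueError on the first invalid one."""
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: not valid JSON ({exc.msg})") from exc
        count += 1
    return count
```

For example, `with open(HLE_PATH, encoding="utf-8") as f: print(count_jsonl_records(f))` reports how many records were found.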

5. Run a smoke test

uv run python -m benchmarks.runner.run_subprocess \
 --benchmark browsecomp \
 --pipeline react_base \
 --profile default \
 --limit 1 \
 --concurrency 1 \
 --out ./tmp/smoke

6. Run a full benchmark

uv run python -m benchmarks.runner.run_subprocess \
 --benchmark browsecomp \
 --pipeline react_base \
 --profile default \
 --runs 5 \
 --concurrency 30 \
 --out ./bc-runs

7. Check progress and aggregate accuracy

uv run python -m benchmarks.runner.check_progress ./bc-runs
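For intuition about what aggregating over several runs involves, the sketch below averages per-run accuracy and reports the spread across runs. It is not the logic of `check_progress`, whose output format is not described here; the input (one list of correct/incorrect flags per run) and the name `aggregate_accuracy` are assumptions made for illustration.

```python
from statistics import mean, stdev


def aggregate_accuracy(runs: list[list[bool]]) -> tuple[float, float]:
    """Return (mean accuracy in %, sample std across runs).

    Each inner list holds one flag per judged question in that run.
    Empty runs are ignored; std is 0.0 when only one run is available.
    """
    per_run = [100.0 * sum(flags) / len(flags) for flags in runs if flags]
    if not per_run:
        raise ValueError("no completed runs to aggregate")
    spread = stdev(per_run) if len(per_run) > 1 else 0.0
    return mean(per_run), spread
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models exceeds run-to-run noise.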

Each question runs in its own subprocess, which makes runs easier to reproduce and debug:

  • isolated execution per question
  • no asyncio saturation
  • individual hangs can be SIGKILL'd
  • failed samples can be rerun independently
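The isolation pattern in the list above can be sketched with the standard library alone. This is an illustration of the idea rather than the harness's runner: `run_isolated` is a hypothetical helper, and the per-question command is whatever the caller supplies.

```python
import subprocess
import sys


def run_isolated(argv: list[str], timeout_s: float) -> str:
    """Run one unit of work in its own process and classify the outcome.

    subprocess.run kills the child when the timeout expires (SIGKILL on
    POSIX), so a hung question cannot stall the rest of the batch.
    """
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Each status is independent, so only non-"ok" items need a rerun.
    print(run_isolated([sys.executable, "-c", "print('done')"], timeout_s=30))
```

Because every question maps to one process and one status, rerunning failures amounts to collecting the items whose status is not "ok" and submitting them again.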

✅ Supported Benchmarks

BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch

See benchmarks/README.md for dataset layout, judge configuration, and how to add a new benchmark.



📚 Citation

@techreport{apodex2026,
 title = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
 author = {Apodex Team},
 year = {2026}
}

📄 License

Apache 2.0; see LICENSE.
