Tech Blog | Tech Report
Evaluation harness for Apodex-1.0 on public deep-research benchmarks.
AgentHarness is the open-source evaluation harness used to reproduce the public benchmark results for Apodex-1.0 in a standard ReAct setup. Apodex-1.0 is a verification-centric model for deep research developed by the Apodex team. This repository focuses on the public, standard ReAct evaluation setup reported in the paper.
Apodex-1.0 results across deep-research benchmarks
Open-source Apodex-1.0 variants on the four-benchmark deep-research suite:
| Model | BrowseComp | BrowseComp-ZH | HLE-Text | DeepSearchQA |
|---|---|---|---|---|
| Apodex-1.0-mini | 71.5 | 80.6 | 46.8 | 82.2 |
| Apodex-1.0-4B-SFT | 48.8 | 63.5 | 32.9 | 69.9 |
| Apodex-1.0-2B-SFT | 27.9 | 35.0 | 18.2 | 49.9 |
| Apodex-1.0-0.8B-SFT | 13.9 | 10.7 | 11.2 | 25.8 |
uv sync --python 3.12
python3 -m sglang.launch_server \
  --model-path apodex/Apodex-1.0-35B-A3B \
  --tp 8 \
  --host 0.0.0.0 \
  --port 1234 \
  --context-length 262144 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3
For smaller variants and other serving options, see the Hugging Face model card.
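Before moving on, it can help to confirm that the server answers on its OpenAI-compatible API. The sketch below is illustrative and not part of the harness: it assumes the endpoint exposes the conventional `/v1/models` route and that the server is reachable at `http://localhost:1234/v1` (the port comes from the launch command above). The helper names are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str, timeout: float = 10.0) -> list[str]:
    """Return the model ids the server reports (assumes the usual
    {"data": [{"id": ...}, ...]} response shape)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


# Example call, once the server is up (address is an assumption):
# print(list_models("http://localhost:1234/v1"))
```

If the call raises a connection error, the server is probably still loading weights or is bound to a different host or port.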
cp .env.example .env
Fill in the required keys in .env: OPENAI_BASE_URL / OPENAI_API_KEY / OPENAI_MODEL point at the agent model (the SGLang endpoint launched above or any OpenAI-compatible service); SERPER_API_KEY / JINA_API_KEY / E2B_API_KEY enable web search, web fetch, and the code sandbox respectively.
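A quick pre-flight check can catch an empty key before a long run. The following is a minimal sketch, not the harness's own loader: it assumes `.env` uses plain `KEY=VALUE` lines and treats all six keys named above as required. `parse_env`, `missing_keys`, and `check_env_file` are hypothetical helper names.

```python
from pathlib import Path

# The six keys named in the paragraph above.
REQUIRED_KEYS = (
    "OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL",
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Read an env file from disk and report its missing keys."""
    return missing_keys(parse_env(Path(path).read_text(encoding="utf-8")))
```

Calling `check_env_file()` from the repository root returns an empty list when every key has a value.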
wget https://huggingface.co/datasets/apodex/Deep-Research-Benchmarks/resolve/main/deep_research_benchmarks_260607.zip
unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip
rm deep_research_benchmarks_260607.zip
The single quotes around the password are required: it contains shell metacharacters (*, (, )).
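The quoting requirement is easiest to see when the command is assembled from a script. This small sketch uses Python's standard `shlex.quote` to produce the same single-quoted form as the `unzip` command above. It only builds and prints the command string and does not run it.

```python
import shlex

password = "apodex*()_2026"
archive = "deep_research_benchmarks_260607.zip"

# shlex.quote wraps any string containing shell metacharacters in single
# quotes, so *, ( and ) reach unzip literally instead of being interpreted.
command = f"unzip -P {shlex.quote(password)} {archive}"
print(command)
# → unzip -P 'apodex*()_2026' deep_research_benchmarks_260607.zip

# Round-trip check: splitting the quoted command recovers the raw password.
print(shlex.split(command)[2])
# → apodex*()_2026
```

Without the quotes, the shell would try to parse the parentheses itself and reject the command before `unzip` ever ran.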
HLE is not included because its license forbids redistributing the answers. To run hle_text, accept the license on cais/hle and place the JSONL at benchmarks/datasets/HLE-text/standardized_data.jsonl.
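Once the file is in place, a short sanity check confirms that it sits at the expected path and parses as JSON Lines. This is a sketch under stated assumptions: it verifies only that each non-empty line is valid JSON and makes no claim about the fields the harness expects. `count_jsonl_records` is a hypothetical helper.

```python
import json
from pathlib import Path

# Path taken from the instructions above.
HLE_PATH = Path("benchmarks/datasets/HLE-text/standardized_data.jsonl")


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines, raising ValueError on the first invalid one."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


# Example call from the repository root:
# print(count_jsonl_records(HLE_PATH))
```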
uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --limit 1 \
  --concurrency 1 \
  --out ./tmp/smoke
uv run python -m benchmarks.runner.run_subprocess \
  --benchmark browsecomp \
  --pipeline react_base \
  --profile default \
  --runs 5 \
  --concurrency 30 \
  --out ./bc-runs
uv run python -m benchmarks.runner.check_progress ./bc-runs
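To queue the same configuration for more than one benchmark, the command line can be generated from a small script. The sketch below reuses only the flags shown in the commands above. The output directory naming and the use of `hle_text` as a `--benchmark` value are assumptions, and `build_run_command` is a hypothetical helper. It prints the commands instead of running them.

```python
import shlex


def build_run_command(benchmark: str, out_dir: str,
                      runs: int = 5, concurrency: int = 30) -> list[str]:
    """Assemble the runner invocation as an argv list (no shell involved)."""
    return [
        "uv", "run", "python", "-m", "benchmarks.runner.run_subprocess",
        "--benchmark", benchmark,
        "--pipeline", "react_base",
        "--profile", "default",
        "--runs", str(runs),
        "--concurrency", str(concurrency),
        "--out", out_dir,
    ]


# Benchmark identifiers and output paths here are illustrative assumptions.
for name in ("browsecomp", "hle_text"):
    print(shlex.join(build_run_command(name, f"./{name}-runs")))
```

Passing the argv list straight to `subprocess.run` avoids shell quoting issues altogether.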
Each question runs in its own subprocess, which makes runs easier to reproduce and debug:
- isolated execution per question
- no asyncio saturation
- individual hangs can be SIGKILL'd
- failed samples can be rerun independently
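The idea behind the list above can be illustrated with a minimal sketch. This is not the harness's actual runner: it uses the standard `subprocess.run` with a timeout, which kills the child when the deadline passes (SIGKILL on POSIX), and it collects the samples that did not finish cleanly so they can be retried. The sample ids and toy commands are made up for illustration.

```python
import subprocess
import sys


def run_isolated(argv: list[str], timeout_s: float) -> str:
    """Run one sample in its own process and classify the outcome."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


# Toy stand-ins for per-question workers (hypothetical).
samples = {
    "q1": [sys.executable, "-c", "print('answer')"],            # finishes
    "q2": [sys.executable, "-c", "import time; time.sleep(60)"],  # hangs
    "q3": [sys.executable, "-c", "raise SystemExit(1)"],         # crashes
}

statuses = {qid: run_isolated(argv, timeout_s=5.0)
            for qid, argv in samples.items()}
to_rerun = [qid for qid, status in statuses.items() if status != "ok"]
print(statuses)
print(to_rerun)
```

Because each sample lives in its own process, a hang or crash in one question cannot stall or corrupt the others, and the rerun list is just the set of ids whose status is not `ok`.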
Supported benchmarks: BrowseComp, BrowseComp-ZH, xbench-DeepResearch, Humanity's Last Exam (text-only), SuperChem, FrontierScience-Research, FrontierScience-Olympiad, DeepSearchQA, WideSearch.
See benchmarks/README.md for dataset layout, judge configuration, and how to add a new benchmark.
@techreport{apodex2026,
  title  = {Apodex-1.0: A Verification-Centric Agent Team for Discoverative Intelligence},
  author = {Apodex Team},
  year   = {2026}
}
Apache 2.0; see LICENSE.