SiEval is a model delivery quality verification system with an asynchronous streaming evaluation engine, iterative feedback loop, and resilient sharded persistence. It verifies the entire model delivery pipeline — training → conversion → inference → evaluation.
- Asynchronous streaming — process samples concurrently without waiting for batch completion
- Iterative feedback loop — multi-turn evaluation with feedback
- Resilient persistence — sharded, append-only storage for crash recovery
- 11 mainstream benchmarks — AIME 2024/2025, DROP, GPQA-Diamond, HumanEval, IFEval, LiveCodeBench, MATH-500, MMLU, MMLU-Pro, T-Eval (math, code, reasoning, knowledge, instruction-following, tool-use)
- Type-safe pipelines — fully typed task stages (preprocess → infer → postprocess → feedback)
- YAML-based configuration — batch evaluation with model derivation and quota allocation
- Inference orchestration — recipe-driven inference with auto-resolve and backend abstraction (vLLM, SGLang)
- Anomaly detection — built-in detection rules for output quality, performance, and correctness
- Profiling — stage timing, I/O metrics, and token usage tracking
Requirements: Unix (Linux, macOS), Python ≥ 3.12, PDM (recommended) or pip
git clone https://github.com/scitix/sieval.git cd sieval pdm install # or: pip install -e .
Optional extras (per-benchmark dependencies):
pip install -e ".[math]" # AIME 2024/2025, MATH-500 (math-verify) pip install -e ".[drop]" # DROP (numpy, scipy) pip install -e ".[ifeval]" # IFEval (absl, langdetect, nltk, immutabledict) pip install -e ".[t-eval]" # T-Eval (numpy, sentence-transformers) pip install -e ".[math,drop,ifeval,t-eval]" # all extras at once
Dataset paths below use HuggingFace repo ids for HF-sourced datasets (e.g.
HuggingFaceH4/aime_2024) and${SIEVAL_DATA_DIR}/<name>for URL-sourced datasets. SetSIEVAL_DATA_DIR(default~/.sieval/data) before running any command that resolves a URL-sourced dataset.
Start from an example — two-step flow:
cp examples/quickstart.yaml eval.yaml $EDITOR eval.yaml # set model checkpoint + container image # Step 1: stage the data sieval dataset download aime_2024 # Step 2: run eval sieval eval eval.yaml
See examples/README.md for more scenarios (leaderboard, recipe overrides) and examples/hardware/ for hardware-pinned reference configs.
Discover tasks / datasets:
sieval dataset list # registered datasets + licenses + download status sieval task list --domain Mathematics # filter tasks by domain sieval dataset show aime_2024 # dataset detail, incl. the YAML path: to paste sieval dataset download aime_2024 # stage data into $SIEVAL_DATA_DIR
All-in-one (launch inference, evaluate, cleanup — recommended entry point):
sieval run config.yaml sieval run config.yaml --resume
Evaluate against an already-online endpoint:
sieval eval leaderboards/sft_fast_202511.yaml --model gpt-4o # `sieval eval` is a shortcut for the underlying resource verb: sieval leaderboard run leaderboards/sft_fast_202511.yaml --model gpt-4o
Inference management:
sieval infer start /path/to/Qwen3-8B # auto-resolve and launch sieval infer list # show running services sieval infer logs qwen3-8b -f # stream engine logs sieval infer stop qwen3-8b # graceful shutdown
Programmatic usage:
import anyio from sieval.datasets import MMLUDataset from sieval.tasks import MMLUZeroShotGenTask from sieval.core.models import ChatModel from sieval.core.runners import TaskRunner, TaskRunnerConfig async def main(): dataset = MMLUDataset("cais/mmlu") model = ChatModel("gpt-4o", max_retries=3, concurrency_limit=128) task = MMLUZeroShotGenTask(dataset=dataset, model=model) runner = TaskRunner( task=task, config=TaskRunnerConfig(result_dir="./outputs/mmlu", auto_resume=True), ) results = await runner.arun() print(results) anyio.run(main)
- Configuration Guide — YAML format, task pipeline, model resource pool, anomaly detection
- Concurrency Control — four-level concurrency model
- Profiling & Observability — stage timing, I/O metrics, token tracking
- Inference Management — full infer subcommand reference
See CONTRIBUTING.md for development setup, project architecture, code conventions, and the PR process.
Apache License 2.0 — see LICENSE for details.
@software{sieval2026, title = {SiEval: Asynchronous Streaming Evaluation Framework}, author = {{ScitiX}}, year = {2026}, url = {https://github.com/scitix/sieval} }