SiEval

SiEval is a model delivery quality verification system with an asynchronous streaming evaluation engine, iterative feedback loop, and resilient sharded persistence. It verifies the entire model delivery pipeline — training → conversion → inference → evaluation.

Features

Asynchronous streaming — process samples concurrently without waiting for batch completion
Iterative feedback loop — multi-turn evaluation with feedback
Resilient persistence — sharded, append-only storage for crash recovery
11 mainstream benchmarks — AIME 2024/2025, DROP, GPQA-Diamond, HumanEval, IFEval, LiveCodeBench, MATH-500, MMLU, MMLU-Pro, T-Eval (math, code, reasoning, knowledge, instruction-following, tool-use)
Type-safe pipelines — fully typed task stages (preprocess → infer → postprocess → feedback)
YAML-based configuration — batch evaluation with model derivation and quota allocation
Inference orchestration — recipe-driven inference with auto-resolve and backend abstraction (vLLM, SGLang)
Anomaly detection — built-in detection rules for output quality, performance, and correctness
Profiling — stage timing, I/O metrics, and token usage tracking
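The streaming and persistence features above can be illustrated with a minimal sketch. This is not SiEval's implementation: the shard layout, the `NUM_SHARDS` constant, the record schema, and every function name below are assumptions made for illustration, and `evaluate` is a stand-in for a real model call. Each result is appended to a shard file as soon as its sample finishes, and a rerun skips ids that are already on disk.

```python
import asyncio
import json
import tempfile
from pathlib import Path

NUM_SHARDS = 4  # hypothetical shard count, not a SiEval setting


def shard_path(result_dir: Path, sample_id: int) -> Path:
    """Map a sample id to one of NUM_SHARDS append-only JSONL files."""
    return result_dir / f"shard-{sample_id % NUM_SHARDS:02d}.jsonl"


def load_done_ids(result_dir: Path) -> set[int]:
    """Collect ids already persisted, skipping lines that fail to parse."""
    done: set[int] = set()
    for shard in result_dir.glob("shard-*.jsonl"):
        for line in shard.read_text().splitlines():
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a partial line left by an interrupted write
    return done


async def evaluate(sample_id: int) -> dict:
    """Stand-in for inference plus scoring; latency varies per sample."""
    await asyncio.sleep(0.01 * (sample_id % 3))
    return {"id": sample_id, "score": sample_id % 2}


async def run(result_dir: Path, sample_ids: list[int], limit: int = 8) -> int:
    """Evaluate pending samples concurrently; return how many were run."""
    result_dir.mkdir(parents=True, exist_ok=True)
    done = load_done_ids(result_dir)
    pending = [i for i in sample_ids if i not in done]
    sem = asyncio.Semaphore(limit)  # cap on in-flight samples

    async def worker(sample_id: int) -> dict:
        async with sem:
            return await evaluate(sample_id)

    tasks = [asyncio.create_task(worker(i)) for i in pending]
    # as_completed yields results in finish order, so each record is
    # persisted immediately instead of waiting for the whole batch.
    for finished in asyncio.as_completed(tasks):
        record = await finished
        with shard_path(result_dir, record["id"]).open("a") as f:
            f.write(json.dumps(record) + "\n")
    return len(pending)


def demo() -> tuple[int, int]:
    """Run twice against the same directory; the second run resumes."""
    with tempfile.TemporaryDirectory() as tmp:
        first = asyncio.run(run(Path(tmp), list(range(10))))
        second = asyncio.run(run(Path(tmp), list(range(10))))
    return first, second
```

Calling `demo()` runs ten samples on the first pass and none on the second, because every id is recovered from the shard files.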

Installation

Requirements: Unix (Linux, macOS), Python ≥ 3.12, PDM (recommended) or pip

git clone https://github.com/scitix/sieval.git
cd sieval
pdm install # or: pip install -e .

Optional extras (per-benchmark dependencies):

pip install -e ".[math]" # AIME 2024/2025, MATH-500 (math-verify)
pip install -e ".[drop]" # DROP (numpy, scipy)
pip install -e ".[ifeval]" # IFEval (absl, langdetect, nltk, immutabledict)
pip install -e ".[t-eval]" # T-Eval (numpy, sentence-transformers)
pip install -e ".[math,drop,ifeval,t-eval]" # all extras at once

Quick Start

Dataset paths below use HuggingFace repo ids for HF-sourced datasets (e.g. HuggingFaceH4/aime_2024) and ${SIEVAL_DATA_DIR}/<name> for URL-sourced datasets. Set SIEVAL_DATA_DIR (default ~/.sieval/data) before running any command that resolves a URL-sourced dataset.

Start from an example — two-step flow:

cp examples/quickstart.yaml eval.yaml
$EDITOR eval.yaml # set model checkpoint + container image
# Step 1: stage the data
sieval dataset download aime_2024
# Step 2: run eval
sieval eval eval.yaml

See examples/README.md for more scenarios (leaderboard, recipe overrides) and examples/hardware/ for hardware-pinned reference configs.

Discover tasks / datasets:

sieval dataset list # registered datasets + licenses + download status
sieval task list --domain Mathematics # filter tasks by domain
sieval dataset show aime_2024 # dataset detail, incl. the value to paste into the YAML "path:" field
sieval dataset download aime_2024 # stage data into $SIEVAL_DATA_DIR

All-in-one (launch inference, evaluate, cleanup — recommended entry point):

sieval run config.yaml
sieval run config.yaml --resume

Evaluate against an already-online endpoint:

sieval eval leaderboards/sft_fast_202511.yaml --model gpt-4o
# `sieval eval` is a shortcut for the underlying resource verb:
sieval leaderboard run leaderboards/sft_fast_202511.yaml --model gpt-4o

Inference management:

sieval infer start /path/to/Qwen3-8B # auto-resolve and launch
sieval infer list # show running services
sieval infer logs qwen3-8b -f # stream engine logs
sieval infer stop qwen3-8b # graceful shutdown

Programmatic usage:

import anyio

from sieval.core.models import ChatModel
from sieval.core.runners import TaskRunner, TaskRunnerConfig
from sieval.datasets import MMLUDataset
from sieval.tasks import MMLUZeroShotGenTask


async def main():
    dataset = MMLUDataset("cais/mmlu")
    model = ChatModel("gpt-4o", max_retries=3, concurrency_limit=128)
    task = MMLUZeroShotGenTask(dataset=dataset, model=model)
    runner = TaskRunner(
        task=task,
        config=TaskRunnerConfig(result_dir="./outputs/mmlu", auto_resume=True),
    )
    results = await runner.arun()
    print(results)


anyio.run(main)
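The Features list describes tasks as typed stages (preprocess, infer, postprocess, feedback) with an iterative feedback loop. The sketch below shows one way such a pipeline could be typed and driven. It is an illustration only: the `Pipeline` class, its fields, and `max_turns` are hypothetical names, not part of the SiEval API, and the stages are synchronous here for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

S = TypeVar("S")  # raw sample
P = TypeVar("P")  # prompt sent to the model
R = TypeVar("R")  # raw model output
A = TypeVar("A")  # parsed answer


@dataclass
class Pipeline(Generic[S, P, R, A]):
    """Hypothetical four-stage task pipeline with a bounded feedback loop."""

    preprocess: Callable[[S], P]
    infer: Callable[[P], R]
    postprocess: Callable[[R], A]
    # Returns a follow-up prompt, or None to accept the current answer.
    feedback: Callable[[S, A], Optional[P]]
    max_turns: int = 3  # must be at least 1

    def run(self, sample: S) -> A:
        prompt = self.preprocess(sample)
        for _ in range(self.max_turns):
            answer = self.postprocess(self.infer(prompt))
            follow_up = self.feedback(sample, answer)
            if follow_up is None:
                break  # answer accepted, stop iterating
            prompt = follow_up
        return answer


def fake_infer(prompt: str) -> str:
    """Toy model: answers wrongly unless the prompt asks for a retry."""
    return "4" if "retry" in prompt else "5"


pipe = Pipeline(
    preprocess=lambda s: f"{s[0]}+{s[1]}=?",
    infer=fake_infer,
    postprocess=int,
    feedback=lambda s, a: None if a == sum(s) else f"retry: {s[0]}+{s[1]}=?",
)
```

With this toy model, `pipe.run((2, 2))` takes two turns: the first answer is rejected by the feedback stage, and the retry prompt produces the accepted answer.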

Documentation

Configuration Guide — YAML format, task pipeline, model resource pool, anomaly detection
Concurrency Control — four-level concurrency model
Profiling & Observability — stage timing, I/O metrics, token tracking
Inference Management — full infer subcommand reference

Contributing

See CONTRIBUTING.md for development setup, project architecture, code conventions, and the PR process.

License

Apache License 2.0 — see LICENSE for details.

Citation

@software{sieval2026,
 title = {SiEval: Asynchronous Streaming Evaluation Framework},
 author = {{ScitiX}},
 year = {2026},
 url = {https://github.com/scitix/sieval}
}
