Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

scitix/sieval

Repository files navigation

SiEval

SiEval is a model delivery quality verification system with an asynchronous streaming evaluation engine, iterative feedback loop, and resilient sharded persistence. It verifies the entire model delivery pipeline — training → conversion → inference → evaluation.

Features

  • Asynchronous streaming — process samples concurrently without waiting for batch completion
  • Iterative feedback loop — multi-turn evaluation with feedback
  • Resilient persistence — sharded, append-only storage for crash recovery
  • 11 mainstream benchmarks — AIME 2024/2025, DROP, GPQA-Diamond, HumanEval, IFEval, LiveCodeBench, MATH-500, MMLU, MMLU-Pro, T-Eval (math, code, reasoning, knowledge, instruction-following, tool-use)
  • Type-safe pipelines — fully typed task stages (preprocess → infer → postprocess → feedback)
  • YAML-based configuration — batch evaluation with model derivation and quota allocation
  • Inference orchestration — recipe-driven inference with auto-resolve and backend abstraction (vLLM, SGLang)
  • Anomaly detection — built-in detection rules for output quality, performance, and correctness
  • Profiling — stage timing, I/O metrics, and token usage tracking

Installation

Requirements: Unix (Linux, macOS), Python ≥ 3.12, PDM (recommended) or pip

git clone https://github.com/scitix/sieval.git
cd sieval
pdm install # or: pip install -e .

Optional extras (per-benchmark dependencies):

pip install -e ".[math]" # AIME 2024/2025, MATH-500 (math-verify)
pip install -e ".[drop]" # DROP (numpy, scipy)
pip install -e ".[ifeval]" # IFEval (absl, langdetect, nltk, immutabledict)
pip install -e ".[t-eval]" # T-Eval (numpy, sentence-transformers)
pip install -e ".[math,drop,ifeval,t-eval]" # all extras at once

Quick Start

Dataset paths below use HuggingFace repo ids for HF-sourced datasets (e.g. HuggingFaceH4/aime_2024) and ${SIEVAL_DATA_DIR}/<name> for URL-sourced datasets. Set SIEVAL_DATA_DIR (default ~/.sieval/data) before running any command that resolves a URL-sourced dataset.

Start from an example — two-step flow:

cp examples/quickstart.yaml eval.yaml
$EDITOR eval.yaml # set model checkpoint + container image
# Step 1: stage the data
sieval dataset download aime_2024
# Step 2: run eval
sieval eval eval.yaml

See examples/README.md for more scenarios (leaderboard, recipe overrides) and examples/hardware/ for hardware-pinned reference configs.

Discover tasks / datasets:

sieval dataset list # registered datasets + licenses + download status
sieval task list --domain Mathematics # filter tasks by domain
sieval dataset show aime_2024 # dataset detail, incl. the YAML path: to paste
sieval dataset download aime_2024 # stage data into $SIEVAL_DATA_DIR

All-in-one (launch inference, evaluate, cleanup — recommended entry point):

sieval run config.yaml
sieval run config.yaml --resume

Evaluate against an already-online endpoint:

sieval eval leaderboards/sft_fast_202511.yaml --model gpt-4o
# `sieval eval` is a shortcut for the underlying resource verb:
sieval leaderboard run leaderboards/sft_fast_202511.yaml --model gpt-4o

Inference management:

sieval infer start /path/to/Qwen3-8B # auto-resolve and launch
sieval infer list # show running services
sieval infer logs qwen3-8b -f # stream engine logs
sieval infer stop qwen3-8b # graceful shutdown

Programmatic usage:

import anyio
from sieval.datasets import MMLUDataset
from sieval.tasks import MMLUZeroShotGenTask
from sieval.core.models import ChatModel
from sieval.core.runners import TaskRunner, TaskRunnerConfig
async def main():
 dataset = MMLUDataset("cais/mmlu")
 model = ChatModel("gpt-4o", max_retries=3, concurrency_limit=128)
 task = MMLUZeroShotGenTask(dataset=dataset, model=model)
 runner = TaskRunner(
 task=task,
 config=TaskRunnerConfig(result_dir="./outputs/mmlu", auto_resume=True),
 )
 results = await runner.arun()
 print(results)
anyio.run(main)

Documentation

Contributing

See CONTRIBUTING.md for development setup, project architecture, code conventions, and the PR process.

License

Apache License 2.0 — see LICENSE for details.

Citation

@software{sieval2026,
 title = {SiEval: Asynchronous Streaming Evaluation Framework},
 author = {{ScitiX}},
 year = {2026},
 url = {https://github.com/scitix/sieval}
}

About

SiEval is a model delivery quality verification system.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

Contributors

Languages

AltStyle によって変換されたページ (->オリジナル) /