Name	Name	Last commit message	Last commit date
Latest commit History 19 Commits
bench_loop	bench_loop
operators	operators
scripts	scripts
tasks	tasks
tests	tests
.gitignore	.gitignore
AUDIT-2026年04月24日.md	AUDIT-2026年04月24日.md
AUDIT-2026年05月12日.md	AUDIT-2026年05月12日.md
AUDIT-FIX-PLAN-2026年05月11日.md	AUDIT-FIX-PLAN-2026年05月11日.md
LICENSE	LICENSE
README.md	README.md
RELEASE.md	RELEASE.md
SPEC.md	SPEC.md
pyproject.toml	pyproject.toml
uv.lock	uv.lock

BenchLoop

Benchmark local LLMs by what actually matters.

BenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware or cloud providers. It scores models across seven repeatable suites — quality, speed, reliability, agentic tool use, coding, instruction following — and gives you receipts: per-task outputs, latency, token counts, machine info, scores.

No accounts, no telemetry. Local models need no API keys; cloud providers use standard OpenAI-compatible auth. Your model, your machine (or your provider), your numbers.

$ benchloop run --model qwen3:8b --suites speed,toolcall,agent
... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...
Overall 73.4 ████████░░
Quality 73.6 ████████░░
Speed 78.9 █████████░
Agent 96.9 █████████▌

Published runs live at https://bench-loop.com/leaderboard. Every completed local benchmark auto-publishes there.

Why

Hosted LLM leaderboards answer "which model wins on a server farm someone else paid for?" BenchLoop answers "which model + harness + hardware combination actually works for me right now?" — the question you have when picking a local stack.

It is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say "qwen3:8b scored 89 on my 4090", anyone can install BenchLoop and verify it.

Install

pipx (recommended)

pipx install benchloop-cli
benchloop --version

The PyPI distribution is named benchloop-cli (the bare benchloop name was taken by an unrelated dataset library). The installed commands are still benchloop and bench-loop.

pip

pip install benchloop-cli

From source

git clone https://github.com/outsourc-e/bench-loop
cd bench-loop
pip install -e .

Run your first benchmark

Make sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:

Ollama at http://localhost:11434 (default)
LM Studio at http://localhost:1234 (--provider openai_compat)
MLX / Osaurus at http://localhost:8000 (--provider openai_compat)
vLLM, Jan, llama-server, etc.

Then:

benchloop run \
 --model qwen3:8b \
 --endpoint http://localhost:11434 \
 --provider ollama

This runs every default suite, scores them, prints a console report, and persists the full run to ~/.bench-loop/runs/.

Run a subset

benchloop run --model qwen3:8b --suites speed,agent

Different prompting harness

Same model, four ways to talk to it:

benchloop run --model qwen3:8b --harness raw # native tool calling
benchloop run --model qwen3:8b --harness hermes # <tool_call>{...}</tool_call>
benchloop run --model qwen3:8b --harness qwen # <function_call>{...}</function_call>
benchloop run --model qwen3:8b --harness pi # <think>...</think> + Hermes tags

Stamp custom hardware (e.g. when benchmarking through a tunnel)

benchloop run \
 --model qwen3:8b \
 --endpoint http://localhost:11435 \
 --hardware "NVIDIA RTX 4090 24GB" \
 --gpu "NVIDIA RTX 4090" \
 --gpu-memory-gb 24

Benchmark cloud/remote APIs

Works with any OpenAI-compatible endpoint — DashScope, OpenRouter, Together, OpenAI, vLLM with auth, sglang, etc.

# Via environment variable
export OPENAI_API_KEY="sk-..."
benchloop run \
 --model qwen3.7-max \
 --provider openai_compat \
 --endpoint https://dashscope-intl.aliyuncs.com/compatible-mode \
 --remote
# Or inline
benchloop run \
 --model gpt-4o \
 --provider openai_compat \
 --endpoint https://api.openai.com/v1 \
 --api-key sk-... \
 --remote

The --remote flag (auto-detected for non-localhost endpoints) switches to cloud-aware scoring:

Speed uses streaming TTFT (time-to-first-token) + effective content tok/s
Overall = 0.50·quality + 0.25·speed + 0.25·reliability (vs local's 0.55/0.20/0.25)
Reasoning models: content tok/s excludes internal thinking tokens

API key auth

Required for vLLM, sglang, and most cloud providers. Two ways to provide it:

# 1. Environment variable (recommended)
export OPENAI_API_KEY="your-key-here"
benchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000
# 2. CLI flag
benchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000 --api-key your-key-here

The CLI flag takes precedence over the env var. For Ollama and local providers without auth, neither is needed.

Launch the local dashboard

v0.2.0+ ships the full FastAPI + React dashboard inside the wheel. After pipx install benchloop-cli:

benchloop dashboard
# → open http://127.0.0.1:8877

Need it to survive browser/terminal churn? Print a service template instead of keeping the dashboard tied to one shell:

benchloop dashboard --service-template launchd
benchloop dashboard --service-template systemd
benchloop dashboard --service-template windows-task

This serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX/Osaurus, vLLM, Jan).

For hot-reload development against a clone of bench-loop-web:

benchloop dashboard --dev

Suites

Suite	What it scores
`speed`	Latency, throughput, TTFT, generation tok/s across short/medium/long contexts
`toolcall`	Structured tool-call correctness across realistic tasks (weather, stocks, email, search)
`coding`	Executable Python tasks verified in a sandboxed subprocess (10s timeout)
`dataextract`	JSON / structured extraction from messy natural language
`instructfollow`	Constraint following, formatting, exactness
`reasonmath`	Small reasoning + math tasks with deterministic checks
`agent`	Multi-turn agentic tool use. BenchLoop drives a real loop: model emits a tool call, BenchLoop executes it locally, feeds the result back, model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage.

Scoring

Local: Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
Cloud: Overall = 0.50 · quality + 0.25 · speed + 0.25 · reliability (with streaming speed data)
 Overall = 0.65 · quality + 0.35 · reliability (no speed data)

Quality = mean of non-speed suite scores (size-fair).
Speed (local) = 12.54 · log2(tok/s) + 0.9, clamped to 0–100.
Speed (cloud) = 0.60 · TTFT_score + 0.40 · tok/s_score, where TTFT uses exponential decay (200ms→100, 2000ms→40) and tok/s uses a log curve calibrated for 20-150 tok/s.
Reliability = pass rate across all tasks.
Agent = correct_final + efficient + no_hallucinated_tools + all_required_called, 25 pts each, averaged across tasks.

Local web app

A FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:

benchloop dashboard # starts the local web app on :5180

Tabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.

Publish a run

Every completed benchmark auto-publishes to https://bench-loop.com/leaderboard via https://api.bench-loop.com/submit. Runs are deduped by (machine_id, run_id) so the same run from the same machine won't be double-counted.

Opt out:

export BENCHLOOP_NO_SUBMIT=1

You can still manually export a snapshot for sharing / archiving:

benchloop export --output my-runs.json

Architecture

bench-loop/ ← this repo, the CLI + suites + scorers
 bench_loop/
 cli.py ← `benchloop` entrypoint
 suites/ ← speed, toolcall, coding, agent, ...
 harness.py ← raw / hermes / qwen / pi adapters
 providers/ ← ollama, openai_compat
 runner/orchestrator.py ← drives suites + harnesses
 tasks/ ← frozen task YAML fixtures
bench-loop-web/ ← the web app (separate repo)
 api/ ← FastAPI wrapper around bench_loop
 ui/ ← local dashboard
 site/ ← public bench-loop.com static site

Status

BenchLoop is v0.2 beta. The benchmark surface, scoring, web app, agent loop, four harnesses, and cloud provider support all work end-to-end. Stuff still on the roadmap:

~~(削除) Streaming TTFT for OpenAI-compatible providers (削除ここまで)~~ ✅ (v0.2.3+ with --remote)
Bigger task fixtures (each suite is intentionally small and frozen for v1)
Hosted submission flow for community runs
Cloud-specific leaderboard on bench-loop.com (filter by local vs remote)
More provider adapters (TGI, Bedrock, etc. if there's demand)

License

MIT. See LICENSE.

Folders and files

Latest commit

History

Repository files navigation

BenchLoop

Why

Install

pipx (recommended)

pip

From source

Run your first benchmark

Run a subset

Different prompting harness

Stamp custom hardware (e.g. when benchmarking through a tunnel)

Benchmark cloud/remote APIs

API key auth

Launch the local dashboard

Suites

Scoring

Local web app

Publish a run

Architecture

Status

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages