Score any assistant's answers against your own knowledge base and policies — combining deterministic ML/NLP metrics with a multi-provider LLM judge, so you know exactly what to fix and why.
License: Apache 2.0 Python 3.11+ Next.js 14 FastAPI Local-first PRs welcome
Features • How it works • Quickstart • Docs • Architecture • Providers • License
EvalBot — evaluate a chatbot response against ground truth with hybrid ML + AI scoring
Most chatbot QA is vibes-based: someone reads a few answers and decides whether they "feel right." EvalBot replaces that with a repeatable, explainable score.
For each chatbot under test you create a Bot Project and give it two things:
- Documents — the same corpus your bot is grounded on (PDF / MD / TXT / DOCX), chunked and embedded into a local vector store.
- Guideline files — free-form Markdown describing your policies ("never reveal another user's data," "always refuse competitor-pricing questions," tone, required disclaimers, refusal patterns). No schema — the AI judge reads them verbatim.
Then you paste a (question, chatbot answer) pair. EvalBot generates the correct answer from your docs + guidelines, scores the bot's answer against it from two independent angles, and tells you precisely where it fell short — including any guideline violations, with the offending sentence quoted.
It runs entirely on your machine. No Docker, no account, no data leaving your laptop — and a fully offline mode via Ollama.
Short and to the point — read in a couple of minutes:
- How it works — the scoring pipeline + user workflow, with flowcharts.
- Usage guide — evaluate a bot, step by step.
- Glossary — every term in plain English.
- How EvalBot compares — vs promptfoo, deepeval, garak & friends.
| 🎯 Hybrid scoring | A deterministic ML/NLP engine (semantic + lexical similarity, completeness, accuracy, relevance, readability, toxicity, numeric/factual consistency) runs alongside an LLM-as-judge. Engine disagreement is itself a tracked signal. |
| 📚 Grounded ground truth | Reference answers are generated on the fly from the top-k retrieved chunks of your documents — not from the model's general knowledge — and cached per question. |
| 🛡️ Guideline compliance | The judge reads your raw policy files and returns each violation with the offending span quoted and a one-line reason. |
| 🔌 Multi-provider judges | Claude, Gemini, OpenAI, and Ollama (offline). Switch per evaluation, or run several and compare inter-judge agreement. |
| 🗂️ Datasets & runs | Bundle questions into datasets, run them as a batch, and track results across runs with a pass/fail heatmap. |
| 💬 Multi-turn conversations | Evaluate full multi-step exchanges, not just single Q&A pairs. |
| ✅ Custom checks | Define your own pass/fail assertions on top of the built-in metrics. |
| 🔐 PII detection | A built-in engine flags personal data leakage in responses. |
| 📈 Analytics | Pass rate, score trends, ML-vs-AI agreement, performance by category & dimension, and CSV export. |
| 🧪 Seed question library | Ships with Security, Harmfulness, Fact-Check, and Hallucination probes. Save your own per project. |
| 💻 Local-first | FastAPI + Next.js, SQLite + embedded Chroma. Zero infra to stand up. |
EvalBot analytics dashboard — pass rate, evaluator agreement, and performance by category
When you submit a (question, chatbot response) pair for a project:
- Retrieve — EvalBot pulls the top-k relevant chunks from that project's documents.
- Reference — an LLM, grounded on those chunks + the full text of your guideline files, drafts the correct answer (cached per question).
- Score — two evaluators independently grade the triple
(question, response, reference):- ML/NLP engine — fast, deterministic, no API calls.
- AI judge — per-dimension scores, a plain-language rationale, and guideline-violation findings.
- Combine — the two are merged into a Combined Score, and the result is stored for history and analytics.
combined = 0.35·similarity + 0.25·accuracy + 0.25·completeness + 0.10·relevance + 0.05·readability
final_score = (ml_score + ai_score) / 2 # when both engines run
| Dimension | Weight |
|---|---|
| Similarity | 35% |
| Accuracy | 25% |
| Completeness | 25% |
| Relevance | 10% |
| Readability | 5% |
Weights are configurable per run. Default pass threshold: final_score ≥ 75.
EvalBot runs as two local processes — a FastAPI backend and a Next.js frontend.
- Python 3.11+ and uv
- Node 20+ and pnpm 9+ (
npm i -g pnpm) - (Optional) Ollama — for a fully offline AI judge
cd server cp .env.example .env # add a key for whichever provider you'll use uv sync # install deps into ./server/.venv uv run uvicorn app.main:app --reload --port 8000
API → http://localhost:8000 · interactive docs → http://localhost:8000/docs
cd client pnpm install pnpm dev # http://localhost:3000
curl http://localhost:8000/api/health # -> {"status":"ok"} curl http://localhost:8000/api/questions # -> seed question library
First run downloads the MiniLM embedding model (~80 MB) into the local HuggingFace cache; subsequent runs are instant. All state (SQLite DB, Chroma index, uploaded docs & guidelines) lives under
server/data/, which is gitignored.
cd server uv run python -m app.seed_demo # creates a sample project + datasets uv run python -m app.seed_demo_activity # adds historical runs for the analytics views
A pluggable provider layer (server/app/engines/judges/) lets any supported model act as the judge — selectable per evaluation in the UI and configurable via env.
| Provider | Env | Example models |
|---|---|---|
| Anthropic Claude | ANTHROPIC_API_KEY |
claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5 |
| Google Gemini | GEMINI_API_KEY |
gemini-1.5-pro, gemini-1.5-flash |
| OpenAI | OPENAI_API_KEY |
gpt-4o, gpt-4o-mini |
| Ollama (offline) | OLLAMA_BASE_URL |
llama3.1, qwen2.5, mistral |
Set a default with AI_JUDGE_PROVIDER; override per request from the Evaluate page.
Run fully offline with Ollama
brew install ollama # macOS (or download from ollama.com) ollama serve ollama pull llama3.1 # ~4.7 GB
Then in server/.env:
AI_JUDGE_PROVIDER=ollama OLLAMA_BASE_URL=http://localhost:11434 OLLAMA_MODEL=llama3.1
Restart the server and confirm Ollama is reachable:
curl http://localhost:11434/api/tags
Ollama is free and private, but quality trails frontier cloud models. For code-heavy or long-context evaluation try qwen2.5-coder or llama3.1:70b (~40 GB RAM).
Frontend — Next.js 14 (App Router) · React 18 · TypeScript · Tailwind CSS · Recharts · Framer Motion · TanStack Query
Backend — FastAPI · SQLModel/SQLite · Chroma (embedded vector store) · sentence-transformers
ML/NLP — sentence-transformers (MiniLM) · rapidfuzz · textstat · regex/entity consistency
AI judges — Anthropic · Google · OpenAI · Ollama, behind one JudgeProvider interface
evalbot/
├── client/ # Next.js app
│ ├── app/
│ │ ├── projects/[id]/ # overview · evaluate · datasets · checks · analytics · activity · runs
│ │ └── evaluations/[id]/ # full result breakdown
│ ├── components/ # score tiles, metric bars, heatmap, charts, dialogs
│ └── lib/ # api client, formatting, thresholds
├── server/ # FastAPI app
│ ├── app/
│ │ ├── api/ # projects, documents, guidelines, evaluate, datasets,
│ │ │ # conversations, custom_checks, analytics, questions, ...
│ │ ├── engines/
│ │ │ ├── rag.py # retrieval + grounded reference generation
│ │ │ ├── ai.py # AI-judge dispatcher
│ │ │ ├── judges/ # anthropic · gemini · openai · ollama
│ │ │ ├── rules.py # ML/NLP scoring
│ │ │ ├── pii.py # PII detection
│ │ │ └── web.py # web-context fetcher
│ │ ├── models.py # SQLModel tables
│ │ ├── scoring.py # weighted combination
│ │ ├── seed_*.py # demo data + seed question library
│ │ └── main.py
│ └── pyproject.toml
└── assets/ # screenshots
- Live chatbot connectors (evaluate against an endpoint instead of pasted answers)
- Severity-tagged guideline findings feeding the numeric score
- Scheduled / CI-triggered dataset runs with regression alerts
- Authentication & multi-user workspaces
Have an idea? Open an issue.
Issues and PRs are welcome — see CONTRIBUTING.md and the good first issue list. For larger changes, open an issue first to discuss direction. Keep PRs scoped and note what you verified.
EvalBot is licensed under the Apache License 2.0 — see LICENSE and NOTICE.
You're free to use, modify, and distribute it, including commercially. The license asks only that you keep the copyright and NOTICE attribution intact so the origin of the work stays clear; the "EvalBot" name and marks are not licensed for your own products (Apache-2.0 §6).
💡 If you adopt or build on EvalBot, we'd love to feature your logo and a short case study — see
NOTICEfor how to reach out. It's an open invitation, never a condition of the license.