@agi_atharv
AlignLab is a comprehensive alignment framework that provides researchers and practitioners with easy-to-use tools for aligning models across multiple dimensions: safety, truthfulness, bias, toxicity, and agentic robustness.
There is a lot more to add to AlignLab! Please work with us and contribute to the project.
```
alignlab/
├── packages/
│   ├── alignlab-core/     # Core Python library (datasets, runners, scoring, reports)
│   ├── alignlab-cli/      # CLI wrapping core functionality
│   ├── alignlab-evals/    # Thin adapters to lm-eval-harness, OpenAI Evals, JailbreakBench, HarmBench
│   ├── alignlab-guards/   # Guard models (Llama-Guard adapters), rule engines, ensembles
│   ├── alignlab-dash/     # Streamlit/FastAPI dashboard + report viewer
│   └── alignlab-agents/   # Sandboxed agent evals (tools, browsing-locked simulators)
├── benchmarks/            # YAML registry (sources, splits, metrics, licenses)
├── templates/             # Cookiecutter for new evals/benchmarks
├── docs/                  # MkDocs documentation
└── examples/              # Runnable notebooks & scripts
```
Every benchmark is a single YAML entry with fields for taxonomy, language coverage, scoring, and version pinning. Reproducible by default.
First-class adapters to:
- lm-evaluation-harness for general QA/MC/MT (EleutherAI)
- OpenAI Evals for templated eval flows & rubric judging
- JailbreakBench (attacks/defenses, scoring, leaderboard spec)
- HarmBench (automated red teaming + robust refusal)
Plug-in loaders for multilingual toxicity, truthfulness, and bias datasets.
Unified API to run Llama-Guard-3 (and alternatives) as pre/post-filters or judges, with ensemble and calibration support.
Safe, containerized tools and scenarios with metrics like ASR (attack success rate), query efficiency, and over-refusal.
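To make those metrics concrete, here is a minimal sketch of how they could be computed from per-episode records. The `Episode` schema and its field names are illustrative assumptions, not AlignLab's actual data model.

```python
# Illustrative metric computation; the Episode schema is hypothetical,
# not AlignLab's actual data model.
from dataclasses import dataclass


@dataclass
class Episode:
    is_attack: bool         # adversarial episode vs. benign control
    attack_succeeded: bool  # harmful behavior elicited despite defenses
    refused: bool           # model refused to respond
    num_queries: int        # attacker queries spent on this episode


def agentic_metrics(episodes: list[Episode]) -> dict:
    attacks = [e for e in episodes if e.is_attack]
    benign = [e for e in episodes if not e.is_attack]
    successes = [e for e in attacks if e.attack_succeeded]
    return {
        # Fraction of adversarial episodes where the attack landed.
        "asr": len(successes) / max(len(attacks), 1),
        # Average attacker queries spent on a successful attack.
        "query_efficiency": sum(e.num_queries for e in successes) / max(len(successes), 1),
        # Fraction of benign requests the model refused anyway.
        "over_refusal": sum(e.refused for e in benign) / max(len(benign), 1),
    }
```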
One command yields a NeurIPS-style PDF/HTML with tables, CIs, error-buckets, and taxonomy-level breakdowns.
```bash
# Install uv for dependency management
pipx install uv

# Create virtual environment and install packages
uv venv
uv pip install -e packages/alignlab-core -e packages/alignlab-cli \
  -e packages/alignlab-evals -e packages/alignlab-guards \
  -e packages/alignlab-agents -e packages/alignlab-dash
```
```bash
# Run comprehensive safety evaluation
alignlab eval run --suite alignlab:safety_core_v1 \
  --model meta-llama/Llama-3.1-8B-Instruct --provider hf \
  --guards llama_guard_3 --max-samples 200 \
  --report out/safety_core_v1

# Generate reports
alignlab report build out/safety_core_v1 --format html,pdf
```
```bash
# List available benchmarks and models
alignlab benchmarks ls --filter safety,multilingual
alignlab models ls

# Run a single benchmark
alignlab eval run truthfulqa --split validation --judge llm_rubric

# Agentic robustness (jailbreaks)
alignlab jailbreak run --suite jb_default_v1 --defense none --report out/jb
```
- Harm/Risk: HarmBench, JailbreakBench, UltraSafety
- Toxicity: RealToxicityPrompts, PolygloToxicityPrompts
- Truthfulness: TruthfulQA + multilingual extensions
- Bias/Fairness: BBQ, CrowS-Pairs, StereoSet, Multilingual HolisticBias
- Guards: Optional pre/post Llama-Guard-3
- JailbreakBench canonical config
- HarmBench auto red teaming subset
- Prompt-injection battery
```python
from alignlab.core import EvalRunner, load_benchmark

# Load and run a benchmark
bench = load_benchmark("truthfulqa")
runner = EvalRunner(model="meta-llama/Llama-3.1-8B-Instruct", provider="hf")
result = runner.run(bench, split="validation", judge="exact_match|llm_rubric")
result.to_report("reports/run_001")
```
```python
from alignlab.guards import LlamaGuard, RuleGuard, EnsembleGuard

guard = EnsembleGuard([
    LlamaGuard("meta-llama/Llama-Guard-3-8B"),
    RuleGuard.from_yaml("mlc_taxonomy.yaml"),
])
safe_out = guard.wrap(model).generate(prompt)
```
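Putting these pieces together, the components of the `safety_core_v1` suite listed above can also be composed programmatically. This is only a rough sketch: the registry ids other than `truthfulqa`, and the shared split/judge settings, are assumptions for illustration.

```python
from alignlab.core import EvalRunner, load_benchmark

# Hypothetical registry ids for a few suite components; only "truthfulqa"
# is confirmed by the examples in this README.
SUITE = ["harmbench", "jailbreakbench", "truthfulqa", "bbq"]

runner = EvalRunner(model="meta-llama/Llama-3.1-8B-Instruct", provider="hf")
for bench_id in SUITE:
    bench = load_benchmark(bench_id)
    # Splits and judges differ per benchmark in practice; fixed here for brevity.
    result = runner.run(bench, split="validation", judge="llm_rubric")
    result.to_report(f"out/safety_core_v1/{bench_id}")
```

The CLI invocation shown earlier (`alignlab eval run --suite alignlab:safety_core_v1 ...`) is the supported way to run the full suite; this loop only shows how the core primitives compose.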
- HarmBench – Automated red teaming + refusal robustness
- JailbreakBench – NeurIPS'24 D&B artifacts, threat model, scoring
- UltraSafety – 3k harmful prompts with paired jailbreak prompts
- RealToxicityPrompts – Canonical 100k prompts
- PolygloToxicityPrompts – 425k prompts, 17 languages
- TruthfulQA – 817 Qs, 38 categories with MC/binary variants
- BBQ, CrowS-Pairs, StereoSet, Multilingual HolisticBias
Create a YAML file in benchmarks/:
```yaml
# benchmarks/truthfulqa.yaml
id: truthfulqa
source:
  type: huggingface
  repo: "sylinrl/TruthfulQA"
  splits: [validation]
task: freeform_qa
judges:
  - exact_match
  - llm_rubric: {model: "gpt-4o-mini", rubric: "truthfulqa_v1"}
metrics: [truthful, informative, truthful_informative]
taxonomy: [truthfulness]
license: "MIT"
version: "2025-01-15"
```
- Per-taxonomy dashboards with macro/micro scores and CIs
- Guard deltas and hazard-class confusion matrices
- Multilingual heatmaps
- Exportable bundles (JSONL, PDF, HTML)
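As background on where the confidence intervals come from, here is a generic percentile-bootstrap sketch over per-example scores; it is illustrative only and does not describe AlignLab's internal reporting code.

```python
import random


def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Example: 95% CI for a benchmark's mean pass rate (0/1 per-example scores).
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(bootstrap_ci(scores))
```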
Thanks for your interest in contributing! To keep the project stable and manageable, please follow these rules:
- Fork the repository (do not create branches directly in this repo).
- Create a branch in your fork using the naming convention below.
- Make your changes and ensure tests pass.
- Open a Pull Request (PR) into the `dev` branch (or `main` if no dev branch exists).
- Wait for CI to pass and maintainers to review.
Branches must follow these patterns:
- `feat/<short-description>` – New features
- `fix/<short-description>` – Bug fixes
- `docs/<short-description>` – Documentation changes
- `test/<short-description>` – Testing-related changes
✅ Examples:

- `feat/add-user-auth`
- `fix/navbar-overlap`
- `docs/update-readme`
❌ Bad examples:

- `patch1`
- `update-stuff`
- Run all tests locally and ensure they pass.
- Follow code style guidelines (`npm run lint` or `make lint`).
- Write clear commit messages.
- PR title should match the branch type (`feat: ...`, `fix: ...`, `docs: ...`).
- `main` is protected: no direct commits allowed.
- All code must be merged via Pull Requests.
- PRs require review approval + passing checks.
Be respectful and constructive.
See the docs/ directory for comprehensive documentation, including:
- API reference
- Benchmark development guide
- Guard model integration
- Agent evaluation setup
MIT License - see LICENSE file for details.