
AlignLab, by OpenAlign

@agi_atharv

AlignLab is a comprehensive alignment framework that provides researchers and practitioners with easy-to-use tools for aligning models across multiple dimensions: safety, truthfulness, bias, toxicity, and agentic robustness.

There is a lot more to add to AlignLab! Please work with us and contribute to the project.

πŸ—οΈ Repository Layout (Monorepo)

alignlab/
β”œβ”€β”€ packages/
β”‚   β”œβ”€β”€ alignlab-core/    # Core Python library (datasets, runners, scoring, reports)
β”‚   β”œβ”€β”€ alignlab-cli/     # CLI wrapping core functionality
β”‚   β”œβ”€β”€ alignlab-evals/   # Thin adapters to lm-eval-harness, OpenAI Evals, JailbreakBench, HarmBench
β”‚   β”œβ”€β”€ alignlab-guards/  # Guard models (Llama-Guard adapters), rule engines, ensembles
β”‚   β”œβ”€β”€ alignlab-dash/    # Streamlit/FastAPI dashboard + report viewer
β”‚   └── alignlab-agents/  # Sandboxed agent evals (tools, browsing-locked simulators)
β”œβ”€β”€ benchmarks/           # YAML registry (sources, splits, metrics, licenses)
β”œβ”€β”€ templates/            # Cookiecutter templates for new evals/benchmarks
β”œβ”€β”€ docs/                 # MkDocs documentation
└── examples/             # Runnable notebooks & scripts

🎯 Core Promises

Registry-First Design

Every benchmark is a single YAML entry with fields for taxonomy, language coverage, scoring, and version pinning. Reproducible by default.
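A complete example entry appears under "πŸ“ Adding a New Benchmark" below.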

Adapters, Not Reinvention

First-class adapters to:

  • lm-evaluation-harness for general QA/MC/MT (EleutherAI)
  • OpenAI Evals for templated eval flows & rubric judging
  • JailbreakBench (attacks/defenses, scoring, leaderboard spec)
  • HarmBench (automated red teaming + robust refusal)

Multilingual Support

Plug-in loaders for multilingual toxicity, truthfulness, and bias datasets.

Guard-Stack

Unified API to run Llama-Guard-3 (and alternatives) as pre/post-filters or judges, with ensemble and calibration support.

Agent Evals

Safe, containerized tools and scenarios with metrics like ASR (attack success rate), query efficiency, and over-refusal.
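For concreteness, here is a minimal sketch of how the ASR and query-efficiency metrics are typically computed. The AttackAttempt record and function names are hypothetical, not AlignLab's actual API:

# Hypothetical sketch, not AlignLab's implementation.
from dataclasses import dataclass

@dataclass
class AttackAttempt:
    queries_used: int  # number of model calls the attack consumed
    success: bool      # True if the attack elicited the target behavior

def attack_success_rate(attempts: list[AttackAttempt]) -> float:
    # ASR = successful attacks / total attacks
    return sum(a.success for a in attempts) / max(len(attempts), 1)

def mean_queries_to_success(attempts: list[AttackAttempt]) -> float:
    # Query efficiency: average queries among successful attacks only
    wins = [a.queries_used for a in attempts if a.success]
    return sum(wins) / len(wins) if wins else float("inf")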

Reports That Matter

One command yields a NeurIPS-style PDF/HTML report with tables, confidence intervals, error buckets, and taxonomy-level breakdowns.

πŸš€ Quick Start

Installation (Development)

# Install uv for dependency management
pipx install uv
# Create virtual environment and install packages
uv venv
uv pip install -e packages/alignlab-core -e packages/alignlab-cli \
 -e packages/alignlab-evals -e packages/alignlab-guards \
 -e packages/alignlab-agents -e packages/alignlab-dash
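If the editable installs succeed, the alignlab command from alignlab-cli should be available inside the virtual environment; running alignlab --help (assuming the CLI exposes the usual help flag) is a quick smoke test.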

Run a Full Safety Sweep

# Run comprehensive safety evaluation
alignlab eval run --suite alignlab:safety_core_v1 \
 --model meta-llama/Llama-3.1-8B-Instruct --provider hf \
 --guards llama_guard_3 --max-samples 200 \
 --report out/safety_core_v1
# Generate reports
alignlab report build out/safety_core_v1 --format html,pdf

CLI Examples

# List available benchmarks and models
alignlab benchmarks ls --filter safety,multilingual
alignlab models ls
# Run a single benchmark
alignlab eval run truthfulqa --split validation --judge llm_rubric
# Agentic robustness (jailbreaks)
alignlab jailbreak run --suite jb_default_v1 --defense none --report out/jb

πŸ“Š Prebuilt Suites

alignlab:safety_core_v1

  • Harm/Risk: HarmBench, JailbreakBench, UltraSafety
  • Toxicity: RealToxicityPrompts, PolygloToxicityPrompts
  • Truthfulness: TruthfulQA + multilingual extensions
  • Bias/Fairness: BBQ, CrowS-Pairs, StereoSet, Multilingual HolisticBias
  • Guards: Optional pre/post Llama-Guard-3

alignlab:agent_robustness_v1

  • JailbreakBench canonical config
  • HarmBench auto red teaming subset
  • Prompt-injection battery

πŸ”§ Key Modules & API

Core Usage

from alignlab.core import EvalRunner, load_benchmark
# Load and run a benchmark
bench = load_benchmark("truthfulqa")
runner = EvalRunner(model="meta-llama/Llama-3.1-8B-Instruct", provider="hf")
result = runner.run(bench, split="validation", judge="exact_match|llm_rubric")
result.to_report("reports/run_001")
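The judge string here combines the two judges declared in the benchmark's registry entry (see "πŸ“ Adding a New Benchmark" below); a single judge name also appears to be valid, as in the CLI example above (--judge llm_rubric).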

Guard Pipeline

from alignlab.guards import LlamaGuard, RuleGuard, EnsembleGuard
guard = EnsembleGuard([
 LlamaGuard("meta-llama/Llama-Guard-3-8B"), 
 RuleGuard.from_yaml("mlc_taxonomy.yaml")
])
safe_out = guard.wrap(model).generate(prompt)
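This ensemble pairs a model-based guard with a YAML rule engine; wrapping the model suggests the guards screen both the incoming prompt and the generated completion, matching the pre/post-filter roles described under Guard-Stack above.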

πŸ“‹ Benchmark Registry

Harm/Safety & Jailbreak

  • HarmBench – Automated red teaming + refusal robustness
  • JailbreakBench – NeurIPS 2024 Datasets & Benchmarks artifacts, threat model, scoring
  • UltraSafety – 3k harmful prompts with paired jailbreak prompts

Toxicity (Mono + Multi)

  • RealToxicityPrompts – Canonical 100k prompts
  • PolygloToxicityPrompts – 425k prompts, 17 languages

Truthfulness / Hallucination

  • TruthfulQA – 817 questions across 38 categories, with multiple-choice and binary variants

Bias/Fairness (Mono + Multi)

  • BBQ, CrowS-Pairs, StereoSet, Multilingual HolisticBias

πŸ“ Adding a New Benchmark

Create a YAML file in benchmarks/:

# benchmarks/truthfulqa.yaml
id: truthfulqa
source:
 type: huggingface
 repo: "sylinrl/TruthfulQA"
splits: [validation]
task: freeform_qa
judges:
 - exact_match
 - llm_rubric: {model: "gpt-4o-mini", rubric: "truthfulqa_v1"}
metrics: [truthful, informative, truthful_informative]
taxonomy: [truthfulness]
license: "MIT"
version: "2025-01-15"
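Once the YAML file is saved under benchmarks/, the entry should be loadable through the core API shown earlier. A minimal smoke test, assuming the registry is discovered automatically and that run() accepts a max_samples limit mirroring the CLI's --max-samples flag:

from alignlab.core import EvalRunner, load_benchmark

# Load the new entry by its registry id (the `id:` field above).
bench = load_benchmark("truthfulqa")

# Evaluate a handful of samples to validate splits, judges, and metrics.
runner = EvalRunner(model="meta-llama/Llama-3.1-8B-Instruct", provider="hf")
result = runner.run(bench, split="validation", judge="exact_match", max_samples=10)
result.to_report("reports/truthfulqa_smoke")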

πŸ“Š Reporting & UX

  • Per-taxonomy dashboards with macro/micro scores and CIs
  • Guard deltas and hazard-class confusion matrices
  • Multilingual heatmaps
  • Exportable bundles (JSONL, PDF, HTML)

🀝 Contributing

Contributing Guidelines

Thanks for your interest in contributing! To keep the project stable and manageable, please follow these rules:

πŸ”€ Workflow

  1. Fork the repository (do not create branches directly in this repo).
  2. Create a branch in your fork using the naming convention below.
  3. Make your changes and ensure tests pass.
  4. Open a Pull Request (PR) into the dev branch (or main if no dev branch exists).
  5. Wait for CI to pass and maintainers to review.

🌱 Branch Naming

Branches must follow these patterns:

  • feat/<short-description> β†’ New features
  • fix/<short-description> β†’ Bug fixes
  • docs/<short-description> β†’ Documentation changes
  • test/<short-description> β†’ Testing-related changes

βœ… Examples:

  • feat/add-user-auth
  • fix/navbar-overlap
  • docs/update-readme

❌ Bad examples:

  • patch1
  • update-stuff

βœ… Requirements Before PR

  • Run all tests locally and ensure they pass.
  • Follow the code style guidelines (e.g., make lint).
  • Write clear commit messages.
  • PR title should match branch type (feat: ..., fix: ..., docs: ...).

πŸ”’ Protected Branches

  • main is protected: no direct commits allowed.
  • All code must be merged via Pull Requests.
  • PRs require review approval + passing checks.

πŸ“œ Code of Conduct

Be respectful and constructive.

See the docs/ directory for comprehensive documentation, including:

  • API reference
  • Benchmark development guide
  • Guard model integration
  • Agent evaluation setup

πŸ“„ License

MIT License - see LICENSE file for details.
