Releases: OpenInterpretability/mechreward

v0.1.0 — Initial alpha release

15 Apr 00:57
@caiovicentino caiovicentino

Mechanistic interpretability as reward signal for RL training of LLMs.

Install

pip install mechreward

Highlights

  • FeatureReward — SAE feature activations as trajectory-level reward
  • CompositeReward — HERO-style stratified normalization combining outcome + feature rewards
  • HackingDetector + DualVerifier — Wilhelm-2603.04069 style anti-Goodhart framework
  • AdversarialSuite — 10 canned red-team prompts for reward robustness testing
  • MechRewardGRPOTrainer — drop-in TRL GRPO wrapper with hidden state capture
  • Reference catalogs for Gemma-2-9B (reasoning, confidence, and retrieval packs; the features are placeholders, so validate them before use)
  • 7 reference experiments in experiments/ covering baseline, mech-only, hybrid, SARM reproduction, CRL reproduction, adversarial suite, capability preservation
  • Outcome verifiers for GSM8K, MATH, HumanEval-style code, Python exec
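
To make the first two highlights concrete, here is a minimal conceptual sketch in plain NumPy. It does not use the mechreward API: the function names `feature_reward` and `stratified_combine`, the ReLU encoder form, the mean-over-tokens pooling, and the `eps` bound are all assumptions made for illustration. It shows one way SAE feature activations could be pooled into a trajectory-level score, and one possible reading of stratified normalization in which the feature score is rescaled within each outcome group so that it can only break ties among completions with the same outcome.

```python
import numpy as np

def feature_reward(hidden, W_enc, b_enc, feature_ids, weights):
    """Hypothetical trajectory-level reward from SAE feature activations.

    hidden: (T, d) hidden states for one completion
    W_enc:  (d, F) SAE encoder weights, b_enc: (F,) encoder bias
    """
    acts = np.maximum(hidden @ W_enc + b_enc, 0.0)   # (T, F) ReLU encoder (assumed form)
    per_feature = acts[:, feature_ids].mean(axis=0)  # pool over tokens
    return float(per_feature @ np.asarray(weights, dtype=float))

def stratified_combine(outcome, feature, eps=0.2):
    """Hypothetical stratified combination of outcome and feature rewards.

    Within each outcome group, min-max scale the feature reward to
    [-eps, eps] and add it to the outcome reward. With binary outcomes
    and eps < 0.5, every correct completion still outranks every
    incorrect one.
    """
    outcome = np.asarray(outcome, dtype=float)
    feature = np.asarray(feature, dtype=float)
    combined = outcome.copy()
    for value in np.unique(outcome):
        mask = outcome == value
        group = feature[mask]
        span = group.max() - group.min()
        if span > 0:  # leave the group untouched if all feature scores are equal
            scaled = (group - group.min()) / span    # map to [0, 1]
            combined[mask] += eps * (2.0 * scaled - 1.0)
    return combined

# Toy usage: two correct and two incorrect completions.
rewards = stratified_combine([1, 1, 0, 0], [0.2, 0.8, 0.5, 0.1])
```

Bounding the feature term inside each outcome group is the design choice that matters here: a policy cannot trade a wrong answer for high feature activations, which is the failure mode the anti-Goodhart components in this release are meant to detect.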

Status

Alpha. API subject to change. See RESEARCH.md for the scientific context and prior-art audit (SARM, SparseRM, CRL, YaPO, Wilhelm et al.).

Tests

35/35 unit tests passing. Ruff clean.

What's next

  • Validate placeholder features in the Gemma-2-9B reasoning pack against real data
  • Run experiment 3 (hybrid) on Gemma-2-9B + GSM8K
  • Ship adversarial hacking bench in CI