huyuelin/MetaProbe

MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents

MetaProbe Evidence Pipeline

Paper: MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents
Venue: Submitted to ICSE 2027
Status: Under review (double-blind)


Overview

MetaProbe is an auditable metamorphic evaluation framework that measures the evidence-calibrated relational consistency of repository-level coding agents. Unlike pass/fail evaluation on fixed benchmark suites, MetaProbe assesses whether an agent's repair behavior stays stable under invariant transformations, adapts under causal relocations, and ignores certified-irrelevant regions.

The framework introduces:

  • PEAR (Patch-Equivalence and Alternative-Repair): A semantic adjudication protocol that calibrates the automatic region oracle and bounds oracle error propagation.
  • CER (Certificate-supported Evidence Region): A multi-signal evidence combiner integrating witness-patch edits, coverage traces, dynamic slices, generated-test behavior, and mutation evidence.
  • Certificate Checker (L0-L5): A trusted verifier that assigns evidence levels to each transformed task, separating untrusted generation from trusted validation.
  • Independent Verification Loop: Blind dual-region human adjudication and CER-disjoint discriminative tests validate causal claims without self-referential evidence.

Key Results

We evaluate eight fully reproducible agents on 3,257 certificate-validated tasks across SWE-bench Verified, BugsInPy, and Defects4J.

| Agent | Model (Architecture) | Sem. Stability | Sem. Sensitivity | Distractor Error |
|---|---|---|---|---|
| Simple RAG | Qwen3-7B (7B dense) | 48.6% | 20.5% | 19.7% |
| SWE-agent + Qwen3 | Qwen3-32B (32.8B dense) | 55.7% | 25.4% | 15.3% |
| Aider + Qwen3 | Qwen3-32B (32.8B dense) | 57.9% | 27.6% | 14.1% |
| OpenHands + GLM-4.6 | GLM-4.6 (355B-A32B MoE) | 62.1% | 33.1% | 11.8% |
| AutoCodeRover + GLM-4.6 | GLM-4.6 (355B-A32B MoE) | 64.5% | 35.0% | 10.7% |
| Agentless + MiniMax-M2 | MiniMax-M2 (230B-A10B MoE) | 67.6% | 37.4% | 9.2% |
| OpenHands + Qwen3 | Qwen3-32B (32.8B dense) | 72.3% | 44.6% | 7.1% |
| OpenHands + DeepSeek-V3.2 | DeepSeek-V3.2 (671B-A37B MoE) | 74.1% | 46.7% | 6.5% |

Even the strongest agent (OpenHands + DeepSeek-V3.2, 74.1% semantic stability) falls 25.9 pp short of perfect semantic stability.
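
As a rough illustration of how the three metric columns could be derived from per-task outcomes, the sketch below assumes each transformed task produces a record with its family type and two boolean judgments. The `Outcome` record, its field names, and the metric definitions used here (share of semantically correct repairs on invariant tasks, the same share on causal tasks, and share of distractor tasks whose patch edits a certified-irrelevant region) are assumptions for illustration; the definitions in metaprobe/metrics.py may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-task record; field names are illustrative only."""
    family_type: str                  # "invariant", "causal", or "distractor"
    sem_ok: bool                      # repair judged semantically correct
    touched_distractor: bool = False  # patch edited a certified-irrelevant region


def rate(flags):
    """Fraction of True values; NaN when there is nothing to count."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(outcomes):
    """Group outcomes by family type and compute one rate per group."""
    invariant = [o for o in outcomes if o.family_type == "invariant"]
    causal = [o for o in outcomes if o.family_type == "causal"]
    distractor = [o for o in outcomes if o.family_type == "distractor"]
    return {
        "semantic_stability": rate(o.sem_ok for o in invariant),
        "semantic_sensitivity": rate(o.sem_ok for o in causal),
        "distractor_error": rate(o.touched_distractor for o in distractor),
    }


if __name__ == "__main__":
    demo = [
        Outcome("invariant", True),
        Outcome("invariant", False),
        Outcome("causal", True),
        Outcome("distractor", True, touched_distractor=True),
        Outcome("distractor", True),
    ]
    print(summarize(demo))
```

Under this reading, the gap reported above is simply 100% minus the best stability score in the table (100 - 74.1 = 25.9 pp).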


Architecture

Relation Semantics and Failure Modes

Figure 2. Geometric intuition behind MetaProbe's predicates: invariant stability, causal sensitivity, and distractor resistance define orthogonal diagnostic dimensions.

MetaProbe follows a generator-checker-adjudicator pipeline:

MRSpec Generator (untrusted)
 --> Certificate Builder (evidence collection)
 --> Certificate Checker (trusted, L0-L5 validation)
 --> Agent Execution (fixed protocol, Docker-isolated)
 --> PEAR Adjudication (oracle calibration)
 --> Independent Verification Loop
 --> Statistical Analyzer (bootstrap + mixed-effects)

Transformation Families

| Family | Type | Obligation |
|---|---|---|
| Issue paraphrase | Invariant | Repair should remain stable |
| Semantic refactoring | Invariant | Repair should remain stable |
| Equivalent test rewrite | Invariant | Repair should remain stable |
| Distractor insertion | Distractor | Agent should ignore certified-unreachable regions |
| Guard relocation | Causal | Agent should adapt to relocated fault evidence |
| API argument relocation | Causal | Agent should adapt to relocated fault evidence |
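
The family-to-obligation mapping above can be written down as a small lookup structure. This is a minimal sketch: the `FamilyType` enum, the two dictionaries, and `obligation_for` are hypothetical names chosen for illustration and are not taken from metaprobe/mr_spec.py.

```python
from enum import Enum


class FamilyType(Enum):
    """The three relation types named in the table (hypothetical enum)."""
    INVARIANT = "invariant"
    DISTRACTOR = "distractor"
    CAUSAL = "causal"


# Each of the six transformation families belongs to exactly one type.
FAMILIES = {
    "Issue paraphrase": FamilyType.INVARIANT,
    "Semantic refactoring": FamilyType.INVARIANT,
    "Equivalent test rewrite": FamilyType.INVARIANT,
    "Distractor insertion": FamilyType.DISTRACTOR,
    "Guard relocation": FamilyType.CAUSAL,
    "API argument relocation": FamilyType.CAUSAL,
}

# The obligation depends only on the type, not on the individual family.
OBLIGATIONS = {
    FamilyType.INVARIANT: "Repair should remain stable",
    FamilyType.DISTRACTOR: "Agent should ignore certified-unreachable regions",
    FamilyType.CAUSAL: "Agent should adapt to relocated fault evidence",
}


def obligation_for(family: str) -> str:
    """Return the obligation an agent must meet for a given family."""
    return OBLIGATIONS[FAMILIES[family]]


if __name__ == "__main__":
    for name in FAMILIES:
        print(f"{name}: {obligation_for(name)}")
```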

Certificate Yield and Audit Validity


Figure 4. Certificate funnel: 5,400 attempted transformations yield 3,257 L4+ validated tasks. Human audit confirms 94-98% validity across all families.
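
The funnel numbers in the caption imply an overall yield of about 60%. The sketch below shows that arithmetic and a simple "L4 or higher" filter. Representing certificate levels as integers 0 to 5 and the helper names `certificate_yield` and `keep_validated` are assumptions for illustration, not the checker's actual interface.

```python
def certificate_yield(attempted: int, validated: int) -> float:
    """Share of attempted transformations that end up validated."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return validated / attempted


def keep_validated(levels: dict, minimum: int = 4) -> list:
    """Keep task ids whose (hypothetical integer) level is at least `minimum`."""
    return [task_id for task_id, level in levels.items() if level >= minimum]


if __name__ == "__main__":
    # Numbers from the Figure 4 caption.
    print(f"{certificate_yield(5400, 3257):.1%}")  # prints 60.3%

    # Made-up task ids and levels, only to exercise the filter.
    example_levels = {"task-a": 5, "task-b": 3, "task-c": 4}
    print(keep_validated(example_levels))
```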


PEAR Oracle Calibration

PEAR Calibration Matrix

Figure 5. PEAR adjudication across patch categories. Among region-mismatch patches that pass tests, 20.6% are accepted as equivalent or plausible alternative repairs, confirming the necessity of semantic correction beyond naive region matching.
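
To make the statistic in the caption concrete, the sketch below shows one way the routing and the accepted share could be computed: patches that pass tests but miss the oracle region are sent to adjudication, and the share is the fraction of those labeled as equivalent or as a plausible alternative repair. The label strings and function names are hypothetical; the actual PEAR protocol in metaprobe/pear.py may be structured differently.

```python
from collections import Counter

# Hypothetical adjudication labels; only the first two count as accepted.
ACCEPTED_LABELS = ("equivalent", "alternative")


def needs_adjudication(passes_tests: bool, region_match: bool) -> bool:
    """Route a patch to semantic adjudication when it passes tests
    but edits a region other than the one the automatic oracle expects."""
    return passes_tests and not region_match


def accepted_share(labels) -> float:
    """Fraction of adjudicated patches accepted as equivalent or alternative."""
    labels = list(labels)
    if not labels:
        raise ValueError("no adjudicated patches")
    counts = Counter(labels)
    return sum(counts[label] for label in ACCEPTED_LABELS) / len(labels)


if __name__ == "__main__":
    # Made-up labels for five region-mismatch patches that pass tests.
    demo = ["equivalent", "alternative", "rejected", "rejected", "rejected"]
    print(accepted_share(demo))
```

A naive region-matching oracle would count every such patch as a failure, which is the error this kind of correction is meant to bound.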


Metamorphic Adequacy Dashboard

Adequacy Dashboard

Figure 3. Full adequacy picture: original pass rate does not determine semantic stability; causal pass leaves a stale-patch gap; certified distractors remain diagnostic.


MetaProbe-Guided Hardening

Hardening Lift Curve

Figure 6. Development-set failures guide scaffold hardening. Gains generalize to held-out repositories (+7.6 pp stability, +9.1 pp sensitivity), held-out transformation families, and cross-dataset transfer, with negligible impact on original pass rate (+0.5 pp).


Repository Structure

metaprobe/ # Core framework
 __init__.py
 mr_spec.py # MRSpec transformation definitions (6 families)
 certificate.py # Certificate checker (L0-L5 evidence validation)
 cer.py # CER: learned evidence-region combiner
 pear.py # PEAR: patch-equivalence adjudication protocol
 metrics.py # Semantic stability/sensitivity/distractor metrics
 generated_tests.py # Generated-test synthesis and disjointness checks
transforms/ # Transformation generators
 paraphrase.py # Issue-preserving paraphrase (invariant)
 refactor.py # Semantic refactoring (invariant)
 test_rewrite.py # Equivalent test rewrite (invariant)
 distractor.py # Distractor insertion (certified unreachable)
 guard_relocate.py # Guard relocation (causal)
 api_relocate.py # API argument relocation (causal)
agents/ # Agent adapters (version-locked scaffolds)
 swe_agent.py # SWE-agent v0.7
 aider.py # Aider v0.82
 openhands.py # OpenHands v0.34
 autocoderover.py # AutoCodeRover v2.1
 agentless.py # Agentless v1.2
 commercial.py # Closed-source API reference (non-essential)
analysis/ # Statistical analysis and visualization
 bootstrap.py # 10,000 repository-clustered bootstrap resamples
 mixed_effects.py # Mixed-effects logistic regression
 baseline_comparison.py # Equal-budget baseline comparison (RQ4)
 cer_ablation.py # CER feature ablation
 cer_robustness.py # CER parameter robustness scans
 hardening.py # MetaProbe-guided hardening experiments
 hardening_loo.py # Leave-one-out hardening ablation
 hardening_heldout.py # Held-out generalization evaluation
 ipw_correction.py # Inverse-propensity weighting for selection bias
 visualization.py # Figure generation scripts
experiments/ # Experiment runners
 run_main.py # Main evaluation pipeline
 run_swebench.py # SWE-bench harness integration
 analyze_results.py # Result aggregation and table generation
lean/ # Lean 4 formalization (3,790 LOC)
 MetaProbe.lean # Metric self-consistency proofs
data/ # Experimental data (sample subset)
scripts/ # Deployment and utility scripts
tests/ # Unit and integration tests
docs/figures/ # Paper figures (PNG)
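
The listing above describes analysis/bootstrap.py as using repository-clustered bootstrap resamples. A minimal sketch of that general technique follows, using only the standard library: whole repositories are resampled with replacement so that correlated tasks from the same repository stay together. The function name, input layout, and percentile-interval choice are assumptions for illustration and may not match the repository's implementation.

```python
import random


def clustered_bootstrap_ci(outcomes_by_repo, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile confidence interval for a mean success rate.

    outcomes_by_repo maps a repository name to a non-empty list of 0/1 outcomes.
    Resampling is done at the repository level, not the task level.
    """
    rng = random.Random(seed)
    repos = sorted(outcomes_by_repo)
    stats = []
    for _ in range(n_resamples):
        # Draw as many repositories as there are, with replacement.
        sample = rng.choices(repos, k=len(repos))
        pooled = [x for repo in sample for x in outcomes_by_repo[repo]]
        stats.append(sum(pooled) / len(pooled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Made-up outcomes for three repositories.
    demo = {"repo-a": [1, 1, 0], "repo-b": [0, 0], "repo-c": [1]}
    print(clustered_bootstrap_ci(demo, n_resamples=2000))
```

Resampling repositories rather than individual tasks typically widens the interval when outcomes within a repository are correlated, which is the usual motivation for clustering.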

Quick Start

Prerequisites

pip install -r requirements.txt

Required environment variables for LLM API access:

export HUNYUAN_API_KEY="your-hunyuan-api-key"
export DASHSCOPE_API_KEY="your-dashscope-api-key"

Run in Mock Mode (No GPU Required)

python experiments/run_main.py
python experiments/analyze_results.py

Full Reproduction

# Single-command reproduction of main results table
metaprobe reproduce --table main-agent --seed 0
# Verify a specific certificate
metaprobe verify --task TASK_ID --certificate cert.json
# Score predictions against MetaProbe oracle
metaprobe score --predictions patches.jsonl --metric semantic
# Run stratified human audit
metaprobe audit --sample stratified --output audit.jsonl

GPU Deployment

All experiments use 8x A800 GPUs with vLLM/SGLang serving:

| Model | Quantization | GPUs Required |
|---|---|---|
| DeepSeek-V3.2 (671B-A37B MoE) | INT8 | 4-5 |
| GLM-4.6 (355B-A32B MoE) | INT8 | 4 |
| MiniMax-M2 (230B-A10B MoE) | INT8 | 3 |
| Qwen3-32B (32.8B dense) | FP16 | 2 |

Reproducibility Guarantees

Every result in the paper traces to auditable artifacts:

  • Model checkpoints: All evaluated agents use publicly downloadable, version-locked open-source checkpoints with exact parameter counts and architecture types. Checkpoint SHA-256 hashes are recorded.
  • Scaffold versions: Git-commit-pinned (e.g., SWE-agent v0.7 @ commit_hash).
  • Docker environments: Fixed Docker digest hashes ensure deterministic execution sandboxes.
  • Random seeds: Temperature fixed at 0 for main tables; seeds 0-4 for robustness analysis.
  • Trajectory logs: Full agent trajectories stored in results/raw/{agent}/{task_id}.jsonl.
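
Since checkpoint SHA-256 hashes are recorded, a downloaded checkpoint can be checked against its recorded value before running anything. The sketch below uses only Python's standard `hashlib`; the helper names and the idea of passing the recorded hash as a hex string are assumptions for illustration, as the README does not say where or in what format the hashes are stored.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_hex):
    """Compare a file's digest with a recorded hex string (case-insensitive)."""
    return sha256_of_file(path) == recorded_hex.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify_hash.py <file> <recorded-sha256-hex>
    if len(sys.argv) == 3:
        ok = matches_recorded(sys.argv[1], sys.argv[2])
        print("match" if ok else "MISMATCH")
```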

Artifact Components

| Component | Description | Status |
|---|---|---|
| Open-Lite | Public split, MRSpec files, certificates, scoring scripts | Public on acceptance |
| Open-Full | All transformed task diffs, certificate JSONs, coverage traces, mutation results, PEAR labels, human-audit records | Public |
| Trajectories | Full agent trajectories for all open-source agents (JSONL) | Public |
| Regeneration-Kit | Transformation builders, checkers, Dockerfiles, pinned dependency manifests | Public |
| Leaderboard-Kit | Patch-submission schema and scoring command | Public |
| Lean Proofs | Metric self-consistency formalization (3,790 LOC, Lean 4) | Public |

Lean 4 Formalization

The Lean development verifies internal consistency of analyzer and checker definitions:

| Property | Status |
|---|---|
| SemOK => Pass (semantic implies test pass) | Verified |
| Sens_sem and stale are non-overlapping and additive | Verified |
| L0 -> L5 monotonicity (certificate levels) | Verified |
| MAI_w in [0,1] (weighted adequacy index bounded) | Verified |
| Distractor edit does not imply SemOK (counterexample exclusion) | Verified |
| Analyzer determinism (fixed inputs => unique output) | Verified |

Scope: definition-level consistency only. The proofs do not establish any property of agent behavior or program correctness.

# Check Lean proofs
lake build
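
For intuition about what a definition-level property looks like, the following Lean 4 sketch states one possible reading of certificate-level monotonicity: a certificate that meets level k also meets every lower level j. The `meetsLevel` definition, which models levels as natural numbers, is a hypothetical stand-in and is not taken from lean/MetaProbe.lean.

```lean
/-- Hypothetical model: a certificate has a numeric level (0 to 5), and it
    meets a required level when the requirement does not exceed it. -/
def meetsLevel (certLevel required : Nat) : Prop := required ≤ certLevel

/-- Monotonicity: meeting level `k` implies meeting any lower level `j`. -/
theorem meetsLevel_mono {certLevel j k : Nat}
    (hjk : j ≤ k) (h : meetsLevel certLevel k) : meetsLevel certLevel j := by
  unfold meetsLevel at h ⊢
  exact Nat.le_trans hjk h
```

The statement concerns only the definition of levels, in line with the scope note above; it says nothing about whether a given certificate deserves its level.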

Citation

@inproceedings{metaprobe2027,
  title     = {MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency
               in Repository-Level Coding Agents},
  author    = {Anonymous},
  booktitle = {Proceedings of the 49th International Conference on Software Engineering (ICSE)},
  year      = {2027},
  note      = {Under review}
}

License

This repository is released for academic research purposes. See LICENSE for details.


Acknowledgments

We thank the annotators who participated in blind human adjudication and the external users who completed the artifact usability study. All experiments were conducted on institutional GPU clusters.
