Paper: MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents
Venue: Submitted to ICSE 2027
Status: Under review (double-blind)
MetaProbe is an auditable metamorphic evaluation framework that measures evidence-calibrated relational consistency of repository-level coding agents. Unlike pass/fail evaluation on fixed benchmark suites, MetaProbe assesses whether agent repair behavior remains stable under invariant transformations, sensitive under causal relocations, and undistracted by certified irrelevant regions.
The framework introduces:
- PEAR (Patch-Equivalence and Alternative-Repair): A semantic adjudication protocol that calibrates the automatic region oracle and bounds oracle error propagation.
- CER (Certificate-supported Evidence Region): A multi-signal evidence combiner integrating witness-patch edits, coverage traces, dynamic slices, generated-test behavior, and mutation evidence.
- Certificate Checker (L0--L5): A trusted verifier that assigns evidence levels to each transformed task, separating untrusted generation from trusted validation.
- Independent Verification Loop: Blind dual-region human adjudication and CER-disjoint discriminative tests validate causal claims without self-referential evidence.
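The evidence levels assigned by the Certificate Checker can be pictured as an ordered scale. The sketch below is a minimal illustration, assuming the levels are totally ordered from L0 (weakest) to L5 (strongest) and that a task counts as validated at L4 or above, which is the threshold quoted for the certificate funnel. The names `EvidenceLevel` and `is_validated` are hypothetical and are not taken from `certificate.py`.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """Hypothetical ordered evidence levels; ordering is an assumption."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


def is_validated(level: EvidenceLevel,
                 threshold: EvidenceLevel = EvidenceLevel.L4) -> bool:
    """Return True when a certificate meets the validation threshold."""
    return level >= threshold


# Levels that would pass an "L4+" gate under the assumptions above.
validated = [lvl.name for lvl in EvidenceLevel if is_validated(lvl)]
print(validated)
```

Using an `IntEnum` keeps the comparison monotone by construction: raising a certificate's level can never turn a validated task into a rejected one.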
We evaluate eight fully reproducible agents on 3,257 certificate-validated tasks across SWE-bench Verified, BugsInPy, and Defects4J.
| Agent | Model (Architecture) | Sem. Stability | Sem. Sensitivity | Distractor Error |
|---|---|---|---|---|
| Simple RAG | Qwen3-7B (7B dense) | 48.6% | 20.5% | 19.7% |
| SWE-agent + Qwen3 | Qwen3-32B (32.8B dense) | 55.7% | 25.4% | 15.3% |
| Aider + Qwen3 | Qwen3-32B (32.8B dense) | 57.9% | 27.6% | 14.1% |
| OpenHands + GLM-4.6 | GLM-4.6 (355B-A32B MoE) | 62.1% | 33.1% | 11.8% |
| AutoCodeRover + GLM-4.6 | GLM-4.6 (355B-A32B MoE) | 64.5% | 35.0% | 10.7% |
| Agentless + MiniMax-M2 | MiniMax-M2 (230B-A10B MoE) | 67.6% | 37.4% | 9.2% |
| OpenHands + Qwen3 | Qwen3-32B (32.8B dense) | 72.3% | 44.6% | 7.1% |
| OpenHands + DeepSeek-V3.2 | DeepSeek-V3.2 (671B-A37B MoE) | 74.1% | 46.7% | 6.5% |
Even the strongest agent (OpenHands + DeepSeek-V3.2, 74.1% semantic stability) exhibits a 25.9 pp gap from perfect semantic stability.
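The gap figure follows directly from the table. As a quick check, the snippet below recomputes the distance to 100% semantic stability for every agent; the numbers are copied from the table above and the variable names are illustrative only.

```python
# Semantic stability (%) copied from the results table.
stability = {
    "Simple RAG": 48.6,
    "SWE-agent + Qwen3": 55.7,
    "Aider + Qwen3": 57.9,
    "OpenHands + GLM-4.6": 62.1,
    "AutoCodeRover + GLM-4.6": 64.5,
    "Agentless + MiniMax-M2": 67.6,
    "OpenHands + Qwen3": 72.3,
    "OpenHands + DeepSeek-V3.2": 74.1,
}

# Gap to perfect stability in percentage points, rounded to one decimal.
gaps = {agent: round(100.0 - value, 1) for agent, value in stability.items()}

best_agent = max(stability, key=stability.get)
print(best_agent, gaps[best_agent])
```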
Relation Semantics and Failure Modes
Figure 2. Geometric intuition behind MetaProbe's predicates: invariant stability, causal sensitivity, and distractor resistance define orthogonal diagnostic dimensions.
MetaProbe follows a generator--checker--adjudicator pipeline:
MRSpec Generator (untrusted)
--> Certificate Builder (evidence collection)
--> Certificate Checker (trusted, L0-L5 validation)
--> Agent Execution (fixed protocol, Docker-isolated)
--> PEAR Adjudication (oracle calibration)
--> Independent Verification Loop
--> Statistical Analyzer (bootstrap + mixed-effects)
| Family | Type | Obligation |
|---|---|---|
| Issue paraphrase | Invariant | Repair should remain stable |
| Semantic refactoring | Invariant | Repair should remain stable |
| Equivalent test rewrite | Invariant | Repair should remain stable |
| Distractor insertion | Distractor | Agent should ignore certified-unreachable regions |
| Guard relocation | Causal | Agent should adapt to relocated fault evidence |
| API argument relocation | Causal | Agent should adapt to relocated fault evidence |
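The family table can also be read as a small lookup structure. The following sketch encodes the six rows as data and groups them by relation type; the `Family` dataclass and its field names are assumptions for illustration and do not reflect the actual layout of `mr_spec.py`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    """One transformation family (hypothetical representation)."""
    name: str
    kind: str        # "Invariant", "Distractor", or "Causal"
    obligation: str


STABLE = "Repair should remain stable"
ADAPT = "Agent should adapt to relocated fault evidence"

FAMILIES = [
    Family("Issue paraphrase", "Invariant", STABLE),
    Family("Semantic refactoring", "Invariant", STABLE),
    Family("Equivalent test rewrite", "Invariant", STABLE),
    Family("Distractor insertion", "Distractor",
           "Agent should ignore certified-unreachable regions"),
    Family("Guard relocation", "Causal", ADAPT),
    Family("API argument relocation", "Causal", ADAPT),
]

# Group family names by relation type.
by_kind = defaultdict(list)
for family in FAMILIES:
    by_kind[family.kind].append(family.name)

for kind, names in by_kind.items():
    print(f"{kind}: {len(names)} families")
```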
Certificate Yield and Audit Validity
Figure 4. Certificate funnel: 5,400 attempted transformations yield 3,257 L4+ validated tasks. Human audit confirms 94--98% validity across all families.
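For reference, the funnel numbers in the Figure 4 caption imply the following overall yield. This is plain arithmetic on the two quoted counts; the per-family breakdown is not reproduced here.

```python
attempted = 5400   # attempted transformations (Figure 4 caption)
validated = 3257   # tasks certified at L4 or above (Figure 4 caption)

yield_rate = validated / attempted
not_validated = attempted - validated

print(f"yield = {yield_rate:.1%}, not validated = {not_validated}")
```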
Figure 5. PEAR adjudication across patch categories. Among region-mismatch patches that pass tests, 20.6% are accepted as equivalent or plausible alternative repairs, confirming the necessity of semantic correction beyond naive region matching.
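To make the difference between naive region matching and a PEAR-corrected verdict concrete, here is a minimal sketch. It assumes a patch is judged on three inputs: whether it passes tests, whether it edits the certified evidence region, and an optional adjudication label. The label strings, the `PatchOutcome` type, and both functions are hypothetical and are not the interface of `pear.py`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical adjudication labels; the real PEAR label set may differ.
ACCEPTED_LABELS = {"equivalent", "alternative_repair"}


@dataclass
class PatchOutcome:
    passes_tests: bool
    edits_certified_region: bool
    pear_label: Optional[str] = None  # set only for adjudicated patches


def naive_ok(p: PatchOutcome) -> bool:
    """Region-matching oracle: accept only passing patches inside the region."""
    return p.passes_tests and p.edits_certified_region


def pear_corrected_ok(p: PatchOutcome) -> bool:
    """Also accept passing region-mismatch patches approved by adjudication."""
    if not p.passes_tests:
        return False  # a failing patch is never counted as semantically OK
    return p.edits_certified_region or p.pear_label in ACCEPTED_LABELS


alt = PatchOutcome(passes_tests=True, edits_certified_region=False,
                   pear_label="alternative_repair")
print(naive_ok(alt), pear_corrected_ok(alt))
```

Under these assumptions the naive oracle rejects the alternative repair while the corrected verdict accepts it, which is the kind of case the 20.6% figure refers to.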
Figure 3. Full adequacy picture: original pass rate does not determine semantic stability; causal pass leaves a stale-patch gap; certified distractors remain diagnostic.
Figure 6. Development-set failures guide scaffold hardening. Gains generalize to held-out repositories (+7.6 pp stability, +9.1 pp sensitivity), held-out transformation families, and cross-dataset transfer, with negligible impact on original pass rate (+0.5 pp).
metaprobe/  # Core framework
    __init__.py
    mr_spec.py  # MRSpec transformation definitions (6 families)
    certificate.py  # Certificate checker (L0-L5 evidence validation)
    cer.py  # CER: learned evidence-region combiner
    pear.py  # PEAR: patch-equivalence adjudication protocol
    metrics.py  # Semantic stability/sensitivity/distractor metrics
    generated_tests.py  # Generated-test synthesis and disjointness checks
transforms/  # Transformation generators
    paraphrase.py  # Issue-preserving paraphrase (invariant)
    refactor.py  # Semantic refactoring (invariant)
    test_rewrite.py  # Equivalent test rewrite (invariant)
    distractor.py  # Distractor insertion (certified unreachable)
    guard_relocate.py  # Guard relocation (causal)
    api_relocate.py  # API argument relocation (causal)
agents/  # Agent adapters (version-locked scaffolds)
    swe_agent.py  # SWE-agent v0.7
    aider.py  # Aider v0.82
    openhands.py  # OpenHands v0.34
    autocoderover.py  # AutoCodeRover v2.1
    agentless.py  # Agentless v1.2
    commercial.py  # Closed-source API reference (non-essential)
analysis/  # Statistical analysis and visualization
    bootstrap.py  # 10,000 repository-clustered bootstrap resamples
    mixed_effects.py  # Mixed-effects logistic regression
    baseline_comparison.py  # Equal-budget baseline comparison (RQ4)
    cer_ablation.py  # CER feature ablation
    cer_robustness.py  # CER parameter robustness scans
    hardening.py  # MetaProbe-guided hardening experiments
    hardening_loo.py  # Leave-one-out hardening ablation
    hardening_heldout.py  # Held-out generalization evaluation
    ipw_correction.py  # Inverse-propensity weighting for selection bias
    visualization.py  # Figure generation scripts
experiments/  # Experiment runners
    run_main.py  # Main evaluation pipeline
    run_swebench.py  # SWE-bench harness integration
    analyze_results.py  # Result aggregation and table generation
lean/  # Lean 4 formalization (3,790 LOC)
    MetaProbe.lean  # Metric self-consistency proofs
data/  # Experimental data (sample subset)
scripts/  # Deployment and utility scripts
tests/  # Unit and integration tests
docs/figures/  # Paper figures (PNG)
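The layout lists `analysis/bootstrap.py` as performing repository-clustered bootstrap resampling. A minimal sketch of that general technique is shown below, assuming binary per-task outcomes grouped by repository and a percentile interval; the function name, signature, and toy data are hypothetical and are not the repository's implementation.

```python
import random
from statistics import mean


def clustered_bootstrap_ci(outcomes_by_repo, n_resamples=10_000,
                           alpha=0.05, seed=0):
    """Percentile CI for a pooled success rate, resampling whole repositories."""
    rng = random.Random(seed)
    repos = list(outcomes_by_repo)
    estimates = []
    for _ in range(n_resamples):
        # Draw repositories with replacement; keep each repo's tasks together
        # so that within-repository correlation is preserved.
        sample = rng.choices(repos, k=len(repos))
        pooled = [y for repo in sample for y in outcomes_by_repo[repo]]
        estimates.append(mean(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data, made up for illustration: 1 = stable repair, 0 = unstable.
toy = {
    "repo_a": [1, 1, 0, 1],
    "repo_b": [0, 0, 1],
    "repo_c": [1, 1, 1, 1, 0],
    "repo_d": [0, 1],
}
low, high = clustered_bootstrap_ci(toy, n_resamples=2_000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Resampling at the repository level rather than the task level typically widens the interval, since tasks from the same repository are not independent.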
pip install -r requirements.txt
Required environment variables for LLM API access:
    export HUNYUAN_API_KEY="your-hunyuan-api-key"
    export DASHSCOPE_API_KEY="your-dashscope-api-key"
    python experiments/run_main.py
    python experiments/analyze_results.py
    # Single-command reproduction of main results table
    metaprobe reproduce --table main-agent --seed 0
    # Verify a specific certificate
    metaprobe verify --task TASK_ID --certificate cert.json
    # Score predictions against MetaProbe oracle
    metaprobe score --predictions patches.jsonl --metric semantic
    # Run stratified human audit
    metaprobe audit --sample stratified --output audit.jsonl
All experiments use 8x A800 GPUs with vLLM/SGLang serving:
| Model | Quantization | GPUs Required |
|---|---|---|
| DeepSeek-V3.2 (671B-A37B MoE) | INT8 | 4--5 |
| GLM-4.6 (355B-A32B MoE) | INT8 | 4 |
| MiniMax-M2 (230B-A10B MoE) | INT8 | 3 |
| Qwen3-32B (32.8B dense) | FP16 | 2 |
Every result in the paper traces to auditable artifacts:
- Model checkpoints: All evaluated agents use publicly downloadable, version-locked open-source checkpoints with exact parameter counts and architecture types. Checkpoint SHA-256 hashes are recorded.
- Scaffold versions: Git-commit-pinned (e.g., SWE-agent v0.7 @ commit_hash).
- Docker environments: Fixed Docker digest hashes ensure deterministic execution sandboxes.
- Random seeds: Temperature fixed at 0 for main tables; seeds 0--4 for robustness analysis.
- Trajectory logs: Full agent trajectories stored in results/raw/{agent}/{task_id}.jsonl.
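Since checkpoint SHA-256 hashes are recorded, a downloaded checkpoint can be checked against its recorded digest before use. The sketch below uses only the Python standard library; the helper names and the idea of passing the expected digest as a hex string are assumptions, as the format in which the hashes are stored is not specified here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a recorded one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_checkpoint(Path("model.safetensors"), recorded_hash)` would return `False` on any mismatch; the file name here is a placeholder.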
| Component | Description | Status |
|---|---|---|
| Open-Lite | Public split, MRSpec files, certificates, scoring scripts | Public on acceptance |
| Open-Full | All transformed task diffs, certificate JSONs, coverage traces, mutation results, PEAR labels, human-audit records | Public |
| Trajectories | Full agent trajectories for all open-source agents (JSONL) | Public |
| Regeneration-Kit | Transformation builders, checkers, Dockerfiles, pinned dependency manifests | Public |
| Leaderboard-Kit | Patch-submission schema and scoring command | Public |
| Lean Proofs | Metric self-consistency formalization (3,790 LOC, Lean 4) | Public |
The Lean development verifies internal consistency of analyzer and checker definitions:
| Property | Status |
|---|---|
| SemOK => Pass (semantic implies test pass) | Verified |
| Sens_sem and stale are non-overlapping and additive | Verified |
| L0 -> L5 monotonicity (certificate levels) | Verified |
| MAI_w in [0,1] (weighted adequacy index bounded) | Verified |
| Distractor edit does not imply SemOK (counterexample exclusion) | Verified |
| Analyzer determinism (fixed inputs => unique output) | Verified |
Scope: definition-level consistency only. Does not prove properties about agent behavior or program correctness.
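As an illustration of what a definition-level property looks like, consider the boundedness of the weighted adequacy index. The definition of MAI_w is not given in this README, so the following assumes, purely for illustration, that it is a convex combination of component metrics $m_i \in [0,1]$ with weights $w_i \ge 0$ and $\sum_i w_i = 1$. Under that assumption the bound follows in one line:

$$
0 \;=\; \sum_i w_i \cdot 0 \;\le\; \sum_i w_i\, m_i \;\le\; \sum_i w_i \cdot 1 \;=\; 1 .
$$

If the actual definition differs (for example, unnormalized weights), the verified statement in the Lean development would rest on a different argument.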
    # Check Lean proofs
    lake build

    @inproceedings{metaprobe2027,
      title     = {MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents},
      author    = {Anonymous},
      booktitle = {Proceedings of the 49th International Conference on Software Engineering (ICSE)},
      year      = {2027},
      note      = {Under review}
    }
This repository is released for academic research purposes. See LICENSE for details.
We thank the annotators who participated in blind human adjudication and the external users who completed the artifact usability study. All experiments were conducted on institutional GPU clusters.