MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents

MetaProbe Evidence Pipeline

Paper: MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency in Repository-Level Coding Agents
Venue: Submitted to ICSE 2027
Status: Under review (double-blind)

Overview

MetaProbe is an auditable metamorphic evaluation framework that measures the evidence-calibrated relational consistency of repository-level coding agents. Rather than reporting pass/fail on a fixed benchmark suite, MetaProbe assesses whether an agent's repair behavior stays stable under invariant transformations, responds to causal relocations, and ignores certified irrelevant regions.

The framework introduces:

PEAR (Patch-Equivalence and Alternative-Repair): A semantic adjudication protocol that calibrates the automatic region oracle and bounds oracle error propagation.
CER (Certificate-supported Evidence Region): A multi-signal evidence combiner integrating witness-patch edits, coverage traces, dynamic slices, generated-test behavior, and mutation evidence.
Certificate Checker (L0-L5): A trusted verifier that assigns evidence levels to each transformed task, separating untrusted generation from trusted validation.
Independent Verification Loop: Blind dual-region human adjudication and CER-disjoint discriminative tests validate causal claims without self-referential evidence.

Key Results

We evaluate eight fully reproducible agents on 3,257 certificate-validated tasks across SWE-bench Verified, BugsInPy, and Defects4J.

Agent	Model (Architecture)	Sem. Stability	Sem. Sensitivity	Distractor Error
Simple RAG	Qwen3-7B (7B dense)	48.6%	20.5%	19.7%
SWE-agent + Qwen3	Qwen3-32B (32.8B dense)	55.7%	25.4%	15.3%
Aider + Qwen3	Qwen3-32B (32.8B dense)	57.9%	27.6%	14.1%
OpenHands + GLM-4.6	GLM-4.6 (355B-A32B MoE)	62.1%	33.1%	11.8%
AutoCodeRover + GLM-4.6	GLM-4.6 (355B-A32B MoE)	64.5%	35.0%	10.7%
Agentless + MiniMax-M2	MiniMax-M2 (230B-A10B MoE)	67.6%	37.4%	9.2%
OpenHands + Qwen3	Qwen3-32B (32.8B dense)	72.3%	44.6%	7.1%
OpenHands + DeepSeek-V3.2	DeepSeek-V3.2 (671B-A37B MoE)	74.1%	46.7%	6.5%

Even the strongest agent exhibits a 25.9 pp gap from perfect semantic stability.

Architecture

Relation Semantics and Failure Modes

Figure 2. Geometric intuition behind MetaProbe's predicates: invariant stability, causal sensitivity, and distractor resistance define orthogonal diagnostic dimensions.

MetaProbe follows a generator-checker-adjudicator pipeline:

MRSpec Generator (untrusted)
 --> Certificate Builder (evidence collection)
 --> Certificate Checker (trusted, L0-L5 validation)
 --> Agent Execution (fixed protocol, Docker-isolated)
 --> PEAR Adjudication (oracle calibration)
 --> Independent Verification Loop
 --> Statistical Analyzer (bootstrap + mixed-effects)
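The trust boundary between the untrusted generator and the trusted checker can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` type, the `check_level` rule (count the leading evidence checks that hold), and the `MIN_LEVEL = 4` gate are placeholders, not the checker shipped in `metaprobe/certificate.py`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical admission threshold for this sketch.
MIN_LEVEL = 4


@dataclass(frozen=True)
class Candidate:
    """A transformed task proposed by the (untrusted) generator."""
    task_id: str
    family: str
    evidence: Tuple[bool, ...]  # ordered evidence checks (placeholder)


def check_level(candidate: Candidate) -> int:
    """Placeholder checker: level = number of leading checks that hold, capped at 5."""
    level = 0
    for ok in candidate.evidence[:5]:
        if not ok:
            break
        level += 1
    return level


def admit(
    candidates: Iterable[Candidate],
    checker: Callable[[Candidate], int] = check_level,
    min_level: int = MIN_LEVEL,
) -> List[Candidate]:
    """Keep only candidates whose checker-assigned level reaches the gate."""
    return [c for c in candidates if checker(c) >= min_level]
```

The point of the design is that `admit` never trusts anything the generator says about its own output; only the level returned by the checker decides whether a task reaches agent execution.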

Transformation Families

Family	Type	Obligation
Issue paraphrase	Invariant	Repair should remain stable
Semantic refactoring	Invariant	Repair should remain stable
Equivalent test rewrite	Invariant	Repair should remain stable
Distractor insertion	Distractor	Agent should ignore certified-unreachable regions
Guard relocation	Causal	Agent should adapt to relocated fault evidence
API argument relocation	Causal	Agent should adapt to relocated fault evidence
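The table above can also be read as a lookup from family to relation type to obligation. The sketch below encodes it that way; the identifier spellings (`issue_paraphrase`, `RelationType`, `obligation_for`) are illustrative choices and may not match the names used in `metaprobe/mr_spec.py`.

```python
from enum import Enum


class RelationType(Enum):
    INVARIANT = "Invariant"
    DISTRACTOR = "Distractor"
    CAUSAL = "Causal"


# Family -> relation type, transcribed from the table (key spellings are assumed).
FAMILIES = {
    "issue_paraphrase": RelationType.INVARIANT,
    "semantic_refactoring": RelationType.INVARIANT,
    "equivalent_test_rewrite": RelationType.INVARIANT,
    "distractor_insertion": RelationType.DISTRACTOR,
    "guard_relocation": RelationType.CAUSAL,
    "api_argument_relocation": RelationType.CAUSAL,
}

# Relation type -> obligation, using the wording from the table.
OBLIGATIONS = {
    RelationType.INVARIANT: "Repair should remain stable",
    RelationType.DISTRACTOR: "Agent should ignore certified-unreachable regions",
    RelationType.CAUSAL: "Agent should adapt to relocated fault evidence",
}


def obligation_for(family: str) -> str:
    """Return the obligation an agent must satisfy for a transformation family."""
    return OBLIGATIONS[FAMILIES[family]]
```

For example, `obligation_for("guard_relocation")` returns the causal obligation, while all three invariant families share the stability obligation.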

Certificate Yield and Audit Validity

Figure 4. Certificate funnel: 5,400 attempted transformations yield 3,257 L4+ validated tasks. Human audit confirms 94–98% validity across all families.
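The overall yield implied by the funnel is a one-line calculation. The sketch below uses only the two totals quoted in the caption; the function and variable names are illustrative, and per-family yields are not reproduced because they are not given here.

```python
ATTEMPTED = 5400   # attempted transformations (from the Figure 4 caption)
VALIDATED = 3257   # tasks validated at L4 or above (from the Figure 4 caption)


def yield_rate(validated: int, attempted: int) -> float:
    """Fraction of attempted transformations that survive certificate checking."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return validated / attempted


rate = yield_rate(VALIDATED, ATTEMPTED)
not_validated = ATTEMPTED - VALIDATED
print(f"yield: {rate:.1%}, below L4: {not_validated}")
```

Roughly three in five attempted transformations reach L4 or above; the remainder are filtered out by the checker before any agent is run.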

PEAR Oracle Calibration

PEAR Calibration Matrix

Figure 5. PEAR adjudication across patch categories. Among region-mismatch patches that pass tests, 20.6% are accepted as equivalent or plausible alternative repairs, confirming the necessity of semantic correction beyond naive region matching.

Metamorphic Adequacy Dashboard

Adequacy Dashboard

Figure 3. Full adequacy picture: original pass rate does not determine semantic stability; causal pass leaves a stale-patch gap; certified distractors remain diagnostic.

MetaProbe-Guided Hardening

Hardening Lift Curve

Figure 6. Development-set failures guide scaffold hardening. Gains generalize to held-out repositories (+7.6 pp stability, +9.1 pp sensitivity), held-out transformation families, and cross-dataset transfer, with negligible impact on original pass rate (+0.5 pp).

Repository Structure

metaprobe/ # Core framework
 __init__.py
 mr_spec.py # MRSpec transformation definitions (6 families)
 certificate.py # Certificate checker (L0-L5 evidence validation)
 cer.py # CER: learned evidence-region combiner
 pear.py # PEAR: patch-equivalence adjudication protocol
 metrics.py # Semantic stability/sensitivity/distractor metrics
 generated_tests.py # Generated-test synthesis and disjointness checks
transforms/ # Transformation generators
 paraphrase.py # Issue-preserving paraphrase (invariant)
 refactor.py # Semantic refactoring (invariant)
 test_rewrite.py # Equivalent test rewrite (invariant)
 distractor.py # Distractor insertion (certified unreachable)
 guard_relocate.py # Guard relocation (causal)
 api_relocate.py # API argument relocation (causal)
agents/ # Agent adapters (version-locked scaffolds)
 swe_agent.py # SWE-agent v0.7
 aider.py # Aider v0.82
 openhands.py # OpenHands v0.34
 autocoderover.py # AutoCodeRover v2.1
 agentless.py # Agentless v1.2
 commercial.py # Closed-source API reference (non-essential)
analysis/ # Statistical analysis and visualization
 bootstrap.py # 10,000 repository-clustered bootstrap resamples
 mixed_effects.py # Mixed-effects logistic regression
 baseline_comparison.py # Equal-budget baseline comparison (RQ4)
 cer_ablation.py # CER feature ablation
 cer_robustness.py # CER parameter robustness scans
 hardening.py # MetaProbe-guided hardening experiments
 hardening_loo.py # Leave-one-out hardening ablation
 hardening_heldout.py # Held-out generalization evaluation
 ipw_correction.py # Inverse-propensity weighting for selection bias
 visualization.py # Figure generation scripts
experiments/ # Experiment runners
 run_main.py # Main evaluation pipeline
 run_swebench.py # SWE-bench harness integration
 analyze_results.py # Result aggregation and table generation
lean/ # Lean 4 formalization (3,790 LOC)
 MetaProbe.lean # Metric self-consistency proofs
data/ # Experimental data (sample subset)
scripts/ # Deployment and utility scripts
tests/ # Unit and integration tests
docs/figures/ # Paper figures (PNG)
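The listing mentions 10,000 repository-clustered bootstrap resamples in `analysis/bootstrap.py`. As a rough illustration of what clustering by repository means, the sketch below resamples whole repositories with replacement and reports a percentile interval. It is a generic standard-library version written under assumptions (binary outcomes, percentile method, the name `clustered_bootstrap_ci`), not the repository's implementation.

```python
import random
from collections import defaultdict
from statistics import mean


def clustered_bootstrap_ci(records, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for a rate, resampling repositories rather than tasks.

    records: iterable of (repo_name, outcome) pairs with outcome in {0, 1}.
    Returns (point_estimate, lower, upper).
    """
    records = list(records)
    if not records:
        raise ValueError("records must not be empty")

    # Group task outcomes by repository so that clusters stay intact.
    by_repo = defaultdict(list)
    for repo, outcome in records:
        by_repo[repo].append(outcome)
    repos = sorted(by_repo)

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Draw as many repositories as observed, with replacement.
        drawn = rng.choices(repos, k=len(repos))
        outcomes = [o for repo in drawn for o in by_repo[repo]]
        estimates.append(mean(outcomes))

    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = mean([o for _, o in records])
    return point, lower, upper
```

Resampling at the repository level keeps correlated tasks from the same repository together, which usually gives wider and more honest intervals than resampling individual tasks.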

Quick Start

Prerequisites

pip install -r requirements.txt

Required environment variables for LLM API access:

export HUNYUAN_API_KEY="your-hunyuan-api-key"
export DASHSCOPE_API_KEY="your-dashscope-api-key"
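Before launching a run that calls the hosted models, it can help to confirm that both variables are actually set. The helper below is a small convenience sketch; `missing_api_keys` is a hypothetical name and is not part of the repository.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_KEYS = ("HUNYUAN_API_KEY", "DASHSCOPE_API_KEY")


def missing_api_keys(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```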

Run in Mock Mode (No GPU Required)

python experiments/run_main.py
python experiments/analyze_results.py

Full Reproduction

# Single-command reproduction of main results table
metaprobe reproduce --table main-agent --seed 0
# Verify a specific certificate
metaprobe verify --task TASK_ID --certificate cert.json
# Score predictions against MetaProbe oracle
metaprobe score --predictions patches.jsonl --metric semantic
# Run stratified human audit
metaprobe audit --sample stratified --output audit.jsonl

GPU Deployment

All experiments use 8x A800 GPUs with vLLM/SGLang serving:

Model	Quantization	GPUs Required
DeepSeek-V3.2 (671B-A37B MoE)	INT8	4–5
GLM-4.6 (355B-A32B MoE)	INT8	4
MiniMax-M2 (230B-A10B MoE)	INT8	3
Qwen3-32B (32.8B dense)	FP16	2

Reproducibility Guarantees

Every result in the paper traces to auditable artifacts:

Model checkpoints: All evaluated agents use publicly downloadable, version-locked open-source checkpoints with exact parameter counts and architecture types. Checkpoint SHA-256 hashes are recorded.
Scaffold versions: Git-commit-pinned (e.g., SWE-agent v0.7 @ commit_hash).
Docker environments: Fixed Docker digest hashes ensure deterministic execution sandboxes.
Random seeds: Temperature fixed at 0 for main tables; seeds 0–4 for robustness analysis.
Trajectory logs: Full agent trajectories stored in results/raw/{agent}/{task_id}.jsonl.
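Checking a downloaded checkpoint against a recorded SHA-256 hash needs only the standard library. The sketch below is an assumption about how such a check might look; the function names are hypothetical, and where the recorded hashes live in this repository is not specified here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_hex):
    """Compare a file's digest with a recorded hex string (case-insensitive)."""
    return sha256_of(path) == recorded_hex.strip().lower()
```

Reading in fixed-size chunks matters for multi-gigabyte weight shards, where loading the whole file at once would be wasteful.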

Artifact Components

Component	Description	Status
Open-Lite	Public split, MRSpec files, certificates, scoring scripts	Public on acceptance
Open-Full	All transformed task diffs, certificate JSONs, coverage traces, mutation results, PEAR labels, human-audit records	Public
Trajectories	Full agent trajectories for all open-source agents (JSONL)	Public
Regeneration-Kit	Transformation builders, checkers, Dockerfiles, pinned dependency manifests	Public
Leaderboard-Kit	Patch-submission schema and scoring command	Public
Lean Proofs	Metric self-consistency formalization (3,790 LOC, Lean 4)	Public

Lean 4 Formalization

The Lean development verifies internal consistency of analyzer and checker definitions:

Property	Status
SemOK => Pass (a semantically accepted patch also passes the tests)	Verified
Sens_sem and stale are non-overlapping and additive	Verified
L0 -> L5 monotonicity (certificate levels)	Verified
MAI_w in [0,1] (weighted adequacy index bounded)	Verified
Distractor edit does not imply SemOK (counterexample exclusion)	Verified
Analyzer determinism (fixed inputs => unique output)	Verified

Scope: definition-level consistency only. The proofs do not establish any property of agent behavior or program correctness.
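Two of the listed properties can be illustrated with a toy model in Python. The model is an assumption made for exposition: it treats a causal-task outcome as two booleans and defines `sem_ok` and `stale` as the two ways a passing patch can fall. The actual Lean definitions may differ, and this sketch proves nothing; it only shows the shape of the claims.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class CausalOutcome:
    passes_tests: bool
    repairs_relocated_fault: bool  # hypothetical stand-in for the semantic oracle


def sem_ok(outcome: CausalOutcome) -> bool:
    """Toy SemOK: the patch passes and addresses the relocated fault."""
    return outcome.passes_tests and outcome.repairs_relocated_fault


def stale(outcome: CausalOutcome) -> bool:
    """Toy stale patch: the patch passes but does not address the relocated fault."""
    return outcome.passes_tests and not outcome.repairs_relocated_fault


def rates(outcomes):
    """Exact pass, semantic-sensitivity, and stale rates over a list of outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("outcomes must not be empty")
    return {
        "pass": Fraction(sum(o.passes_tests for o in outcomes), n),
        "sens_sem": Fraction(sum(sem_ok(o) for o in outcomes), n),
        "stale": Fraction(sum(stale(o) for o in outcomes), n),
    }
```

Under these toy definitions, `sem_ok` implies `passes_tests` by construction, no outcome is both `sem_ok` and `stale`, and the two rates sum exactly to the pass rate.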

# Check Lean proofs
lake build

Citation

@inproceedings{metaprobe2027,
 title = {MetaProbe: Auditable Metamorphic Evaluation of Causal Consistency
 in Repository-Level Coding Agents},
 author = {Anonymous},
 booktitle = {Proceedings of the 49th International Conference on Software Engineering (ICSE)},
 year = {2027},
 note = {Under review}
}

License

This repository is released for academic research purposes. See LICENSE for details.

Acknowledgments

We thank the annotators who participated in blind human adjudication and the external users who completed the artifact usability study. All experiments were conducted on institutional GPU clusters.
