Test-driven prompt engineering for GitHub Copilot.
Everyone copies instruction files from blog posts, adds "you are a senior engineer" to agent configs, and includes skills found on Reddit. But does any of it work? Are your instructions making your agent better — or just longer?
You don't know, because you're not testing it.
pytest-codingagents gives you a complete test→optimize→test loop for GitHub Copilot configurations:
- Write a test — define what the agent should do
- Run it — see it fail (or pass)
- Optimize — call `optimize_instruction()` to get a concrete suggestion
- A/B confirm — use `ab_run` to prove the change actually helps
- Ship it — you now have evidence, not vibes
Currently supports GitHub Copilot via copilot-sdk with IDE personas for VS Code, Claude Code, and Copilot CLI environments.
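As an illustration, selecting an IDE persona would happen when constructing the agent, roughly along these lines (a hypothetical sketch: the `persona` keyword and its values are assumptions, not confirmed API; check the API reference for the actual surface):

```python
from pytest_codingagents import CopilotAgent

# Hypothetical sketch: the `persona` keyword and its values are placeholders,
# not confirmed pytest-codingagents API. Check the API reference.
vscode_agent = CopilotAgent(
    instructions="Write Python code.",
    persona="vscode",  # placeholder; other environments might be "claude-code" or "copilot-cli"
)
```

A complete test of the test→optimize→test loop looks like this: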
```python
from pytest_codingagents import CopilotAgent, optimize_instruction
import pytest


async def test_docstring_instruction_works(ab_run):
    """Prove the docstring instruction actually changes output, and get a fix if it doesn't."""
    baseline = CopilotAgent(instructions="Write Python code.")
    treatment = CopilotAgent(
        instructions="Write Python code. Add Google-style docstrings to every function."
    )

    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")

    assert b.success and t.success
    if '"""' not in t.file("math.py"):
        suggestion = await optimize_instruction(
            treatment.instructions or "",
            t,
            "Agent should add docstrings to every function.",
        )
        pytest.fail(f"Docstring instruction was ignored.\n\n{suggestion}")
    assert '"""' not in b.file("math.py"), "Baseline should not have docstrings"
```
```bash
uv add pytest-codingagents
```
Authenticate with a `GITHUB_TOKEN` environment variable (CI) or an existing `gh` CLI login (local; verify with `gh auth status`).
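If you want the suite to skip cleanly when neither credential source is available, a small guard in `conftest.py` can do it. This is a sketch, not part of pytest-codingagents; the helper name and skip behavior are choices you can adapt:

```python
# conftest.py — optional guard that skips agent tests when no GitHub credentials
# are available. Not part of pytest-codingagents; adapt as needed.
import os
import shutil
import subprocess

import pytest


def _has_copilot_credentials() -> bool:
    """Return True if GITHUB_TOKEN is set or the gh CLI reports an active login."""
    if os.environ.get("GITHUB_TOKEN"):
        return True
    if shutil.which("gh") is None:
        return False
    result = subprocess.run(["gh", "auth", "status"], capture_output=True, text=True)
    return result.returncode == 0


def pytest_collection_modifyitems(config, items):
    # Skip every collected test if no credentials were found.
    if _has_copilot_credentials():
        return
    skip = pytest.mark.skip(reason="No GITHUB_TOKEN and no gh CLI login found")
    for item in items:
        item.add_marker(skip)
```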
| Capability | What it proves | Guide |
|---|---|---|
| A/B comparison | Config B actually produces different (and better) output than Config A | Getting Started |
| Instruction optimization | Turn a failing test into a ready-to-use instruction fix | Optimize Instructions |
| Instructions | Your custom instructions change agent behavior — not just vibes | Getting Started |
| Skills | That domain knowledge file is helping, not being ignored | Skill Testing |
| Models | Which model works best for your use case and budget (sketch below) | Model Comparison |
| Custom Agents | Your custom agent configurations actually work as intended | Getting Started |
| MCP Servers | The agent discovers and uses your custom tools | MCP Server Testing |
| CLI Tools | The agent operates command-line interfaces correctly | CLI Tool Testing |
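As a taste of the model-comparison capability, the same test can be parametrized across models and the HTML report's leaderboard does the ranking. A minimal sketch, assuming `CopilotAgent` accepts a `model` keyword (the keyword and the model identifiers below are placeholders; see the Model Comparison guide for the supported options):

```python
import pytest

from pytest_codingagents import CopilotAgent

# Placeholder identifiers; substitute the models your Copilot plan exposes.
MODELS = ["model-a", "model-b"]


@pytest.mark.parametrize("model", MODELS)
async def test_docstrings_per_model(model, ab_run):
    """Repeat the docstring A/B test per model; the report ranks pass rate and cost."""
    baseline = CopilotAgent(instructions="Write Python code.", model=model)
    treatment = CopilotAgent(
        instructions="Write Python code. Add Google-style docstrings to every function.",
        model=model,  # assumed keyword; check the API reference
    )
    b, t = await ab_run(baseline, treatment, "Create math.py with add(a, b) and subtract(a, b).")
    assert t.success and '"""' in t.file("math.py")
```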
See it in action: Basic Report · Model Comparison · Instruction Testing
Every test run produces an HTML report with AI-powered insights:
- Diagnoses failures — root cause analysis with suggested fixes
- Compares models — leaderboards ranked by pass rate and cost
- Evaluates instructions — which instructions produce better results
- Recommends improvements — actionable changes to tools, instructions, and skills
```bash
uv run pytest tests/ --aitest-html=report.html --aitest-summary-model=azure/gpt-5.2-chat
```
Full docs at sbroenne.github.io/pytest-codingagents — API reference, how-to guides, and demo reports.
MIT