Three failures that typically reach runtime undetected:
- Add an output field in templates/analyze.md, forget to declare it in workflow.md's context_inputs. The downstream phase receives nothing and silently continues.
- Change the routing confidence threshold from 0.95 to 0.9, skip updating the routing tests. Edge-case behavior shifts; you find out when a workflow runs in production.
- Delete a template file with an active reference in workflow.md.
All three are catchable at commit time with automated checks, not at runtime.
Three CI Gates
Gate 1: Static validation (seconds, runs on every commit)
- All referenced template files exist
- Skills in config.yaml exist in the registry
- Every phase's on_success / on_failure target is a known phase or reserved keyword
Gate 2: Schema tests (minutes, runs on every commit)
- context_inputs declarations align with actual upstream output fields
- No real LLM calls — validates data contracts only
- Corresponds to Layer 1 + Layer 2 tests from the Evaluation article (W5)
Gate 3: End-to-end regression (hours, runs before merge)
- Run eval/cases.yaml happy path through the full workflow
- Compare results against baseline metrics
- Corresponds to Layer 3 tests from W5
Gate 1: Static Validation Script
Gate 1 makes no LLM calls. It runs only filesystem and text checks, so it completes in seconds:
#!/usr/bin/env python3
# tools/validate_workflow.py
import re
import sys
from pathlib import Path

import yaml

SKILL_DIR = Path("skills/wf-bug-e2e")
TEMPLATES_DIR = SKILL_DIR / "templates"
ERRORS = []


def check_template_references():
    """All templates referenced in workflow.md must exist on disk"""
    content = (SKILL_DIR / "workflow.md").read_text()
    refs = re.findall(r"template:\s*(\S+\.md)", content)
    for ref in refs:
        if not (TEMPLATES_DIR / ref).exists():
            ERRORS.append(f"Template not found: templates/{ref} (referenced in workflow.md)")


def check_phase_routing():
    """Every on_success / on_failure target must be a known phase or reserved keyword"""
    content = (SKILL_DIR / "workflow.md").read_text()
    # Phase names are captured without their "phase_" prefix
    phases = set(re.findall(r"^phase_(\w+):", content, re.MULTILINE))
    targets = re.findall(r"(?:on_success|on_failure|continue_to):\s*(\S+)", content)
    reserved = {"END", "human_escalation", "gate_A", "gate_B", "gate_C"}
    for target in targets:
        # Strip only a leading "phase_" (str.replace would also strip later occurrences)
        phase_name = target.removeprefix("phase_")
        if target not in reserved and phase_name not in phases:
            ERRORS.append(f"Routing target not found: '{target}'")


def check_config_skills():
    """Skills referenced in config.yaml must exist in the registry"""
    config_file = SKILL_DIR / "config.yaml"
    registry_file = Path("skills/registry.yaml")
    if not config_file.exists() or not registry_file.exists():
        return  # nothing to check when either file is absent
    # safe_load returns None for an empty file, so fall back to an empty dict
    config = yaml.safe_load(config_file.read_text()) or {}
    registry = yaml.safe_load(registry_file.read_text()) or {}
    registered_ids = {s["id"] for s in registry.get("skills", [])}
    for phase_config in config.get("phases", {}).values():
        skill_id = phase_config.get("skill")
        if skill_id and skill_id not in registered_ids:
            ERRORS.append(f"Skill not in registry: '{skill_id}' (check config.yaml)")


def main():
    check_template_references()
    check_phase_routing()
    check_config_skills()
    if ERRORS:
        print("❌ Workflow validation failed:")
        for e in ERRORS:
            print(f"  - {e}")
        sys.exit(1)
    print("✅ Workflow validation passed")


if __name__ == "__main__":
    main()
Wired into CI (GitHub Actions):
# .github/workflows/workflow-ci.yml
name: Workflow CI
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml
      - name: "Gate 1: Static validation"
        run: python tools/validate_workflow.py
  schema-tests:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # The schema tests import yaml, so PyYAML is installed here as well
      - run: pip install pytest pyyaml
      - name: "Gate 2: Schema tests"
        run: pytest tests/unit/ tests/integration/ -v
Gate 2: Data Contract Verification
Gate 2 verifies that every Phase's declared context_inputs aligns with the actual output fields from upstream phases.
# tests/integration/test_context_alignment.py
import json
import re
from pathlib import Path

import yaml


def load_context_inputs(phase_id: str) -> list[str]:
    """Return the context_inputs declared for a phase in config.yaml."""
    config = yaml.safe_load(Path("skills/wf-bug-e2e/config.yaml").read_text())
    return config["phases"][phase_id].get("context_inputs", [])


def load_output_fields(phase_id: str) -> set[str]:
    """Return the top-level keys of the first JSON block in the phase template."""
    template = Path(f"skills/wf-bug-e2e/templates/{phase_id}.md").read_text()
    schema_match = re.search(r"```json\n({.*?})\n```", template, re.DOTALL)
    if schema_match:
        return set(json.loads(schema_match.group(1)).keys())
    return set()


def test_phase3_context_alignment():
    phase3_inputs = load_context_inputs("phase_3")
    phase1_outputs = load_output_fields("phase_1")
    prefix = "phases.phase1."
    for input_decl in phase3_inputs:
        if input_decl.startswith(prefix):
            field = input_decl.removeprefix(prefix)
            assert field in phase1_outputs, \
                f"Phase 3 needs '{field}' but Phase 1 output schema doesn't include it"
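The second failure from the introduction, a changed confidence threshold with stale routing tests, can be pinned down by a boundary test in the same gate. The following is only a sketch: `route_by_confidence`, the `CONFIDENCE_THRESHOLD` constant, and the `phase_4` / `gate_A` return values are hypothetical stand-ins, and the `>=` comparison is an assumption about how the real router treats a score exactly at the threshold.

```python
# Hypothetical sketch: boundary tests for a confidence-threshold router.
# route_by_confidence and the target names are illustrative assumptions,
# not the workflow's actual routing code.

CONFIDENCE_THRESHOLD = 0.90  # assumed to mirror the value configured for Phase 3


def route_by_confidence(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the next target: continue when confidence meets the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return "phase_4" if confidence >= threshold else "gate_A"


def test_routing_boundaries():
    # Exactly at the threshold: continues (>= is the assumed comparison).
    assert route_by_confidence(0.90) == "phase_4"
    # Just below the threshold: escalates to Gate A.
    assert route_by_confidence(0.89) == "gate_A"
    # A score between the old and new thresholds changes route with the threshold.
    assert route_by_confidence(0.92) == "phase_4"
    assert route_by_confidence(0.92, threshold=0.95) == "gate_A"
```

A test like this fails as soon as the threshold changes without the expectations being updated, which is the point: the behavior shift shows up in the commit, not in production.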
Version Number Rules
Workflow files are code. Every change deserves a version.
MAJOR.MINOR.PATCH
MAJOR: Phase structure changes
- Adding or removing a Phase
- Major routing logic changes (affects main pipeline conditions)
- Breaking changes to a subagent output schema
→ Risk of breaking in-progress workflow runs
→ Resume protocol must check version compatibility (see W3)
MINOR: Additive changes, backward compatible
- Adding a Step inside an existing Phase
- Adding gate options
- Template improvements (no field changes)
→ In-progress runs complete with the old version
→ New triggers use the new version
PATCH: Wording and configuration adjustments
- Prompt wording improvements
- Timeout adjustments
- Comment changes
→ Safe update; old state files resume without issue
Where version numbers live:
# SKILL.md (workflow entry file)
---
name: wf-bug-e2e
version: 1.3.0   # update before each release
last_updated: 2026-06-01
---
// workflow_state.json (bound at runtime, verified on resume)
{"workflow_version": "1.3.0", ...}
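The version rules suggest what a resume-time check could look like. The sketch below is one possible reading, not the resume protocol from W3: the function names and the three policy labels (`incompatible`, `finish_on_old`, `resume`) are hypothetical, and it assumes both versions are plain `MAJOR.MINOR.PATCH` strings.

```python
# Hypothetical sketch of a version check on resume; names and labels are assumptions.

def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resume_policy(state_version: str, current_version: str) -> str:
    """Classify how a saved run relates to the currently deployed workflow."""
    saved = parse_version(state_version)
    current = parse_version(current_version)
    if saved[0] != current[0]:
        # MAJOR differs: phase structure or output schema may have changed
        return "incompatible"
    if saved[1] != current[1]:
        # MINOR differs: the in-progress run completes with the old version
        return "finish_on_old"
    # Only PATCH differs, or the versions are identical: safe to resume
    return "resume"
```

For example, a state file written by 1.3.0 and resumed under 1.3.2 would be classified as `resume`, while the same file under 2.0.0 would be classified as `incompatible`.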
Release Process
Step 1: Document the reason
Write in CHANGELOG.md: why this change? what changed?
Not "optimized some logic" — write "changed Phase 3 confidence threshold
from 0.95 to 0.90, because historical data showed Gate A triggering at
18%, above the < 20% target"
Step 2: Run Gate 1 + Gate 2
python tools/validate_workflow.py
pytest tests/unit/ tests/integration/
Step 3: (MAJOR only) Run Gate 3
python run_eval.py --cases eval/cases.yaml --output baseline_new.json
python compare_eval.py baseline_current.json baseline_new.json
Step 4: Update version number
Edit SKILL.md version field
Add new version entry to CHANGELOG.md
Step 5: Release
Merge changes; old version enters deprecated status
Document any in-progress workflow runs using the old version
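The source does not show compare_eval.py, so the following is only a guess at the kind of comparison Step 3 performs. It assumes each results file is flat JSON mapping a metric name to a score where higher is better, and the 0.02 tolerance is an arbitrary placeholder; `load_metrics` and `find_regressions` are hypothetical names.

```python
# Hypothetical sketch of a baseline comparison; not the author's compare_eval.py.
import json
from pathlib import Path

MAX_DROP = 0.02  # assumed tolerance: a metric may fall by at most 0.02


def load_metrics(path: str) -> dict:
    """Read a flat {"metric_name": score} JSON file (assumed format)."""
    return json.loads(Path(path).read_text())


def find_regressions(baseline: dict, candidate: dict, max_drop: float = MAX_DROP) -> list:
    """Return one message per metric that is missing or dropped beyond the tolerance."""
    problems = []
    for name, old_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            problems.append(f"{name}: missing from new results")
        elif old_value - new_value > max_drop:
            problems.append(f"{name}: {old_value:.3f} -> {new_value:.3f}")
    return problems
```

A release script could call `find_regressions(load_metrics("baseline_current.json"), load_metrics("baseline_new.json"))` and exit non-zero when the list is non-empty, so that a threshold violation blocks the release.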
CHANGELOG Template
# CHANGELOG
## v1.3.0 (2026-06-01)
### Changed
- Phase 3 confidence threshold: 0.95 → 0.90
  - Reason: historical Gate A trigger rate reached 18%, above the <20% target
  - Impact: ~5% of cases now proceed to Phase 4 instead of triggering Gate A
### Added
- Phase 4 collect-all strategy declared explicitly
  - Previous behavior was implicit; now documented as collect-all
## v1.2.1 (2026-05-15)
### Fixed
- Phase 7 Jira comment idempotency detection
  - Problem: inconsistent run_id format caused duplicate comments in some cases
  - Fix: standardized run_id format to "wf-{jira_key}-{date}"
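Because Step 4 touches two files, a small consistency check can catch a release where only one of them was updated. This is a sketch under assumptions: it expects a `version:` line in the SKILL.md front matter and `## vX.Y.Z` headings with the newest entry first, as in the template above, and the function names are hypothetical.

```python
# Hypothetical sketch: check that SKILL.md and CHANGELOG.md agree on the version.
import re
from typing import Optional


def skill_version(skill_md: str) -> Optional[str]:
    """Extract the 'version:' value from the SKILL.md text."""
    match = re.search(r"^version:\s*(\d+\.\d+\.\d+)", skill_md, re.MULTILINE)
    return match.group(1) if match else None


def latest_changelog_version(changelog_md: str) -> Optional[str]:
    """Extract the first '## vX.Y.Z' heading, assumed to be the newest entry."""
    match = re.search(r"^##\s*v(\d+\.\d+\.\d+)", changelog_md, re.MULTILINE)
    return match.group(1) if match else None


def versions_match(skill_md: str, changelog_md: str) -> bool:
    """True only when SKILL.md declares a version equal to the newest CHANGELOG entry."""
    declared = skill_version(skill_md)
    return declared is not None and declared == latest_changelog_version(changelog_md)
```

A check like this would fit alongside the other Gate 1 checks, since it reads two files and needs no LLM call.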
Design Checklist
File structure
- [ ] Policy / Workflow / TaskSpec / Tool four-layer separation
- [ ] config.yaml centralizes mutable parameters (timeouts, retry counts, model selection)
- [ ] SKILL.md includes a version field
Gate 1 (static validation)
- [ ] All template references exist on disk
- [ ] All routing targets point to known phases or reserved keywords
- [ ] Runs automatically in CI on every commit
Gate 2 (schema tests)
- [ ] Tests verify that context_inputs align with upstream Phase output fields
- [ ] All routing condition edge cases have test coverage
- [ ] Runs automatically in CI on every commit
Gate 3 (end-to-end regression)
- [ ] Required for MAJOR version changes
- [ ] Results compared against baseline; threshold violations block release
Version management
- [ ] Every release updates SKILL.md version number
- [ ] CHANGELOG documents the reason for changes, not just what changed
Summary
- Three gates, three speeds: static validation in seconds catches file reference errors, schema tests in minutes catch contract misalignments, and end-to-end regression in hours catches behavior regressions. The first two handle most errors at low cost.
- Version numbers distinguish behavior changes from safe updates: a MAJOR bump means routing or schema changed, so in-progress runs need handling; a PATCH bump means wording changed, and old state files resume without issue.
- CHANGELOG documents reasons, not actions: "changed threshold from 0.95 to 0.9" is an action; "Gate A was triggering 18% of the time, above the 20% target" is the reason. Six months later you only need the reason.