Workflow Series (07): Engineering and Version Management — CI/CD for Workflows

Three failures that typically reach runtime undetected:

Adding an output field in templates/analyze.md but forgetting to declare it in workflow.md's context_inputs. The downstream phase receives nothing and silently continues.
Changing the routing confidence threshold from 0.95 to 0.9 without updating the routing tests. Edge-case behavior shifts, and you find out when a workflow runs in production.
Deleting a template file that workflow.md still references. Nothing fails until a run reaches the phase that needs it.

All three are catchable at commit time with automated checks, not at runtime.

Three CI Gates

Gate 1: Static validation (seconds, runs on every commit)
 - All referenced template files exist
 - Skills in config.yaml exist in the registry
 - Every phase's on_success / on_failure target is a known phase or reserved keyword
Gate 2: Schema tests (minutes, runs on every commit)
 - context_inputs declarations align with actual upstream output fields
 - No real LLM calls — validates data contracts only
 - Corresponds to Layer 1 + Layer 2 tests from the Evaluation article (W5)
Gate 3: End-to-end regression (hours, runs before merge)
 - Run eval/cases.yaml happy path through the full workflow
 - Compare results against baseline metrics
 - Corresponds to Layer 3 tests from W5

Gate 1: Static Validation Script

Gate 1 makes no LLM calls. It only reads files and checks references, so it completes in seconds:

#!/usr/bin/env python3
# tools/validate_workflow.py

import re
import sys
from pathlib import Path

import yaml

SKILL_DIR = Path("skills/wf-bug-e2e")
TEMPLATES_DIR = SKILL_DIR / "templates"
ERRORS = []


def check_template_references():
    """All templates referenced in workflow.md must exist on disk"""
    content = (SKILL_DIR / "workflow.md").read_text()
    refs = re.findall(r"template:\s*(\S+\.md)", content)
    for ref in refs:
        if not (TEMPLATES_DIR / ref).exists():
            ERRORS.append(f"Template not found: templates/{ref} (referenced in workflow.md)")


def check_phase_routing():
    """Every on_success / on_failure target must be a known phase or reserved keyword"""
    content = (SKILL_DIR / "workflow.md").read_text()
    phases = set(re.findall(r"^phase_(\w+):", content, re.MULTILINE))
    targets = re.findall(r"(?:on_success|on_failure|continue_to):\s*(\S+)", content)
    reserved = {"END", "human_escalation", "gate_A", "gate_B", "gate_C"}
    for target in targets:
        # Strip only a leading "phase_" so the rest of the name stays intact
        phase_name = target.removeprefix("phase_")
        if target not in reserved and phase_name not in phases:
            ERRORS.append(f"Routing target not found: '{target}'")


def check_config_skills():
    """Skills referenced in config.yaml must exist in the registry"""
    config_file = SKILL_DIR / "config.yaml"
    registry_file = Path("skills/registry.yaml")
    if not config_file.exists() or not registry_file.exists():
        return  # nothing to cross-check when either file is absent
    # safe_load returns None for an empty file, so fall back to an empty dict
    config = yaml.safe_load(config_file.read_text()) or {}
    registry = yaml.safe_load(registry_file.read_text()) or {}
    registered_ids = {s["id"] for s in registry.get("skills", [])}
    for phase_config in config.get("phases", {}).values():
        skill_id = phase_config.get("skill")
        if skill_id and skill_id not in registered_ids:
            ERRORS.append(f"Skill not in registry: '{skill_id}' (check config.yaml)")


def main():
    check_template_references()
    check_phase_routing()
    check_config_skills()
    if ERRORS:
        print("❌ Workflow validation failed:")
        for e in ERRORS:
            print(f"  - {e}")
        sys.exit(1)
    print("✅ Workflow validation passed")


if __name__ == "__main__":
    main()
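
To see what the routing check reports, it helps to run the same two regular expressions against a small in-memory file. The sketch below is illustrative only: the `SAMPLE` layout is a guess at what a workflow.md might look like (the real file format is not shown here), and `find_unknown_targets` is a hypothetical helper name, not part of the script above.

```python
import re

# Hypothetical workflow.md excerpt; the real file layout may differ.
SAMPLE = """\
phase_analyze:
  template: analyze.md
  on_success: phase_fix
  on_failure: human_escalation
phase_fix:
  template: fix.md
  on_success: phase_verify
  on_failure: END
"""

RESERVED = {"END", "human_escalation", "gate_A", "gate_B", "gate_C"}


def find_unknown_targets(content: str) -> list[str]:
    """Return routing targets that are neither declared phases nor reserved keywords."""
    # Phase declarations are lines that start with "phase_<name>:"
    phases = set(re.findall(r"^phase_(\w+):", content, re.MULTILINE))
    targets = re.findall(r"(?:on_success|on_failure|continue_to):\s*(\S+)", content)
    return [
        t for t in targets
        if t not in RESERVED and t.removeprefix("phase_") not in phases
    ]


print(find_unknown_targets(SAMPLE))  # ['phase_verify']
```

Here `phase_verify` is routed to but never declared, which is the kind of mistake that would otherwise surface only when a run reaches that transition.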

Wired into CI (GitHub Actions):

# .github/workflows/workflow-ci.yml
name: Workflow CI
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml
      - name: Gate 1 - Static validation
        run: python tools/validate_workflow.py
  schema-tests:
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # The schema tests import yaml, so PyYAML is needed alongside pytest
      - run: pip install pytest pyyaml
      - name: Gate 2 - Schema tests
        run: pytest tests/unit/ tests/integration/ -v

Gate 2: Data Contract Verification

Gate 2 verifies that every Phase's declared context_inputs aligns with the actual output fields from upstream phases.

# tests/integration/test_context_alignment.py

import json
import re
from pathlib import Path

import yaml

SKILL_DIR = Path("skills/wf-bug-e2e")
FENCE = "`" * 3  # Markdown code-fence marker that wraps the JSON schema in a template


def load_context_inputs(phase_id: str) -> list[str]:
    """Return the context_inputs declared for a phase in config.yaml."""
    config = yaml.safe_load((SKILL_DIR / "config.yaml").read_text())
    return config["phases"][phase_id].get("context_inputs", [])


def load_output_fields(phase_id: str) -> set[str]:
    """Return the top-level keys of the first fenced JSON block in the phase template."""
    template = (SKILL_DIR / "templates" / f"{phase_id}.md").read_text()
    schema_match = re.search(FENCE + r"json\n(\{.*?\})\n" + FENCE, template, re.DOTALL)
    if schema_match:
        return set(json.loads(schema_match.group(1)).keys())
    return set()


def test_phase3_context_alignment():
    phase3_inputs = load_context_inputs("phase_3")
    phase1_outputs = load_output_fields("phase_1")
    for input_decl in phase3_inputs:
        if input_decl.startswith("phases.phase1."):
            field = input_decl.removeprefix("phases.phase1.")
            assert field in phase1_outputs, (
                f"Phase 3 needs '{field}' but Phase 1 output schema doesn't include it"
            )
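
Gate 2 is also the natural home for routing edge cases, such as the confidence threshold change mentioned at the start. The following is a minimal sketch, not the workflow's actual router: `route_phase3`, the inclusive comparison, and the hard-coded threshold are all assumptions made for illustration. In practice the threshold would presumably be read from config.yaml.

```python
CONFIDENCE_THRESHOLD = 0.90  # assumed to mirror the value kept in config.yaml


def route_phase3(confidence: float, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Hypothetical router: proceed when confidence meets the threshold, else go to Gate A."""
    return "phase_4" if confidence >= threshold else "gate_A"


# Boundary cases around the threshold; these must be revisited whenever it changes.
CASES = [
    (0.89, "gate_A"),   # just below the threshold
    (0.90, "phase_4"),  # exactly on the threshold (inclusive by assumption)
    (0.91, "phase_4"),  # just above
]


def test_phase3_routing_boundaries():
    for confidence, expected in CASES:
        actual = route_phase3(confidence)
        assert actual == expected, f"confidence={confidence}: got {actual}, want {expected}"
```

Because the boundary values are written out explicitly, changing the threshold without touching the tests makes at least one case fail, which is the signal the commit-time check is supposed to give.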

Version Number Rules

Workflow files are code. Every change deserves a version.

MAJOR.MINOR.PATCH
MAJOR: Phase structure changes
 - Adding or removing a Phase
 - Major routing logic changes (affects main pipeline conditions)
 - Breaking changes to a subagent output schema
 → Risk of breaking in-progress workflow runs
 → Resume protocol must check version compatibility (see W3)
MINOR: Additive changes, backward compatible
 - Adding a Step inside an existing Phase
 - Adding gate options
 - Template improvements (no field changes)
 → In-progress runs complete with the old version
 → New triggers use the new version
PATCH: Wording and configuration adjustments
 - Prompt wording improvements
 - Timeout adjustments
 - Comment changes
 → Safe update; old state files resume without issue

Where version numbers live:

# SKILL.md (workflow entry file)
---
name: wf-bug-e2e
version: 1.3.0  # update before each release
last_updated: 2026-06-01
---

// workflow_state.json (bound at runtime, verified on resume)
{"workflow_version": "1.3.0", ...}
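
The version rules above imply a decision at resume time. One way to express it, as a sketch under assumptions: the function names and the three return labels (`block`, `finish_on_old`, `resume`) are hypothetical, and the actual resume protocol is the one described in W3.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resume_action(state_version: str, current_version: str) -> str:
    """Decide how to treat a run whose state file was written by state_version."""
    old = parse_version(state_version)
    new = parse_version(current_version)
    if old[0] != new[0]:
        return "block"          # MAJOR differs: phase structure or schema may not match
    if old[1] != new[1]:
        return "finish_on_old"  # MINOR differs: let the run complete on its old version
    return "resume"             # identical or PATCH-only difference: safe to resume


print(resume_action("1.3.0", "2.0.0"))  # block
print(resume_action("1.2.1", "1.3.0"))  # finish_on_old
print(resume_action("1.3.0", "1.3.1"))  # resume
```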

Release Process

Step 1: Document the reason
 Write in CHANGELOG.md: why this change? what changed?
 Not "optimized some logic" — write "changed Phase 3 confidence threshold
 from 0.95 to 0.90, because historical data showed Gate A triggering at
 18%, above the < 20% target"
Step 2: Run Gate 1 + Gate 2
 python tools/validate_workflow.py
 pytest tests/unit/ tests/integration/
Step 3: (MAJOR only) Run Gate 3
 python run_eval.py --cases eval/cases.yaml --output baseline_new.json
 python compare_eval.py baseline_current.json baseline_new.json
Step 4: Update version number
 Edit SKILL.md version field
 Add new version entry to CHANGELOG.md
Step 5: Release
 Merge changes; old version enters deprecated status
 Document any in-progress workflow runs using the old version
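
The baseline comparison in Step 3 can be as simple as a per-metric diff. The sketch below is not the real compare_eval.py; it assumes each baseline file decodes to a flat mapping of metric name to a float where higher is better, and the metric names and the `MAX_DROP` tolerance are invented for illustration.

```python
MAX_DROP = 0.02  # assumed tolerance; a real setup would likely set this per metric


def find_regressions(current: dict[str, float], new: dict[str, float],
                     max_drop: float = MAX_DROP) -> list[str]:
    """List metrics that vanished from the new run or dropped by more than max_drop."""
    problems = []
    for name, old_value in current.items():
        if name not in new:
            problems.append(f"{name}: missing from new results")
        elif old_value - new[name] > max_drop:
            problems.append(f"{name}: {old_value:.3f} -> {new[name]:.3f}")
    return problems


# Hypothetical metric names and values
baseline_current = {"pass_rate": 0.90, "routing_accuracy": 0.80}
baseline_new = {"pass_rate": 0.85, "routing_accuracy": 0.79}

print(find_regressions(baseline_current, baseline_new))
# ['pass_rate: 0.900 -> 0.850']
```

A non-empty result would block the release. The small drop in `routing_accuracy` stays within tolerance and is not reported.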

CHANGELOG Template

# CHANGELOG
## v1.3.0 (2026-06-01)
### Changed
- Phase 3 confidence threshold: 0.95 → 0.90
 - Reason: historical Gate A trigger rate reached 18%, above the <20% target
 - Impact: ~5% of cases now proceed to Phase 4 instead of triggering Gate A
### Added
- Phase 4 collect-all strategy declared explicitly
 - Previous behavior was implicit; now documented as collect-all
## v1.2.1 (2026-05-15)
### Fixed
- Phase 7 Jira comment idempotency detection
 - Problem: inconsistent run_id format caused duplicate comments in some cases
 - Fix: standardized run_id format to "wf-{jira_key}-{date}"

Design Checklist

File structure

[ ] Policy / Workflow / TaskSpec / Tool four-layer separation
[ ] config.yaml centralizes mutable parameters (timeouts, retry counts, model selection)
[ ] SKILL.md includes a version field

Gate 1 (static validation)

[ ] All template references exist on disk
[ ] All routing targets point to known phases or reserved keywords
[ ] Runs automatically in CI on every commit

Gate 2 (schema tests)

[ ] Tests verify that context_inputs align with upstream Phase output fields
[ ] All routing condition edge cases have test coverage
[ ] Runs automatically in CI on every commit

Gate 3 (end-to-end regression)

[ ] Required for MAJOR version changes
[ ] Results compared against baseline; threshold violations block release

Version management

[ ] Every release updates SKILL.md version number
[ ] CHANGELOG documents the reason for changes, not just what changed

Summary

Three gates, three speeds: static validation in seconds catches file reference errors, schema tests in minutes catch contract misalignments, and end-to-end regression in hours catches behavior regressions. The first two catch most errors at low cost.
Version numbers distinguish behavior changes from safe updates: a MAJOR bump means routing or schema changed, so in-progress runs need handling; a PATCH bump means only wording or configuration changed, so old state files resume without issue.
CHANGELOG documents reasons, not actions: "changed threshold from 0.95 to 0.9" is an action; "Gate A was triggering 18% of the time, above the 20% target" is the reason. Six months later, the reason is the part you will need.
