Durable Execution Pattern
Serialize execution as recoverable checkpoints. Any interruption resumes from the most recent one with results identical to uninterrupted execution. Temporal.io implements this at the code layer, but the same semantics work with a JSON file.
State File Structure
{
  "workflow_id": "wf-bug-e2e-AE-33995-20260601",
  "workflow_version": "1.3.0",
  "jira_key": "AE-33995",
  "started_at": "2026-06-01T10:00:00+08:00",
  "phase": "phase_4",
  "phases": {
    "phase_1": {
      "status": "done",
      "completed_at": "2026-06-01T10:02:30+08:00",
      "output_file": "bug_info.json"
    },
    "phase_4": {
      "status": "in_progress",
      "step": "step_4_2",
      "steps": {
        "step_4_1": {"status": "done", "output_file": "candidate_a.json"},
        "step_4_2": {"status": "in_progress"},
        "step_4_3": {"status": "pending"}
      }
    }
  }
}
Resume Protocol
import json
from pathlib import Path

def resume_workflow(state_file: Path) -> None:
    state = json.loads(state_file.read_text())
    # json.loads preserves key order, so phases are visited in file order
    for phase_id, phase_data in state["phases"].items():
        if phase_data["status"] == "done":
            continue  # skip completed phases
        if phase_data["status"] in ("in_progress", "pending"):
            # in_progress is treated the same as pending: re-execute
            # (idempotency guarantees this is safe)
            execute_phase(phase_id, state)
            return
The key principle: trust only the state file, not memory. The main Agent doesn't remember what it did — it reads the status field. Phases marked in_progress get re-executed, which requires every phase operation to be idempotent.
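The example state file nests steps inside a phase, while the loop above resumes at phase granularity. A minimal sketch of locating the first unfinished phase and, where one exists, the first unfinished step, assuming the layout shown under State File Structure. `find_resume_point` is a hypothetical helper name, not part of any library.

```python
from typing import Optional, Tuple


def find_resume_point(state: dict) -> Optional[Tuple[str, Optional[str]]]:
    """Return (phase_id, step_id) for the first unfinished unit of work.

    step_id is None when the phase has no "steps" mapping, or when all of
    its steps are done but the phase itself was never marked done.
    Returns None when every phase is done.
    """
    # Dicts keep insertion order, so phases and steps are visited in file order.
    for phase_id, phase in state["phases"].items():
        if phase["status"] == "done":
            continue
        for step_id, step in phase.get("steps", {}).items():
            if step["status"] != "done":
                return phase_id, step_id
        return phase_id, None
    return None


example_state = {
    "phases": {
        "phase_1": {"status": "done"},
        "phase_4": {
            "status": "in_progress",
            "steps": {
                "step_4_1": {"status": "done"},
                "step_4_2": {"status": "in_progress"},
                "step_4_3": {"status": "pending"},
            },
        },
    }
}

print(find_resume_point(example_state))  # ('phase_4', 'step_4_2')
```

Because the function reads only the state dictionary, calling it any number of times gives the same answer, which is the property the resume protocol depends on.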
Double-Ended Writes
Write to the state file both before a phase starts and after it completes — not only on completion:
def execute_phase(phase_id: str, state: dict) -> None:
    # Before start: mark in_progress
    # (if a crash occurs, resume finds this phase and re-executes it)
    state["phases"][phase_id]["status"] = "in_progress"
    write_state(state)
    try:
        result = run_phase_logic(phase_id, state)
        # After completion: mark done, record output file path
        state["phases"][phase_id]["status"] = "done"
        state["phases"][phase_id]["output_file"] = result.output_file
        write_state(state)
    except Exception as e:
        state["phases"][phase_id]["status"] = "failed"
        state["phases"][phase_id]["error"] = str(e)
        write_state(state)
        raise
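`write_state` is called throughout but never defined in the text, and it matters how it writes: a crash halfway through a plain overwrite can leave a truncated state file, which defeats the point of checkpointing. One possible shape, sketched under assumptions, writes to a temporary file and swaps it into place with `os.replace`. The default path `workflow_state.json` is hypothetical; the optional second parameter is there only so that both call forms used in this article (`write_state(state)` and `write_state(state, state_file)`) work.

```python
import json
import os
import tempfile
from pathlib import Path

DEFAULT_STATE_FILE = Path("workflow_state.json")  # hypothetical default location


def write_state(state: dict, state_file: Path = DEFAULT_STATE_FILE) -> None:
    """Persist state so a reader sees either the old file or the new one."""
    state_file = Path(state_file)
    directory = state_file.parent
    directory.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(directory), prefix=state_file.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, state_file)  # swap the new file in for the old one
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The rename is atomic on POSIX filesystems; behavior on network or unusual filesystems is an assumption worth verifying before relying on it.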
Idempotency Design
The resume protocol re-executes in_progress phases, meaning a phase can run twice. Operations that aren't idempotent produce duplicate side effects: two Jira comments, two git commits, two notification emails.
Idempotency Analysis by Operation Type
File writes (naturally idempotent)
# Overwrite is idempotent — running twice produces the same result
output_file.write_text(json.dumps(result)) # ✅
Jira comments (not idempotent — requires detection)
# ❌ Wrong: direct write produces a duplicate comment on re-run
jira.add_comment(issue_key, comment_text)

# ✅ Correct: check for an existing comment with this run's ID first
def add_comment_idempotent(issue_key: str, comment_text: str, run_id: str) -> None:
    existing = jira.get_comments(issue_key)
    marker = f"[run_id:{run_id}]"  # unique marker per workflow run
    if any(marker in c.body for c in existing):
        return  # already written, skip
    jira.add_comment(issue_key, f"{marker}\n{comment_text}")
Git commits (not idempotent — requires detection)
# ❌ Wrong: direct commit creates a second commit on re-run
git.commit(message)

# ✅ Correct: check whether the commit result file exists with passed=true
def commit_idempotent(message: str, output_file: Path) -> dict:
    if output_file.exists():
        result = json.loads(output_file.read_text())
        if result.get("passed"):
            return result  # already committed successfully
    commit_sha = git.commit(message)
    result = {"passed": True, "sha": commit_sha}
    output_file.write_text(json.dumps(result))
    return result
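The result-file check leaves one gap: if the process dies after the commit succeeds but before the result file is written, the next run finds no file and commits again. One way to narrow that gap, sketched here under assumptions, is to put a run marker in the commit message and ask git itself whether that commit already exists. `commit_once`, the `Run-Id:` line, and the injectable `run_git` parameter are illustrative names, not part of the original workflow; the sketch shells out to the `git` CLI and assumes changes are already staged and the repository has at least one commit.

```python
import subprocess
from typing import Callable, List


def _run_git(args: List[str]) -> str:
    """Run `git <args>` in the current directory and return its stdout."""
    completed = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return completed.stdout


def commit_once(
    message: str,
    run_id: str,
    run_git: Callable[[List[str]], str] = _run_git,
) -> str:
    """Commit staged changes unless a commit for this run_id already exists.

    Returns the SHA of the existing or newly created commit.
    Assumes run IDs are not prefixes of one another (for example,
    fixed-length IDs), because --grep matches substrings.
    """
    marker = f"Run-Id: {run_id}"
    found = run_git(
        ["log", "--fixed-strings", f"--grep={marker}", "--format=%H", "-n", "1"]
    ).strip()
    if found:
        return found  # a commit for this run is already in history
    run_git(["commit", "-m", f"{message}\n\n{marker}"])
    return run_git(["rev-parse", "HEAD"]).strip()
```

Detection here reads only repository history, so it holds up after a restart and has no side effects of its own. The `run_git` parameter exists so the logic can be exercised with a fake in tests.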
External API triggers (conditionally idempotent)
# Adding a Gerrit reviewer: duplicate adds don't error; naturally idempotent ✅
gerrit.add_reviewer(change_id, reviewer)

# Creating a cron job: duplicate creates produce two jobs ❌
# Fix: list first, create only if not already present
def create_cron_idempotent(job_config: dict) -> None:
    existing_jobs = cron.list_jobs()
    if any(j["name"] == job_config["name"] for j in existing_jobs):
        return  # already exists, skip
    cron.create_job(job_config)
Idempotency Self-Check
For every new Step, answer these three questions before implementing:
□ If this step runs twice, does it produce side effects?
□ If yes, how do you detect "already executed" and skip?
□ Is the detection logic itself idempotent?
The third question is easy to miss. If detection depends on in-memory state or has side effects of its own, it fails in the resume scenario just like the original operation.
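The first and third questions can be checked mechanically: run the step twice against a fake and count the side effects. A minimal sketch, assuming an in-memory `FakeJira` (hypothetical, standing in for a real client) and a variant of `add_comment_idempotent` that takes the client as a parameter so the fake can be injected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    body: str


@dataclass
class FakeJira:
    """In-memory stand-in for a Jira client; tracks comments for one issue."""
    comments: List[Comment] = field(default_factory=list)

    def get_comments(self, issue_key: str) -> List[Comment]:
        return list(self.comments)

    def add_comment(self, issue_key: str, body: str) -> None:
        self.comments.append(Comment(body))


def add_comment_idempotent(jira, issue_key: str, comment_text: str, run_id: str) -> None:
    marker = f"[run_id:{run_id}]"
    # Detection reads remote state only: no in-memory flags, no writes.
    if any(marker in c.body for c in jira.get_comments(issue_key)):
        return
    jira.add_comment(issue_key, f"{marker}\n{comment_text}")


def count_comments_after_two_runs() -> int:
    jira = FakeJira()
    for _ in range(2):  # simulate crash-and-resume by running the step twice
        add_comment_idempotent(jira, "AE-33995", "Root cause identified", "wf-001")
    return len(jira.comments)


print(count_comments_after_two_runs())  # 1
```

A step whose count goes above one after the second run needs a detection check before it is safe under the resume protocol.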
State File Version Binding
Modify a workflow definition mid-run — add a new Step, for example — and the old state file has no record of it. When the workflow resumes, the main Agent has no basis for handling the missing step.
The fix: bind the workflow version in the state file and verify it on resume.
import json
from datetime import datetime, timezone
from pathlib import Path

class WorkflowVersionMismatch(Exception):
    """Raised when a state file was written by a different workflow version."""

def start_or_resume(state_file: Path, current_version: str) -> dict:
    if state_file.exists():
        state = json.loads(state_file.read_text())
        saved_version = state.get("workflow_version")
        if saved_version != current_version:
            raise WorkflowVersionMismatch(
                f"State file version: {saved_version}\n"
                f"Current workflow version: {current_version}\n"
                f"Options:\n"
                f"  1. Resume with saved state using old workflow ({saved_version})\n"
                f"  2. Start fresh with new workflow ({current_version})\n"
                f"  3. Manually migrate the state file"
            )
        return state  # versions match: resume normally
    # New run: create state file
    state = {
        "workflow_version": current_version,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "phases": {}
    }
    write_state(state, state_file)
    return state
Version Number Rules (MAJOR.MINOR.PATCH)
MAJOR: Phase structure changes (add/remove Phase, major routing changes)
  → Breaks in-progress runs; requires explicit handling
  → Cannot resume directly; user must decide
MINOR: Add Step, template improvements, new gate options
  → Backward compatible; in-progress runs complete with old version
  → New runs use new version
PATCH: Wording tweaks, config adjustments, behavior unchanged
  → Safe to upgrade; old state files resume without issue
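The strict equality check in `start_or_resume` rejects any version difference, while the rules above treat the three levels differently. A minimal sketch of turning the rules into a decision, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release suffix. `resume_policy` and its three return labels are hypothetical names chosen for illustration.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resume_policy(saved: str, current: str) -> str:
    """Map a saved/current version pair onto the rules above."""
    saved_major, saved_minor, _ = parse_version(saved)
    current_major, current_minor, _ = parse_version(current)
    if saved_major != current_major:
        return "ask_user"  # MAJOR: cannot resume directly
    if saved_minor != current_minor:
        return "finish_on_saved_version"  # MINOR: complete the run on the old version
    return "resume"  # PATCH or identical: old state files resume as-is


print(resume_policy("1.3.0", "2.0.0"))  # ask_user
print(resume_policy("1.3.0", "1.4.0"))  # finish_on_saved_version
print(resume_policy("1.3.0", "1.3.2"))  # resume
```

Only the `ask_user` outcome needs to reach a person; the other two can be handled without interrupting the run.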
Design Checklist
State persistence
- [ ] Every Phase/Step writes in_progress before starting and done after completing
- [ ] Resume protocol reads only the state file, not conversation history
- [ ] State file includes workflow_version
Idempotency
- [ ] All external writes (Jira comments, git commits, API calls) have idempotency checks
- [ ] Detection uses a unique identifier (run_id or output file existence)
- [ ] The detection logic itself produces no side effects
Version binding
- [ ] Version is verified on resume against the current workflow version
- [ ] MAJOR version changes have an explicit handling strategy
- [ ] Version mismatches surface user-actionable options, not just an error exit
Summary
- Durable Execution requires double-ended writes: write in_progress before the phase starts and done after, so a crash at any point allows precise resumption
- Resume requires idempotency: in_progress phases get re-executed, so every external write must be safe to run twice; file writes are naturally idempotent, while Jira comments and git commits need explicit detection
- Version binding prevents silent errors: when a workflow is modified, a mismatch between the old state file and the new workflow version should surface actionable options rather than silently applying new logic to old state