Pipeline Plan 184

Seth Ford edited this page Mar 10, 2026 · 2 revisions

Implementation Plan: Failure Root Cause Classifier with Automated Platform Issue Creation

Issue: #184 Branch: feat/-failure-root-cause-classifier-with-auto-184 Complexity: Standard Estimated files: 10 modified, 2 new

Brainstorming & Design Decisions

Requirements Clarity

Minimum viable change: Build a root cause classifier library (scripts/lib/root-cause.sh) that categorizes pipeline failures into systematic types (platform_bug, code_bug, config_error, infra_issue, rate_limit, context_exhaustion, external_dep), wire it into the daemon's failure handling path, add historical pattern learning, expose a CLI command, add dashboard visualization, and auto-create GitHub issues for platform bugs.

Implicit requirements:

The classifier must integrate seamlessly with the existing daemon_on_failure() flow in scripts/lib/daemon-failure.sh
Historical learning must feed back into classification confidence (not just regex matching)
Dashboard needs both an API endpoint and frontend component
CLI command needed for standalone shipwright root-cause usage
Must respect NO_GITHUB environment variable for local/offline mode
Must follow Bash 3.2 compatibility rules

Acceptance criteria (from issue):

Failure classifier analyzes error-log.jsonl and categorizes root cause (platform/user/env/config)
Decision tree trained on historical failure patterns from events.jsonl
Platform bugs trigger automatic GitHub issue creation with error context, affected stages, and reproduction steps
Dashboard shows failure breakdown by category
Reduce repeat platform failures by >30% within 2 weeks of deployment (measurement criterion)

Design Alternatives Considered

Alternative 1: Simple regex classifier (CHOSEN)

Pattern-match error messages against known signatures per category
Historical boosting adjusts confidence based on past classifications
Trade-offs: Simple, fast, Bash-native, easy to extend. Less sophisticated than ML but perfectly adequate for structured error messages. Minimal blast radius — one new library file + integration points.

Alternative 2: LLM-based classifier

Send error messages to Claude for classification
Trade-offs: More accurate on ambiguous errors, but adds API cost per failure, requires Claude CLI availability, slower, and creates a circular dependency (classifier needs Claude, but Claude failures are what we're classifying). Rejected for operational reliability.

Alternative 3: SQLite-based decision tree

Build a proper decision tree in the SQLite layer using historical data
Trade-offs: More sophisticated learning, but significantly more complex, requires the sw-db.sh layer, and the JSONL-based history approach is simpler and sufficient. Over-engineering for the current scale.

Risk Assessment

Risk	Mitigation
False positive platform bug issues spam the repo	Confidence threshold (>70%) + deduplication via error signature (cksum)
Classifier misidentifies user code bugs as platform bugs	Conservative regex patterns that require Shipwright-specific markers (sw-*.sh, shipwright, pipeline-state)
Historical data poisoned by misclassifications	Confidence boosting is bounded (+10 max, -5 for disagreement), so bad data self-corrects over time
`rootcause_main()` failure crashes the daemon	All classifier calls wrapped in `
Large error-log.jsonl causes slow analysis	Analysis capped at last 50 entries; history analysis returns top 50 patterns

Architecture

Component Diagram

 ┌──────────────────────────┐
 │ Pipeline Failure │
 │ (exit code != 0) │
 └──────────┬───────────────┘
 │
 ▼
 ┌──────────────────────────┐ ┌──────────────────────────┐
 │ PostToolUse Hook │────▶│ error-log.jsonl │
 │ (.claude/hooks/) │ │ (per-tool error capture)│
 └──────────────────────────┘ └──────────────────────────┘
 │
 ▼
 ┌──────────────────────────┐ ┌──────────────────────────┐
 │ daemon_on_failure() │────▶│ rootcause_main() │
 │ (lib/daemon-failure.sh) │ │ (lib/root-cause.sh) │
 └──────────────────────────┘ └──────────┬───────────────┘
 │
 ┌─────────────────────┼───────────────────┐
 ▼ ▼ ▼
 ┌────────────────┐ ┌──────────────────┐ ┌────────────────┐
 │ rootcause_ │ │ rootcause_ │ │ rootcause_ │
 │ classify() │ │ suggest_fix() │ │ learn() │
 │ + history boost │ │ │ │ │
 └───────┬────────┘ └──────────────────┘ └───────┬────────┘
 │ │
 ▼ ▼
 ┌────────────────┐ ┌───────────────────┐
 │ rootcause_ │ │ root-causes.jsonl │
 │ create_ │ │ (learning data) │
 │ platform_issue │ └───────────────────┘
 └───────┬────────┘
 │
 ▼
 ┌────────────────┐
 │ GitHub Issue │
 │ (auto-created) │
 └────────────────┘
Dashboard Layer:
┌──────────────────────────────────────────────────────────────┐
│ GET /api/root-cause/breakdown │
│ → reads root-causes.jsonl │
│ → returns category distribution + avg confidence │
│ → consumed by insights.ts frontend view │
└──────────────────────────────────────────────────────────────┘

Interface Contracts

// Classification output (from rootcause_classify)
interface Classification {
 category: "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug" | "config_error" | "external_dep" | "code_bug";
 confidence: number; // 0-99
 evidence: string[];
 suggested_action: string;
}
// Fix suggestion output (from rootcause_suggest_fix)
interface FixSuggestion {
 category: string;
 suggestions: string;
 actionability: number; // 0-100
}
// Main result (from rootcause_main)
interface RootCauseResult {
 classification: Classification;
 fix: FixSuggestion;
}
// Learning entry (root-causes.jsonl line)
interface LearningEntry {
 category: string;
 confidence: number;
 message: string; // first 200 chars
 recorded_at: string; // ISO 8601
}
// Dashboard API response (GET /api/root-cause/breakdown)
interface RootCauseBreakdown {
 breakdown: Array<{
 category: string;
 count: number;
 percentage: number;
 avg_confidence: number;
 }>;
 total: number;
 period: number; // days
}

Error Boundaries

Classifier errors → caught by || true in daemon integration, failure continues without classification
Learning write errors → caught silently, classification still returned
GitHub issue creation errors → caught by || true, logged but non-blocking
Dashboard API errors → returns empty breakdown {breakdown: [], total: 0, period: N}
Malformed JSONL data → jq handles gracefully with 2>/dev/null fallback

Data Flow

Error occurs → PostToolUse hook captures to error-log.jsonl
 → daemon_on_failure() extracts error snippet (last 100 lines)
 → rootcause_classify() matches against 7 category patterns
 → rootcause_boost_from_history() adjusts confidence ±10 from learning data
 → rootcause_suggest_fix() generates actionable suggestions
 → rootcause_learn() appends to root-causes.jsonl (atomic write)
 → rootcause_create_platform_issue() files GitHub issue if platform_bug/config_error >70%
 → daemon enriches GitHub retry/failure comments with root cause section
 → dashboard reads root-causes.jsonl for /api/root-cause/breakdown

Files to Modify

New Files

File	Purpose
`scripts/lib/root-cause.sh`	Core classifier library (499 lines)
`scripts/sw-root-cause.sh`	CLI entry point (197 lines)
`scripts/sw-root-cause-test.sh`	Test suite (374 lines, 53 tests)

Modified Files

File	Changes
`scripts/lib/daemon-failure.sh`	Wire `rootcause_main()` into `daemon_on_failure()`, enrich retry/failure comments
`scripts/sw`	Add `root-cause` dispatch to CLI router
`scripts/sw-pipeline.sh`	Source root-cause library
`scripts/lib/pipeline-cli.sh`	Source root-cause library
`scripts/lib/pipeline-commands.sh`	Source root-cause library
`dashboard/server.ts`	Add `GET /api/root-cause/breakdown` endpoint
`dashboard/src/types/api.ts`	Add `RootCauseBreakdown` TypeScript interface
`dashboard/src/core/api.ts`	Add `fetchRootCauseBreakdown()` client function
`dashboard/src/views/insights.ts`	Add failure breakdown visualization
`scripts/sw-server-api-test.sh`	Add test for breakdown API endpoint

Implementation Steps

Step 1: Core Classifier Library (`scripts/lib/root-cause.sh`)

Create the root cause classification library with these functions:

rootcause_classify(error_message, stage, exit_code) — Pattern-match error messages against 7 categories using cascading regex checks. Each category has distinct patterns:
- rate_limit: 429, rate limit, throttled, quota
- context_exhaustion: context window, token limit, auto-compact
- infra_issue: timeout, OOM, disk full, network, socket
- platform_bug: sw-*.sh, shipwright, unbound variable
- config_error: missing config, invalid json, bad template
- external_dep: npm ERR, pip install, cargo error
- code_bug: AssertionError, SyntaxError, test fail (default fallback at 45% confidence)
rootcause_boost_from_history(error_message, category, confidence) — Search ~/.shipwright/optimization/root-causes.jsonl for similar past classifications (first 100 chars match). Boost +2 per agreement (max +10, cap 99). Penalize -5 for disagreement (floor 10).
rootcause_analyze_history() — Aggregate learning data into frequency map grouped by message prefix, returning top 50 patterns with count >= 2.
rootcause_analyze_error_log() — Read last 50 entries from error-log.jsonl, classify each, group by category.
rootcause_suggest_fix(category) — Return category-specific fix suggestions with actionability scores (50-90).
rootcause_learn(category, confidence, message) — Atomic append to root-causes.jsonl with proper escaping via jq --arg.
rootcause_create_platform_issue(classification_json, error_message, stage) — Create GitHub issue for platform_bug/config_error with >70% confidence. Deduplicates via error signature (cksum). Respects NO_GITHUB.
rootcause_report() — Generate markdown report: category distribution, top 5 frequent causes, platform bug trend (24h vs 7d), avg confidence by category.
rootcause_main(error_message, stage, exit_code) — Orchestrate: classify → suggest → learn → create issue → return JSON.

Step 2: Daemon Integration (`scripts/lib/daemon-failure.sh`)

Source root-cause.sh at module load
In daemon_on_failure(): call rootcause_main() with last 100 lines of issue log
Extract rc_category, rc_confidence, rc_suggestions from JSON result
Emit daemon.root_cause event with category, confidence, issue number
Build rc_section for GitHub retry comments (collapsible details with root cause + suggestions)
Build rc_final_section for final failure comments

Step 3: CLI Entry Point (`scripts/sw-root-cause.sh`)

Commands:

shipwright root-cause classify <message> — Classify an error message (supports stdin)
- --stage <stage> — Stage context
- --exit-code <code> — Exit code context
- --json — JSON output
shipwright root-cause analyze — Analyze error-log.jsonl patterns
shipwright root-cause report — Generate full report from learning history
shipwright root-cause history [limit] — Show recent classifications (default 20)

Step 4: CLI Router (`scripts/sw`)

Add root-cause) case to the main dispatcher.

Step 5: Dashboard API (`dashboard/server.ts`)

Add GET /api/root-cause/breakdown endpoint:

Query param: period (default 30 days)
Read ~/.shipwright/optimization/root-causes.jsonl
Filter by recorded_at >= cutoff
Group by category, count, avg confidence
Return sorted breakdown with percentages

Step 6: Dashboard Frontend

dashboard/src/types/api.ts: Add RootCauseBreakdown interface
dashboard/src/core/api.ts: Add fetchRootCauseBreakdown() function
dashboard/src/views/insights.ts: Add failure breakdown visualization with category bars and confidence indicators

Step 7: Pipeline Source Integration

Ensure root-cause.sh is sourced in sw-pipeline.sh, lib/pipeline-cli.sh, lib/pipeline-commands.sh so the classifier is available throughout the pipeline.

Step 8: Tests

Write comprehensive tests in scripts/sw-root-cause-test.sh:

Classification tests (9): Each category + fallback + empty input
Error log analysis (2): With/without error-log.jsonl
Fix suggestions (6): Each category returns appropriate suggestions
Learning system (3): Write, accumulate, verify JSONL format
Platform issue creation (3): NO_GITHUB skip, confidence threshold, category filter
Report generation (2): Category distribution, trend analysis
Historical pattern analysis (7): Boost, penalty, empty state, multiple matches
Integration tests (2): Full classify→learn→boost workflow
CLI entry points (9): Each subcommand + help + version
Daemon integration (6): Verify daemon-failure.sh calls rootcause functions

Add dashboard API test to scripts/sw-server-api-test.sh.

Task Checklist

Task 1: Create scripts/lib/root-cause.sh with classifier, learning, issue creation functions
Task 2: Add rootcause_analyze_history() and rootcause_boost_from_history() for historical learning
Task 3: Wire root cause classifier into daemon_on_failure() in scripts/lib/daemon-failure.sh
Task 4: Enhance daemon retry/failure GitHub comments with root cause classification sections
Task 5: Create scripts/sw-root-cause.sh CLI entry point with classify/analyze/report/history
Task 6: Add root-cause dispatch to scripts/sw CLI router
Task 7: Source root-cause.sh in pipeline scripts (sw-pipeline.sh, pipeline-cli.sh, pipeline-commands.sh)
Task 8: Add GET /api/root-cause/breakdown endpoint to dashboard/server.ts
Task 9: Add RootCauseBreakdown TypeScript interface and API client function
Task 10: Add failure breakdown visualization to dashboard/src/views/insights.ts
Task 11: Write test suite scripts/sw-root-cause-test.sh (~53 tests across 10 test groups)
Task 12: Add dashboard API test for /api/root-cause/breakdown
Task 13: Run sw-root-cause-test.sh, sw-lib-daemon-failure-test.sh, and sw-server-api-test.sh
Task 14: Run dashboard vitest suite to verify TypeScript changes

Testing Approach

Test Pyramid Breakdown

Unit tests (40): Classification per category (9), fix suggestions (6), learning system (3), history analysis (7), error log analysis (2), report generation (2), platform issue filters (3), CLI entry points (8)
Integration tests (8): Full classify→learn→boost workflow (2), daemon integration (6)
Dashboard tests (5): API endpoint response shape (1), Vitest TypeScript compilation (4 — types, API client, view rendering)

Coverage Targets

Classification accuracy: 100% of 7 categories + fallback tested
Historical learning: boost, penalty, empty state, multi-match scenarios
Platform issue creation: all filter conditions (NO_GITHUB, confidence < 70, wrong category)
Daemon integration: verify rootcause_main called, events emitted, comments enriched

Critical Paths to Test

Happy path: Error message → correct classification → learning → issue creation (for platform bugs)
Error cases: Empty message → unknown/0 confidence, missing history file → no boost, gh CLI unavailable → skip issue
Edge cases: Very long error messages (truncated to 200 chars in learning), confidence boundary (exactly 70 → skip issue creation since threshold is >70), multiple categories matching (first match wins)

Data Flow Diagram

error-log.jsonl ──► rootcause_analyze_error_log() ──► {patterns_analyzed, top_categories}
 │
 ▼ (failure point: malformed JSONL)
 jq parse error → empty result
daemon log tail ──► rootcause_main() ──► rootcause_classify() ──► {category, confidence}
 │ │
 │ ▼ (failure point: no regex match)
 │ default: code_bug @ 45%
 │ │
 ├──► rootcause_boost_from_history() ──► adjusted confidence
 │ ▼ (failure point: missing JSONL)
 │ pass-through original confidence
 │
 ├──► rootcause_learn() ──► root-causes.jsonl (append)
 │ ▼ (failure point: write permission)
 │ skip learning, continue
 │
 ├──► rootcause_create_platform_issue() ──► GitHub issue
 │ ▼ (failure point: gh CLI / NO_GITHUB)
 │ skip issue creation, continue
 │
 └──► return {classification, fix} JSON

Idempotency Strategy

Learning entries: Append-only JSONL with timestamps. Duplicate entries are harmless — historical analysis uses frequency counting which handles duplicates naturally.
GitHub issue creation: Deduplication via error signature (cksum). Before creating, searches for existing open issues with matching signature. If found, returns existing issue URL instead of creating duplicate.
Classifier invocation: Pure function (given same input → same output, modulo history boosting). Safe to retry.

Rollback Plan

Remove the source "$SCRIPT_DIR/lib/root-cause.sh" lines from daemon-failure.sh, pipeline scripts
Remove rootcause_main call block from daemon_on_failure()
Remove rc_section and rc_final_section from GitHub comment templates
Remove root-cause) case from CLI router
Remove /api/root-cause/breakdown endpoint from dashboard/server.ts
Remove TypeScript types and API client additions
Remove insights view additions
Data files (root-causes.jsonl, error-log.jsonl) are append-only and can be left in place

No schema migrations needed — all persistence is via append-only JSONL files.

Endpoint Specification

GET /api/root-cause/breakdown

Path: /api/root-cause/breakdown
Query params: period (integer, days, default: 30)
Request body: None

Response (200):

{
 "breakdown": [
 {"category": "code_bug", "count": 42, "percentage": 35, "avg_confidence": 78},
 {"category": "platform_bug", "count": 12, "percentage": 10, "avg_confidence": 82}
 ],
 "total": 120,
 "period": 30
}

Response (200, empty): {"breakdown": [], "total": 0, "period": 30} (no data file or no entries in period)
Error codes: None — endpoint always returns 200 with empty data on error

Error Codes (CLI)

Exit Code	Meaning
0	Success
1	Unknown command, missing argument, or library not loaded

Rate Limiting

Not applicable — this is a local CLI tool and dashboard endpoint, not a public API.

Versioning

All scripts carry VERSION="3.2.4" matching the project version. No API versioning needed — the endpoint is internal to the dashboard.

Definition of Done

Root cause classifier library exists with 7 categories and confidence scoring
Historical pattern learning adjusts confidence based on past classifications
Daemon calls classifier on every pipeline failure and emits events
GitHub retry/failure comments include root cause analysis section
Platform bugs with >70% confidence auto-create GitHub issues (respects NO_GITHUB)
shipwright root-cause CLI works with classify/analyze/report/history subcommands
Dashboard API returns failure breakdown by category with period filtering
Dashboard frontend shows failure breakdown visualization
53+ tests pass across root-cause, daemon-failure, and dashboard test suites
No regressions in existing test suites (pipeline, e2e-smoke, dashboard vitest)
PR created and ready for review

Current Implementation Status

This feature has been through 6 build iterations on branch feat/-failure-root-cause-classifier-with-auto-184. All acceptance criteria are met and all tests pass:

Criterion	Status	Evidence
Classifier categorizes root cause	DONE	`scripts/lib/root-cause.sh` — 7 categories with confidence scoring
Decision tree from historical patterns	DONE	`rootcause_boost_from_history()` + `rootcause_analyze_history()`
Platform bugs auto-create GitHub issues	DONE	`rootcause_create_platform_issue()` with dedup + confidence gate
Dashboard shows failure breakdown	DONE	`/api/root-cause/breakdown` + `insights.ts` visualization
All tests passing	DONE	53/53 root-cause, 34/34 daemon-failure, 49/49 API, 284/284 vitest, 19/19 e2e-smoke

Remaining: PR creation and merge.

Pipeline Plan 184

Implementation Plan: Failure Root Cause Classifier with Automated Platform Issue Creation

Brainstorming & Design Decisions

Requirements Clarity

Design Alternatives Considered

Risk Assessment

Architecture

Component Diagram

Interface Contracts

Error Boundaries

Data Flow

Files to Modify

New Files

Modified Files

Implementation Steps

Step 1: Core Classifier Library (scripts/lib/root-cause.sh)

Step 2: Daemon Integration (scripts/lib/daemon-failure.sh)

Step 3: CLI Entry Point (scripts/sw-root-cause.sh)

Step 4: CLI Router (scripts/sw)

Step 5: Dashboard API (dashboard/server.ts)

Step 6: Dashboard Frontend

Step 7: Pipeline Source Integration

Step 8: Tests

Task Checklist

Testing Approach

Test Pyramid Breakdown

Coverage Targets

Critical Paths to Test

Data Flow Diagram

Idempotency Strategy

Rollback Plan

Endpoint Specification

GET /api/root-cause/breakdown

Error Codes (CLI)

Rate Limiting

Versioning

Definition of Done

Current Implementation Status

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Step 1: Core Classifier Library (`scripts/lib/root-cause.sh`)

Step 2: Daemon Integration (`scripts/lib/daemon-failure.sh`)

Step 3: CLI Entry Point (`scripts/sw-root-cause.sh`)

Step 4: CLI Router (`scripts/sw`)

Step 5: Dashboard API (`dashboard/server.ts`)