Pipeline Plan 184

Seth Ford edited this page Mar 10, 2026 · 2 revisions

Implementation Plan: Failure Root Cause Classifier with Automated Platform Issue Creation

Issue: #184
Branch: feat/-failure-root-cause-classifier-with-auto-184
Complexity: Standard
Estimated files: 10 modified, 2 new


Brainstorming & Design Decisions

Requirements Clarity

Minimum viable change: Build a root cause classifier library (scripts/lib/root-cause.sh) that categorizes pipeline failures into systematic types (platform_bug, code_bug, config_error, infra_issue, rate_limit, context_exhaustion, external_dep), wire it into the daemon's failure handling path, add historical pattern learning, expose a CLI command, add dashboard visualization, and auto-create GitHub issues for platform bugs.

Implicit requirements:

  • The classifier must integrate seamlessly with the existing daemon_on_failure() flow in scripts/lib/daemon-failure.sh
  • Historical learning must feed back into classification confidence (not just regex matching)
  • Dashboard needs both an API endpoint and frontend component
  • CLI command needed for standalone shipwright root-cause usage
  • Must respect NO_GITHUB environment variable for local/offline mode
  • Must follow Bash 3.2 compatibility rules

Acceptance criteria (from issue):

  1. Failure classifier analyzes error-log.jsonl and categorizes root cause (platform/user/env/config)
  2. Decision tree trained on historical failure patterns from events.jsonl
  3. Platform bugs trigger automatic GitHub issue creation with error context, affected stages, and reproduction steps
  4. Dashboard shows failure breakdown by category
  5. Reduce repeat platform failures by >30% within 2 weeks of deployment (measurement criterion)

Design Alternatives Considered

Alternative 1: Simple regex classifier (CHOSEN)

  • Pattern-match error messages against known signatures per category
  • Historical boosting adjusts confidence based on past classifications
  • Trade-offs: Simple, fast, Bash-native, easy to extend. Less sophisticated than ML but perfectly adequate for structured error messages. Minimal blast radius — one new library file + integration points.

Alternative 2: LLM-based classifier

  • Send error messages to Claude for classification
  • Trade-offs: More accurate on ambiguous errors, but adds API cost per failure, requires Claude CLI availability, slower, and creates a circular dependency (classifier needs Claude, but Claude failures are what we're classifying). Rejected for operational reliability.

Alternative 3: SQLite-based decision tree

  • Build a proper decision tree in the SQLite layer using historical data
  • Trade-offs: More sophisticated learning, but significantly more complex, requires the sw-db.sh layer, and the JSONL-based history approach is simpler and sufficient. Over-engineering for the current scale.

Risk Assessment

| Risk | Mitigation |
| --- | --- |
| False positive platform bug issues spam the repo | Confidence threshold (>70%) + deduplication via error signature (cksum) |
| Classifier misidentifies user code bugs as platform bugs | Conservative regex patterns that require Shipwright-specific markers (sw-*.sh, shipwright, pipeline-state) |
| Historical data poisoned by misclassifications | Confidence boosting is bounded (+10 max, -5 for disagreement), so bad data self-corrects over time |
| rootcause_main() failure crashes the daemon | All classifier calls wrapped in a non-fatal guard, so the daemon continues without classification (see Error Boundaries) |
| Large error-log.jsonl causes slow analysis | Analysis capped at last 50 entries; history analysis returns top 50 patterns |
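
The deduplication mitigation in the first row can be sketched roughly as follows. This is an illustration only: the plan computes the signature with cksum in Bash, whereas the sketch substitutes a small FNV-1a hash as a stand-in. The names normalizeError, errorSignature, and findExistingIssue, the normalization step, and the `[sig:...]` title marker are all assumptions, not part of the plan.

```typescript
// Hypothetical sketch: dedupe auto-filed issues by error signature.
// FNV-1a stands in for the `cksum` call named in the plan.

interface OpenIssue {
  number: number;
  title: string;
}

// Assumption: strip volatile details (digits, extra whitespace) so repeated
// occurrences of the same failure produce the same signature.
function normalizeError(message: string): string {
  return message.toLowerCase().replace(/\d+/g, "N").replace(/\s+/g, " ").trim();
}

// 32-bit FNV-1a hash of the normalized message, rendered as 8 hex digits.
function errorSignature(message: string): string {
  let hash = 0x811c9dc5;
  const text = normalizeError(message);
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Return the open issue whose title carries this signature, if any.
function findExistingIssue(issues: OpenIssue[], signature: string): OpenIssue | undefined {
  return issues.find((issue) => issue.title.includes(`[sig:${signature}]`));
}
```

With a lookup like this, a second failure with the same signature would reuse the existing issue instead of filing a new one.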

Architecture

Component Diagram

┌──────────────────────────┐
│ Pipeline Failure         │
│ (exit code != 0)         │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐     ┌──────────────────────────┐
│ PostToolUse Hook         │────▶│ error-log.jsonl          │
│ (.claude/hooks/)         │     │ (per-tool error capture) │
└──────────────────────────┘     └──────────────────────────┘
           │
           ▼
┌──────────────────────────┐     ┌──────────────────────────┐
│ daemon_on_failure()      │────▶│ rootcause_main()         │
│ (lib/daemon-failure.sh)  │     │ (lib/root-cause.sh)      │
└──────────────────────────┘     └──────────┬───────────────┘
                                            │
                      ┌─────────────────────┼───────────────────┐
                      ▼                     ▼                   ▼
              ┌────────────────┐  ┌──────────────────┐  ┌────────────────┐
              │ rootcause_     │  │ rootcause_       │  │ rootcause_     │
              │ classify()     │  │ suggest_fix()    │  │ learn()        │
              │ + history boost│  │                  │  │                │
              └───────┬────────┘  └──────────────────┘  └───────┬────────┘
                      │                                         │
                      ▼                                         ▼
              ┌────────────────┐                      ┌───────────────────┐
              │ rootcause_     │                      │ root-causes.jsonl │
              │ create_        │                      │ (learning data)   │
              │ platform_issue │                      └───────────────────┘
              └───────┬────────┘
                      │
                      ▼
              ┌────────────────┐
              │ GitHub Issue   │
              │ (auto-created) │
              └────────────────┘

Dashboard Layer:
┌──────────────────────────────────────────────────────────────┐
│ GET /api/root-cause/breakdown                                │
│   → reads root-causes.jsonl                                  │
│   → returns category distribution + avg confidence           │
│   → consumed by insights.ts frontend view                    │
└──────────────────────────────────────────────────────────────┘

Interface Contracts

// Classification output (from rootcause_classify)
interface Classification {
  category: "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug" | "config_error" | "external_dep" | "code_bug";
  confidence: number; // 0-99
  evidence: string[];
  suggested_action: string;
}

// Fix suggestion output (from rootcause_suggest_fix)
interface FixSuggestion {
  category: string;
  suggestions: string;
  actionability: number; // 0-100
}

// Main result (from rootcause_main)
interface RootCauseResult {
  classification: Classification;
  fix: FixSuggestion;
}

// Learning entry (root-causes.jsonl line)
interface LearningEntry {
  category: string;
  confidence: number;
  message: string; // first 200 chars
  recorded_at: string; // ISO 8601
}

// Dashboard API response (GET /api/root-cause/breakdown)
interface RootCauseBreakdown {
  breakdown: Array<{
    category: string;
    count: number;
    percentage: number;
    avg_confidence: number;
  }>;
  total: number;
  period: number; // days
}

Error Boundaries

  • Classifier errors → caught by || true in daemon integration, failure continues without classification
  • Learning write errors → caught silently, classification still returned
  • GitHub issue creation errors → caught by || true, logged but non-blocking
  • Dashboard API errors → returns empty breakdown {breakdown: [], total: 0, period: N}
  • Malformed JSONL data → jq handles gracefully with 2>/dev/null fallback
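
The malformed-JSONL boundary could look roughly like this on the dashboard side. The plan itself relies on jq with a 2>/dev/null fallback in Bash; the sketch below only mirrors the same "skip what cannot be parsed" behavior, and parseJsonlLenient is a hypothetical name.

```typescript
// Hypothetical sketch: read JSONL while ignoring malformed lines,
// so one corrupt entry never fails the whole read.
function parseJsonlLenient(text: string): Record<string, unknown>[] {
  const entries: Record<string, unknown>[] = [];
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines carry no entry
    try {
      const value: unknown = JSON.parse(trimmed);
      // Keep only plain objects; scalars and arrays are not valid entries.
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        entries.push(value as Record<string, unknown>);
      }
    } catch {
      // Malformed line: skip it and keep going.
    }
  }
  return entries;
}
```

An empty or fully corrupt file then yields an empty array, which maps directly onto the empty breakdown response.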

Data Flow

Error occurs → PostToolUse hook captures to error-log.jsonl
  → daemon_on_failure() extracts error snippet (last 100 lines)
  → rootcause_classify() matches against 7 category patterns
  → rootcause_boost_from_history() adjusts confidence from learning data (up to +10 for agreement, -5 for disagreement)
  → rootcause_suggest_fix() generates actionable suggestions
  → rootcause_learn() appends to root-causes.jsonl (atomic write)
  → rootcause_create_platform_issue() files GitHub issue if platform_bug/config_error >70%
  → daemon enriches GitHub retry/failure comments with root cause section
  → dashboard reads root-causes.jsonl for /api/root-cause/breakdown
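
The history adjustment step in this flow can be sketched as below, using the bounds stated elsewhere in the plan (+2 per agreement up to +10, cap 99, -5 for disagreement, floor 10, match on the first 100 characters). The function name boostFromHistory is hypothetical, and applying a single flat penalty no matter how many entries disagree is an assumption; the plan does not spell that out.

```typescript
// Hypothetical sketch of the history adjustment (not the Bash implementation).
interface LearningEntry {
  category: string;
  confidence: number;
  message: string; // first 200 chars
  recorded_at: string; // ISO 8601
}

const PREFIX_LENGTH = 100; // entries are "similar" when their first 100 chars match
const BOOST_PER_AGREEMENT = 2;
const MAX_BOOST = 10;
const DISAGREEMENT_PENALTY = 5;

function boostFromHistory(
  message: string,
  category: string,
  confidence: number,
  history: LearningEntry[],
): number {
  const prefix = message.slice(0, PREFIX_LENGTH);
  const similar = history.filter((e) => e.message.slice(0, PREFIX_LENGTH) === prefix);
  const agreements = similar.filter((e) => e.category === category).length;
  const disagreements = similar.length - agreements;

  let adjusted = confidence;
  if (agreements > 0) {
    // +2 per agreeing entry, at most +10, never above 99.
    adjusted = Math.min(99, adjusted + Math.min(MAX_BOOST, agreements * BOOST_PER_AGREEMENT));
  }
  if (disagreements > 0) {
    // Assumption: one flat -5 penalty, floored at 10 (inputs assumed to be >= 10).
    adjusted = Math.max(10, adjusted - DISAGREEMENT_PENALTY);
  }
  return adjusted; // no similar history: confidence passes through unchanged
}
```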

Files to Modify

New Files

| File | Purpose |
| --- | --- |
| scripts/lib/root-cause.sh | Core classifier library (499 lines) |
| scripts/sw-root-cause.sh | CLI entry point (197 lines) |
| scripts/sw-root-cause-test.sh | Test suite (374 lines, 53 tests) |

Modified Files

| File | Changes |
| --- | --- |
| scripts/lib/daemon-failure.sh | Wire rootcause_main() into daemon_on_failure(), enrich retry/failure comments |
| scripts/sw | Add root-cause dispatch to CLI router |
| scripts/sw-pipeline.sh | Source root-cause library |
| scripts/lib/pipeline-cli.sh | Source root-cause library |
| scripts/lib/pipeline-commands.sh | Source root-cause library |
| dashboard/server.ts | Add GET /api/root-cause/breakdown endpoint |
| dashboard/src/types/api.ts | Add RootCauseBreakdown TypeScript interface |
| dashboard/src/core/api.ts | Add fetchRootCauseBreakdown() client function |
| dashboard/src/views/insights.ts | Add failure breakdown visualization |
| scripts/sw-server-api-test.sh | Add test for breakdown API endpoint |

Implementation Steps

Step 1: Core Classifier Library (scripts/lib/root-cause.sh)

Create the root cause classification library with these functions:

  1. rootcause_classify(error_message, stage, exit_code) — Pattern-match error messages against 7 categories using cascading regex checks. Each category has distinct patterns:

    • rate_limit: 429, rate limit, throttled, quota
    • context_exhaustion: context window, token limit, auto-compact
    • infra_issue: timeout, OOM, disk full, network, socket
    • platform_bug: sw-*.sh, shipwright, unbound variable
    • config_error: missing config, invalid json, bad template
    • external_dep: npm ERR, pip install, cargo error
    • code_bug: AssertionError, SyntaxError, test fail (default fallback at 45% confidence)
  2. rootcause_boost_from_history(error_message, category, confidence) — Search ~/.shipwright/optimization/root-causes.jsonl for similar past classifications (first 100 chars match). Boost +2 per agreement (max +10, cap 99). Penalize -5 for disagreement (floor 10).

  3. rootcause_analyze_history() — Aggregate learning data into frequency map grouped by message prefix, returning top 50 patterns with count >= 2.

  4. rootcause_analyze_error_log() — Read last 50 entries from error-log.jsonl, classify each, group by category.

  5. rootcause_suggest_fix(category) — Return category-specific fix suggestions with actionability scores (50-90).

  6. rootcause_learn(category, confidence, message) — Atomic append to root-causes.jsonl with proper escaping via jq --arg.

  7. rootcause_create_platform_issue(classification_json, error_message, stage) — Create GitHub issue for platform_bug/config_error with >70% confidence. Deduplicates via error signature (cksum). Respects NO_GITHUB.

  8. rootcause_report() — Generate markdown report: category distribution, top 5 frequent causes, platform bug trend (24h vs 7d), avg confidence by category.

  9. rootcause_main(error_message, stage, exit_code) — Orchestrate: classify → suggest → learn → create issue → return JSON.
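
The cascading match described for rootcause_classify can be illustrated with a short sketch. The real library is Bash; this TypeScript version only shows the "first matching rule wins" shape. The pattern keywords come from the list above, but the per-category confidence values are placeholders (the plan only fixes the 45% fallback), and the names Rule, RULES, and classify are hypothetical.

```typescript
// Hypothetical sketch of a cascading regex classifier.
type Category =
  | "rate_limit" | "context_exhaustion" | "infra_issue" | "platform_bug"
  | "config_error" | "external_dep" | "code_bug";

interface Rule {
  category: Category;
  pattern: RegExp;
  confidence: number; // placeholder values, not taken from the plan
}

// Order matters: rules are checked top to bottom and the first match wins.
const RULES: Rule[] = [
  { category: "rate_limit", pattern: /\b429\b|rate limit|throttled|quota/i, confidence: 85 },
  { category: "context_exhaustion", pattern: /context window|token limit|auto-compact/i, confidence: 85 },
  { category: "infra_issue", pattern: /timeout|\bOOM\b|disk full|network|socket/i, confidence: 75 },
  { category: "platform_bug", pattern: /sw-[\w-]+\.sh|shipwright|unbound variable/i, confidence: 80 },
  { category: "config_error", pattern: /missing config|invalid json|bad template/i, confidence: 75 },
  { category: "external_dep", pattern: /npm ERR|pip install|cargo error/i, confidence: 70 },
  { category: "code_bug", pattern: /AssertionError|SyntaxError|test fail/i, confidence: 70 },
];

function classify(message: string): { category: Category | "unknown"; confidence: number } {
  // Empty input is reported as unknown with zero confidence.
  if (message.trim() === "") return { category: "unknown", confidence: 0 };
  const rule = RULES.find((r) => r.pattern.test(message));
  if (rule) return { category: rule.category, confidence: rule.confidence };
  // No pattern matched: default fallback from the plan.
  return { category: "code_bug", confidence: 45 };
}
```

Because the rules are ordered, a message that mentions both a timeout and shipwright lands in infra_issue, not platform_bug. That ordering is what keeps platform issue creation conservative.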

Step 2: Daemon Integration (scripts/lib/daemon-failure.sh)

  • Source root-cause.sh at module load
  • In daemon_on_failure(): call rootcause_main() with last 100 lines of issue log
  • Extract rc_category, rc_confidence, rc_suggestions from JSON result
  • Emit daemon.root_cause event with category, confidence, issue number
  • Build rc_section for GitHub retry comments (collapsible details with root cause + suggestions)
  • Build rc_final_section for final failure comments
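
As an illustration of the collapsible section, the sketch below builds a details block from the three extracted values. The daemon does this in Bash; the function name buildRootCauseSection and the exact wording of the summary line are assumptions. Returning an empty string when no category is available is also an assumption, chosen so that a failed classification leaves the comment unchanged.

```typescript
// Hypothetical sketch: render the root cause section for a GitHub comment.
function buildRootCauseSection(category: string, confidence: number, suggestions: string): string {
  // No classification available: contribute nothing to the comment.
  if (category === "") return "";
  return [
    "<details>",
    `<summary>Root cause: ${category} (${confidence}% confidence)</summary>`,
    "",
    suggestions,
    "",
    "</details>",
  ].join("\n");
}
```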

Step 3: CLI Entry Point (scripts/sw-root-cause.sh)

Commands:

  • shipwright root-cause classify <message> — Classify an error message (supports stdin)
    • --stage <stage> — Stage context
    • --exit-code <code> — Exit code context
    • --json — JSON output
  • shipwright root-cause analyze — Analyze error-log.jsonl patterns
  • shipwright root-cause report — Generate full report from learning history
  • shipwright root-cause history [limit] — Show recent classifications (default 20)

Step 4: CLI Router (scripts/sw)

Add root-cause) case to the main dispatcher.

Step 5: Dashboard API (dashboard/server.ts)

Add GET /api/root-cause/breakdown endpoint:

  • Query param: period (default 30 days)
  • Read ~/.shipwright/optimization/root-causes.jsonl
  • Filter by recorded_at >= cutoff
  • Group by category, count, avg confidence
  • Return sorted breakdown with percentages
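
The aggregation behind this endpoint might look roughly like the sketch below. It is a pure function over already-parsed entries, so file reading and HTTP handling are left out. The name buildBreakdown, the injectable now parameter, rounding to whole numbers, and sorting by descending count are assumptions; the plan only says the breakdown is sorted and includes percentages.

```typescript
// Hypothetical sketch of the breakdown aggregation.
interface LearningEntry {
  category: string;
  confidence: number;
  message: string;
  recorded_at: string; // ISO 8601
}

interface RootCauseBreakdown {
  breakdown: Array<{ category: string; count: number; percentage: number; avg_confidence: number }>;
  total: number;
  period: number; // days
}

function buildBreakdown(
  entries: LearningEntry[],
  periodDays: number,
  now: Date = new Date(),
): RootCauseBreakdown {
  const cutoff = now.getTime() - periodDays * 24 * 60 * 60 * 1000;
  const groups = new Map<string, { count: number; confidenceSum: number }>();
  let total = 0;

  for (const entry of entries) {
    const recordedAt = Date.parse(entry.recorded_at);
    // Skip entries with unparseable timestamps or outside the period.
    if (Number.isNaN(recordedAt) || recordedAt < cutoff) continue;
    const group = groups.get(entry.category) ?? { count: 0, confidenceSum: 0 };
    group.count += 1;
    group.confidenceSum += entry.confidence;
    groups.set(entry.category, group);
    total += 1;
  }

  // total > 0 whenever groups is non-empty, so the division below is safe.
  const breakdown = Array.from(groups.entries())
    .map(([category, g]) => ({
      category,
      count: g.count,
      percentage: Math.round((g.count / total) * 100),
      avg_confidence: Math.round(g.confidenceSum / g.count),
    }))
    .sort((a, b) => b.count - a.count);

  return { breakdown, total, period: periodDays };
}
```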

Step 6: Dashboard Frontend

  • dashboard/src/types/api.ts: Add RootCauseBreakdown interface
  • dashboard/src/core/api.ts: Add fetchRootCauseBreakdown() function
  • dashboard/src/views/insights.ts: Add failure breakdown visualization with category bars and confidence indicators

Step 7: Pipeline Source Integration

Ensure root-cause.sh is sourced in sw-pipeline.sh, lib/pipeline-cli.sh, lib/pipeline-commands.sh so the classifier is available throughout the pipeline.

Step 8: Tests

Write comprehensive tests in scripts/sw-root-cause-test.sh:

  • Classification tests (9): Each category + fallback + empty input
  • Error log analysis (2): With/without error-log.jsonl
  • Fix suggestions (6): Each category returns appropriate suggestions
  • Learning system (3): Write, accumulate, verify JSONL format
  • Platform issue creation (3): NO_GITHUB skip, confidence threshold, category filter
  • Report generation (2): Category distribution, trend analysis
  • Historical pattern analysis (7): Boost, penalty, empty state, multiple matches
  • Integration tests (2): Full classify→learn→boost workflow
  • CLI entry points (9): Each subcommand + help + version
  • Daemon integration (6): Verify daemon-failure.sh calls rootcause functions

Add dashboard API test to scripts/sw-server-api-test.sh.


Task Checklist

  • Task 1: Create scripts/lib/root-cause.sh with classifier, learning, issue creation functions
  • Task 2: Add rootcause_analyze_history() and rootcause_boost_from_history() for historical learning
  • Task 3: Wire root cause classifier into daemon_on_failure() in scripts/lib/daemon-failure.sh
  • Task 4: Enhance daemon retry/failure GitHub comments with root cause classification sections
  • Task 5: Create scripts/sw-root-cause.sh CLI entry point with classify/analyze/report/history
  • Task 6: Add root-cause dispatch to scripts/sw CLI router
  • Task 7: Source root-cause.sh in pipeline scripts (sw-pipeline.sh, pipeline-cli.sh, pipeline-commands.sh)
  • Task 8: Add GET /api/root-cause/breakdown endpoint to dashboard/server.ts
  • Task 9: Add RootCauseBreakdown TypeScript interface and API client function
  • Task 10: Add failure breakdown visualization to dashboard/src/views/insights.ts
  • Task 11: Write test suite scripts/sw-root-cause-test.sh (~53 tests across 10 test groups)
  • Task 12: Add dashboard API test for /api/root-cause/breakdown
  • Task 13: Run sw-root-cause-test.sh, sw-lib-daemon-failure-test.sh, and sw-server-api-test.sh
  • Task 14: Run dashboard vitest suite to verify TypeScript changes

Testing Approach

Test Pyramid Breakdown

  • Unit tests (40): Classification per category (9), fix suggestions (6), learning system (3), history analysis (7), error log analysis (2), report generation (2), platform issue filters (3), CLI entry points (8)
  • Integration tests (8): Full classify→learn→boost workflow (2), daemon integration (6)
  • Dashboard tests (5): API endpoint response shape (1), Vitest TypeScript compilation (4 — types, API client, view rendering)

Coverage Targets

  • Classification accuracy: 100% of 7 categories + fallback tested
  • Historical learning: boost, penalty, empty state, multi-match scenarios
  • Platform issue creation: all filter conditions (NO_GITHUB, confidence < 70, wrong category)
  • Daemon integration: verify rootcause_main called, events emitted, comments enriched

Critical Paths to Test

  1. Happy path: Error message → correct classification → learning → issue creation (for platform bugs)
  2. Error cases: Empty message → unknown/0 confidence, missing history file → no boost, gh CLI unavailable → skip issue
  3. Edge cases: Very long error messages (truncated to 200 chars in learning), confidence boundary (exactly 70 → skip issue creation since threshold is >70), multiple categories matching (first match wins)
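
The confidence-boundary edge case is easiest to see in code. The sketch below encodes the issue-creation gate as described in the plan: only platform_bug and config_error qualify, the threshold is strictly greater than 70, and NO_GITHUB disables creation. The function name shouldCreateIssue is hypothetical, and treating any non-empty NO_GITHUB value as "set" is an assumption.

```typescript
// Hypothetical sketch of the issue-creation gate.
const ISSUE_CATEGORIES = ["platform_bug", "config_error"];
const CONFIDENCE_THRESHOLD = 70;

function shouldCreateIssue(
  category: string,
  confidence: number,
  env: Record<string, string | undefined>,
): boolean {
  // Assumption: any non-empty NO_GITHUB value means offline mode.
  if (env["NO_GITHUB"]) return false;
  if (!ISSUE_CATEGORIES.includes(category)) return false;
  // Strictly greater than: a confidence of exactly 70 does not file an issue.
  return confidence > CONFIDENCE_THRESHOLD;
}
```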

Data Flow Diagram

error-log.jsonl ──► rootcause_analyze_error_log() ──► {patterns_analyzed, top_categories}
                        │
                        ▼ (failure point: malformed JSONL)
                    jq parse error → empty result

daemon log tail ──► rootcause_main() ──► rootcause_classify() ──► {category, confidence}
                        │                    │
                        │                    ▼ (failure point: no regex match)
                        │                default: code_bug @ 45%
                        │
                        ├──► rootcause_boost_from_history() ──► adjusted confidence
                        │        ▼ (failure point: missing JSONL)
                        │    pass-through original confidence
                        │
                        ├──► rootcause_learn() ──► root-causes.jsonl (append)
                        │        ▼ (failure point: write permission)
                        │    skip learning, continue
                        │
                        ├──► rootcause_create_platform_issue() ──► GitHub issue
                        │        ▼ (failure point: gh CLI / NO_GITHUB)
                        │    skip issue creation, continue
                        │
                        └──► return {classification, fix} JSON

Idempotency Strategy

  • Learning entries: Append-only JSONL with timestamps. Duplicate entries are harmless — historical analysis uses frequency counting which handles duplicates naturally.
  • GitHub issue creation: Deduplication via error signature (cksum). Before creating, searches for existing open issues with matching signature. If found, returns existing issue URL instead of creating duplicate.
  • Classifier invocation: Pure function (given same input → same output, modulo history boosting). Safe to retry.
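
To show why duplicate learning entries are harmless, here is a sketch of the frequency counting that rootcause_analyze_history is described as doing: group by message prefix, keep patterns seen at least twice, return the top 50. The name analyzeHistory and the 100-character prefix length are assumptions (the plan states the prefix length only for the boost step).

```typescript
// Hypothetical sketch of frequency counting over learning entries.
interface PatternCount {
  prefix: string;
  count: number;
}

function analyzeHistory(messages: string[], prefixLength = 100, limit = 50): PatternCount[] {
  const counts = new Map<string, number>();
  for (const message of messages) {
    const prefix = message.slice(0, prefixLength);
    // A duplicate entry simply increments the count for its prefix.
    counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
  }
  return Array.from(counts, ([prefix, count]) => ({ prefix, count }))
    .filter((p) => p.count >= 2) // one-off errors are not patterns
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```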

Rollback Plan

  1. Remove the source "$SCRIPT_DIR/lib/root-cause.sh" lines from daemon-failure.sh, pipeline scripts
  2. Remove rootcause_main call block from daemon_on_failure()
  3. Remove rc_section and rc_final_section from GitHub comment templates
  4. Remove root-cause) case from CLI router
  5. Remove /api/root-cause/breakdown endpoint from dashboard/server.ts
  6. Remove TypeScript types and API client additions
  7. Remove insights view additions
  8. Data files (root-causes.jsonl, error-log.jsonl) are append-only and can be left in place

No schema migrations needed — all persistence is via append-only JSONL files.


Endpoint Specification

GET /api/root-cause/breakdown

  • Path: /api/root-cause/breakdown
  • Query params: period (integer, days, default: 30)
  • Request body: None
  • Response (200):
    {
      "breakdown": [
        {"category": "code_bug", "count": 42, "percentage": 35, "avg_confidence": 78},
        {"category": "platform_bug", "count": 12, "percentage": 10, "avg_confidence": 82}
      ],
      "total": 120,
      "period": 30
    }
  • Response (200, empty): {"breakdown": [], "total": 0, "period": 30} (no data file or no entries in period)
  • Error codes: None — endpoint always returns 200 with empty data on error
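
Because the endpoint always answers 200, a client cannot rely on the status code to detect a bad payload. A runtime shape check along these lines could help the frontend guard against one. This is a sketch only; isRecord and isRootCauseBreakdown are hypothetical helpers and are not part of the planned fetchRootCauseBreakdown() function.

```typescript
// Hypothetical sketch: runtime check of the breakdown response shape.
interface RootCauseBreakdown {
  breakdown: Array<{ category: string; count: number; percentage: number; avg_confidence: number }>;
  total: number;
  period: number; // days
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isRootCauseBreakdown(value: unknown): value is RootCauseBreakdown {
  if (!isRecord(value)) return false;
  if (typeof value["total"] !== "number" || typeof value["period"] !== "number") return false;
  const rows = value["breakdown"];
  if (!Array.isArray(rows)) return false;
  // Every row must carry all four fields with the expected types.
  return rows.every(
    (row: unknown) =>
      isRecord(row) &&
      typeof row["category"] === "string" &&
      typeof row["count"] === "number" &&
      typeof row["percentage"] === "number" &&
      typeof row["avg_confidence"] === "number",
  );
}
```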

Error Codes (CLI)

| Exit Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | Unknown command, missing argument, or library not loaded |

Rate Limiting

Not applicable — this is a local CLI tool and dashboard endpoint, not a public API.

Versioning

All scripts carry VERSION="3.2.4" matching the project version. No API versioning needed — the endpoint is internal to the dashboard.


Definition of Done

  • Root cause classifier library exists with 7 categories and confidence scoring
  • Historical pattern learning adjusts confidence based on past classifications
  • Daemon calls classifier on every pipeline failure and emits events
  • GitHub retry/failure comments include root cause analysis section
  • Platform bugs with >70% confidence auto-create GitHub issues (respects NO_GITHUB)
  • shipwright root-cause CLI works with classify/analyze/report/history subcommands
  • Dashboard API returns failure breakdown by category with period filtering
  • Dashboard frontend shows failure breakdown visualization
  • 53+ tests pass across root-cause, daemon-failure, and dashboard test suites
  • No regressions in existing test suites (pipeline, e2e-smoke, dashboard vitest)
  • PR created and ready for review

Current Implementation Status

This feature has been through 6 build iterations on branch feat/-failure-root-cause-classifier-with-auto-184. All acceptance criteria are met and all tests pass:

| Criterion | Status | Evidence |
| --- | --- | --- |
| Classifier categorizes root cause | DONE | scripts/lib/root-cause.sh (7 categories with confidence scoring) |
| Decision tree from historical patterns | DONE | rootcause_boost_from_history() + rootcause_analyze_history() |
| Platform bugs auto-create GitHub issues | DONE | rootcause_create_platform_issue() with dedup + confidence gate |
| Dashboard shows failure breakdown | DONE | /api/root-cause/breakdown + insights.ts visualization |
| All tests passing | DONE | 53/53 root-cause, 34/34 daemon-failure, 49/49 API, 284/284 vitest, 19/19 e2e-smoke |

Remaining: PR creation and merge.
