Pipeline Design 178

ezigus edited this page Mar 17, 2026 · 3 revisions


Design: Pipeline cost forecast and budget gate with early warning

Context

Shipwright pipelines can run 12 stages, each consuming model tokens at different rates (Opus at $15/$75 per M tokens, Sonnet at $3/$15, Haiku at $0.25/$1.25). Today, estimate_pipeline_cost() in sw-pipeline.sh provides a rough aggregate estimate (~8K input / ~4K output per stage), and cost_check_budget() in sw-cost.sh:197 only checks whether the daily budget is already exceeded; it cannot predict whether a pipeline about to start will blow the budget. Operators discover cost overruns after the fact.

Constraints from the codebase:

Bash 3.2 compatibility — no associative arrays, no readarray, no ${var,,}
All JSON manipulation via jq --arg (no string interpolation)
Events written to ~/.shipwright/events.jsonl via emit_event
set -euo pipefail in all scripts; grep -c under pipefail produces double output (use || true + ${var:-0})
Dashboard is TypeScript/Bun with vitest; shell tests use lib/test-helpers.sh assertions
Existing --ignore-budget flag on pipeline start (line ~460 of sw-pipeline.sh)

Decision

Approach B: Per-stage forecast using template model assignments + historical durations.

The forecast engine lives in sw-cost.sh as new functions, called by sw-pipeline.sh before stage execution begins. This keeps cost logic centralized (single responsibility) and reuses the existing cost_calculate() function for per-model pricing.
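As a rough illustration of the per-stage approach, the sketch below sums a cost per stage from the model assigned to it and then applies a complexity multiplier. It is written in TypeScript for readability only; the real engine is planned as Bash functions in sw-cost.sh. The prices come from the Context section, read here as input/output pairs. The names `StagePlan`, `stageCostUsd`, and `forecastTotalUsd` are hypothetical, and the per-stage token counts are assumed inputs, since this design does not specify how historical durations are converted into tokens.

```typescript
// USD per million tokens, from the Context section (read as input/output pairs).
type Model = "opus" | "sonnet" | "haiku";

const PRICE_PER_M: Record<Model, { input: number; output: number }> = {
  opus: { input: 15, output: 75 },
  sonnet: { input: 3, output: 15 },
  haiku: { input: 0.25, output: 1.25 },
};

// Hypothetical shape: one enabled stage from the pipeline template.
interface StagePlan {
  id: string;
  model: Model;
  inputTokens: number;  // assumed estimate for this stage
  outputTokens: number; // assumed estimate for this stage
}

// Cost of a single stage at its assigned model's rates.
function stageCostUsd(stage: StagePlan): number {
  const price = PRICE_PER_M[stage.model];
  return (
    (stage.inputTokens / 1e6) * price.input +
    (stage.outputTokens / 1e6) * price.output
  );
}

// Point estimate: sum of stage costs, scaled by the complexity multiplier.
function forecastTotalUsd(stages: StagePlan[], complexityMultiplier: number): number {
  const base = stages.reduce((sum, s) => sum + stageCostUsd(s), 0);
  return base * complexityMultiplier;
}
```

With the existing rough estimate of ~8K input / ~4K output tokens, the same stage costs far more on Opus than on Haiku, which is the tier difference a flat per-stage rate would miss.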

Data Flow

                     ┌─────────────────────────────┐
                     │ Pipeline Template JSON      │
                     │ (enabled stages + models)   │
                     └──────────────┬──────────────┘
                                    │
                                    ▼
┌──────────────┐     ┌─────────────────────────────┐     ┌──────────────┐
│ events.jsonl │────▶│ cost_forecast() engine      │────▶│ forecast.json│
│ (history)    │     │ in sw-cost.sh               │     │ (artifact)   │
└──────────────┘     └──────────────┬──────────────┘     └──────────────┘
                                    │
                          ┌─────────┴─────────┐
                          ▼                   ▼
                   ┌──────────────┐  ┌─────────────────┐
                   │ Budget Gate  │  │ CLI / Dashboard │
                   │ (block/warn) │  │ (display)       │
                   └──────────────┘  └─────────────────┘
                          │
                          ▼
              ┌────────────────────────┐
              │ Pipeline runs stages   │
              │ ...                    │
              │ On completion:         │
              │ cost_record_variance() │
              └────────────────────────┘

Component Diagram

┌─────────────────────────────────────────────────────────┐
│ sw-cost.sh                                              │
│                                                         │
│ ┌──────────────────┐  ┌──────────────────────────────┐  │
│ │ cost_calculate() │  │ cost_forecast()              │  │
│ │ (existing)       │◀─│ - reads template stages      │  │
│ └──────────────────┘  │ - queries event history      │  │
│                       │ - applies complexity mult    │  │
│ ┌──────────────────┐  │ - computes confidence        │  │
│ │ cost_remaining_  │  └──────────────────────────────┘  │
│ │ budget() (exists)│                                    │
│ └──────────────────┘  ┌──────────────────────────────┐  │
│                       │ cost_forecast_display()      │  │
│                       │ - renders table to stdout    │  │
│                       └──────────────────────────────┘  │
│                       ┌──────────────────────────────┐  │
│                       │ cost_record_variance()       │  │
│                       │ - emits forecast vs actual   │  │
│                       └──────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ sw-pipeline.sh                                          │
│                                                         │
│ pipeline_start():                                       │
│   1. load_pipeline_config                               │
│   2. cost_forecast → save artifact                      │
│   3. budget_gate (block | warn | pass)                  │
│   4. run stages...                                      │
│   5. cost_record_variance on completion                 │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ dashboard/server.ts                                     │
│ GET /api/costs/forecast: shells to sw cost forecast     │
│ GET /api/status: enriched with forecast from artifact   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ dashboard/src/views/pipelines.ts                        │
│ Queue items display: "Est: $45–$60 (medium confidence)" │
└─────────────────────────────────────────────────────────┘

Interface Contracts

// -- sw-cost.sh outputs (JSON to stdout) --
// cost_forecast(pipeline_config_path, complexity) → stdout
interface CostForecast {
  total_usd: number; // point estimate
  low_usd: number; // lower bound (total × 0.8 high, × 0.7 medium, × 0.5 low)
  high_usd: number; // upper bound (total × 1.2 high, × 1.5 medium, × 2.0 low)
  confidence: "high" | "medium" | "low";
  data_points: number; // historical runs used
  complexity_multiplier: number;
  stages: Array<{
    id: string; // e.g. "build", "review"
    model: string; // e.g. "sonnet", "opus"
    est_duration_s: number;
    est_cost_usd: number;
  }>;
}

// cost_record_variance(forecast_usd, actual_usd, confidence, template, issue) → event emitted
// No return value; writes to events.jsonl

// -- Budget gate return codes (in pipeline_start context) --
// 0 = proceed (under budget or budget unlimited)
// 1 = warn (forecast 50–100% of remaining; pipeline proceeds with warning)
// 2 = block (forecast.high_usd > remaining AND no --force-start; pipeline exits)

// -- Dashboard API --
// GET /api/costs/forecast?pipeline=standard&complexity=5
// Response 200: CostForecast
// Response 400: { error: { code: string, message: string } }

// -- Dashboard types (additions) --
interface QueueItem {
  issue: number;
  title: string;
  score?: number;
  estimated_cost?: number; // existing field
  factors?: unknown; // existing field
  forecast?: CostForecast; // NEW
}

Error Boundaries

Component	Error	Handling
`cost_forecast()`	No events.jsonl or empty	Falls back to default durations; sets confidence="low"
`cost_forecast()`	Invalid template JSON	Returns error JSON `{"error": "..."}`, pipeline logs warning and skips gate
`cost_forecast()`	jq not available	Detected at script top; forecast skipped with warning
Budget gate	`cost_remaining_budget` returns "unlimited"	Gate skipped entirely
Budget gate	Forecast fails	Pipeline proceeds with warning (forecast is advisory, not blocking-critical)
`cost_record_variance()`	Missing forecast data at completion	Skipped silently (no-op if `PIPELINE_FORECAST_USD` unset)
Dashboard endpoint	`sw cost forecast` shell-out fails	Returns 500 with error message

Confidence Calibration

Level	Data Points	Interval Width	Rationale
High	>= 20 runs	±20% (×0.8 / ×1.2)	Enough data for stable averages
Medium	5–19 runs	−30% / +50% (×0.7 / ×1.5)	Moderate uncertainty
Low	< 5 runs	−50% / +100% (×0.5 / ×2.0)	Cold start, conservative bounds
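
The calibration table can be read as two small lookups: data points to confidence level, then confidence level to interval multipliers. A minimal TypeScript sketch follows, with hypothetical names (`confidenceFor`, `BOUNDS`, `forecastInterval`); the thresholds and multipliers are the ones listed in the table.

```typescript
type Confidence = "high" | "medium" | "low";

// Map the number of historical runs to a confidence level.
function confidenceFor(dataPoints: number): Confidence {
  if (dataPoints >= 20) return "high";
  if (dataPoints >= 5) return "medium";
  return "low";
}

// Lower/upper multipliers applied to the point estimate.
const BOUNDS: Record<Confidence, { low: number; high: number }> = {
  high: { low: 0.8, high: 1.2 },
  medium: { low: 0.7, high: 1.5 },
  low: { low: 0.5, high: 2.0 },
};

// Derive the interval fields of CostForecast from the point estimate.
function forecastInterval(
  totalUsd: number,
  dataPoints: number,
): { confidence: Confidence; low_usd: number; high_usd: number } {
  const confidence = confidenceFor(dataPoints);
  const m = BOUNDS[confidence];
  return {
    confidence,
    low_usd: totalUsd * m.low,
    high_usd: totalUsd * m.high,
  };
}
```

The medium and low intervals are deliberately asymmetric: the upper bound grows faster than the lower bound shrinks, so the gate leans conservative when history is thin.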

Budget Gate Logic (precise)

remaining = cost_remaining_budget()
if remaining == "unlimited" → PASS
if FORCE_START || IGNORE_BUDGET → PASS (with audit event)
if forecast.high_usd > remaining → BLOCK (exit 2, suggest --force-start)
if forecast.total_usd > remaining × 0.5 → WARN (continue)
else → PASS
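
The same decision order, written as a pure function. This is a TypeScript sketch of the logic only: `GateInput` and `budgetGate` are hypothetical names, and the real gate is planned as Bash inside pipeline_start(), where the bypass branch would also emit an audit event. Return values follow the gate codes in the interface contracts (0 = proceed, 1 = warn, 2 = block).

```typescript
type GateResult = 0 | 1 | 2; // 0 = proceed, 1 = warn, 2 = block

// Hypothetical input shape for illustration.
interface GateInput {
  remaining: number | "unlimited"; // result of cost_remaining_budget()
  totalUsd: number;                // forecast.total_usd
  highUsd: number;                 // forecast.high_usd
  forceStart?: boolean;            // --force-start
  ignoreBudget?: boolean;          // --ignore-budget
}

function budgetGate(g: GateInput): GateResult {
  // No budget configured: nothing to gate on.
  if (g.remaining === "unlimited") return 0;
  // Explicit operator override (the real gate would also emit an audit event).
  if (g.forceStart || g.ignoreBudget) return 0;
  // Block on the upper bound, not the point estimate.
  if (g.highUsd > g.remaining) return 2;
  // Warn when the point estimate uses more than half of what is left.
  if (g.totalUsd > g.remaining * 0.5) return 1;
  return 0;
}
```

Blocking on high_usd while warning on total_usd means a low-confidence forecast, with its wider interval, blocks earlier than a high-confidence forecast with the same point estimate.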

Historical Data Query

Scan events.jsonl for stage.completed events, extract duration per stage name, compute running averages. Limit scan to tail -1000 lines for performance. Group by stage ID. This reuses the existing event format — no new data collection needed.
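
A sketch of that query in TypeScript, for illustration; the real implementation is expected to use tail, grep, and jq in Bash. The event field names `type`, `stage`, and `duration_s` are assumptions, since the event schema is not reproduced on this page, and `averageStageDurations` is a hypothetical name.

```typescript
interface StageAverage {
  count: number;        // completed runs seen for this stage
  avgDurationS: number; // mean duration in seconds
}

// Average stage durations over the last `maxLines` lines of a JSONL event log.
function averageStageDurations(jsonl: string, maxLines = 1000): Map<string, StageAverage> {
  const lines = jsonl
    .split("\n")
    .filter((l) => l.trim() !== "")
    .slice(-maxLines); // equivalent of `tail -1000`

  const sums = new Map<string, { count: number; total: number }>();
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip malformed lines rather than failing the forecast
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const ev = parsed as Record<string, unknown>;
    const stage = ev["stage"];          // assumed field name
    const duration = ev["duration_s"];  // assumed field name
    if (ev["type"] !== "stage.completed") continue;
    if (typeof stage !== "string" || typeof duration !== "number") continue;
    const cur = sums.get(stage) ?? { count: 0, total: 0 };
    cur.count += 1;
    cur.total += duration;
    sums.set(stage, cur);
  }

  const out = new Map<string, StageAverage>();
  sums.forEach((v, stage) => {
    out.set(stage, { count: v.count, avgDurationS: v.total / v.count });
  });
  return out;
}
```

A stage with no entry in the result is the cold-start case: the caller would fall back to default durations and report low confidence.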

Alternatives Considered

Simple multiplier (stage_count × flat_rate). Pros: trivial, zero dependencies. Cons: ignores model tier differences (at the listed rates, Opus is 60× more expensive than Haiku), ignores stage duration variance (build averages 20 min vs. intake at 1 min), and produces estimates so inaccurate that they erode trust in the gate. Rejected: too coarse for meaningful go/no-go decisions.
ML regression on historical runs. Pros: could capture non-linear relationships (e.g., complexity × model × time-of-day). Cons: requires training infrastructure, needs a minimum of ~100 runs for stable regression, adds a Python dependency to a shell-native project, and is over-engineered for the current data volume (dozens of runs). Rejected: future enhancement when data justifies it.

Implementation Plan

Files to modify

File	Lines Changed (est.)	Purpose
`scripts/sw-cost.sh`	+200	`cost_forecast()`, `cost_forecast_display()`, `cost_record_variance()`, `forecast` CLI subcommand
`scripts/sw-pipeline.sh`	+50	`--force-start` flag, forecast + budget gate in `pipeline_start()`, variance at completion
`config/event-schema.json`	+20	`cost.forecast` and `cost.forecast_variance` event type definitions
`dashboard/server.ts`	+30	`/api/costs/forecast` endpoint, forecast in queue enrichment
`dashboard/src/types/api.ts`	+15	`CostForecast` interface, extend `QueueItem`
`dashboard/src/views/pipelines.ts`	+15	Forecast display on queued items
`scripts/sw-pipeline-test.sh`	+60	Integration tests for budget gate

Files to create

File	Purpose
`src/cost-forecast.test.js`	Unit tests for forecast math and variance tracking

Dependencies

None new. Uses existing jq, awk, bash, vitest.

Risk areas

events.jsonl scan performance: Mitigated by tail -1000 + grep filter. If file exceeds ~100K lines, consider indexed lookup (future).
pipeline_start() is already ~300 lines: Adding forecast + gate adds ~50 lines of sequential logic. Inserted as a discrete block after load_pipeline_config, before state file creation — minimal entanglement with existing flow.
Bash 3.2 float arithmetic: All cost math uses awk (already the pattern in cost_calculate()). No bc dependency.
Race condition on budget check: Between forecast check and actual spend, another pipeline could start. Acceptable — the gate is advisory, not transactional. --force-start exists as escape valve.

Validation Criteria

shipwright cost forecast --pipeline standard --json returns valid CostForecast JSON
shipwright cost forecast --pipeline standard renders human-readable table with per-stage breakdown
Cold start (empty events.jsonl): forecast uses defaults, shows "low" confidence
With 25+ historical stage.completed events: shows "high" confidence with narrow interval
Pipeline start displays forecast before executing stages
Pipeline blocked when forecast.high_usd > remaining_budget (exit code 2, message includes --force-start hint)
--force-start bypasses gate with audit event emitted
--ignore-budget also bypasses forecast gate (backward compatible)
cost.forecast event emitted at pipeline start
cost.forecast_variance event emitted at pipeline completion with forecast/actual/variance fields
All existing tests pass (npm test)
New unit tests cover: forecast calculation, confidence thresholds at boundaries (4/5/19/20 data points), complexity multiplier scaling, variance recording
New integration tests cover: gate blocks over budget, gate warns at 50-100%, --force-start override, variance event in events.jsonl
No Bash 4+ features used (verified by shellcheck or manual review)
Dashboard /api/costs/forecast returns valid JSON for all template types
Dashboard queue view shows forecast inline for queued items
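
For the variance criteria above, one possible shape for the cost.forecast_variance payload is sketched below in TypeScript. The page only says the event carries forecast, actual, and variance fields, so the exact field names, the sign convention (actual minus forecast), and the percentage field are all assumptions; `buildVarianceEvent` is a hypothetical helper.

```typescript
// Assumed payload shape; field names are illustrative, not the real schema.
interface ForecastVarianceEvent {
  type: "cost.forecast_variance";
  forecast_usd: number;
  actual_usd: number;
  variance_usd: number;        // actual minus forecast; positive means overrun
  variance_pct: number | null; // relative to forecast; null when forecast is 0
}

function buildVarianceEvent(forecastUsd: number, actualUsd: number): ForecastVarianceEvent {
  const varianceUsd = actualUsd - forecastUsd;
  return {
    type: "cost.forecast_variance",
    forecast_usd: forecastUsd,
    actual_usd: actualUsd,
    variance_usd: varianceUsd,
    // Avoid dividing by zero when no forecast cost was predicted.
    variance_pct: forecastUsd === 0 ? null : (varianceUsd / forecastUsd) * 100,
  };
}
```

Recording the signed difference rather than an absolute error keeps systematic under- or over-estimation visible when the events are reviewed later.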

Frontend Sections

Component Hierarchy

pipelines.ts (view)
  └─ renderQueueTable()
      └─ renderQueueRow(item: QueueItem)
          └─ renderForecastBadge(forecast?: CostForecast) // NEW
              - "Est: $45–$60 (medium confidence)"
              - Color-coded: green (under 50% budget), yellow (50-100%), red (over)

State lives in FleetState.queue[].forecast — fetched from server, no local state management needed. The forecast data flows from GET /api/status through to render.
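
A minimal sketch of what renderForecastBadge could look like as a pure string renderer. The CSS class names, the `remainingUsd` parameter, and the choice to reuse the gate's comparisons for the color level (high_usd for red, total_usd for yellow) are assumptions; the page does not say which forecast value the color thresholds use.

```typescript
// Subset of CostForecast needed by the badge.
interface BadgeForecast {
  total_usd: number;
  low_usd: number;
  high_usd: number;
  confidence: "high" | "medium" | "low";
}

// Returns an HTML string, or "" when no forecast is attached to the queue item.
function renderForecastBadge(forecast: BadgeForecast | undefined, remainingUsd: number): string {
  if (!forecast) return "";
  const low = Math.round(forecast.low_usd);
  const high = Math.round(forecast.high_usd);

  // Assumed mapping: mirror the budget gate's block/warn/pass comparisons.
  let level: "over" | "warn" | "ok" = "ok";
  if (forecast.high_usd > remainingUsd) level = "over";
  else if (forecast.total_usd > remainingUsd * 0.5) level = "warn";

  const label = `Estimated cost: $${low} to $${high}, ${forecast.confidence} confidence`;
  const text = `Est: $${low}–$${high} (${forecast.confidence} confidence)`;
  // Hypothetical class names; the text carries the meaning, color is supplementary.
  return `<span class="forecast-badge forecast-${level}" aria-label="${label}">${text}</span>`;
}
```

Because the function takes the forecast as an argument and returns a string, it fits the props-down flow: renderQueueRow can call it with item.forecast and no extra state.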

State Management Approach

No new state stores. Forecast data is embedded in the existing FleetState response from /api/status. The queue enrichment in server.ts reads cost-forecast.json from pipeline artifacts when available. Pure props-down data flow.

Accessibility Checklist

Forecast badge uses semantic <span> with aria-label="Estimated cost: $45 to $60, medium confidence"
Color coding supplemented with text labels (not color-only)
Budget warning uses role="alert" for screen reader announcement
Table cells use <td> with column headers in <th> (existing pattern)

Responsive Breakpoints

320px: Forecast column hidden; available via row expansion (existing mobile pattern)
768px+: Forecast shown as compact badge: "$45–$60 (M)"
1024px+: Full forecast text: "Est: $45–$60 (medium confidence, 12 runs)"
1440px+: No change from 1024px

The dashboard already uses a responsive table pattern — forecast column follows the same hide/show behavior as existing optional columns.
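
The breakpoints above could be expressed as a label formatter, sketched here in TypeScript. In practice the hide/show behavior is likely handled by the existing responsive table CSS, so this only illustrates the three text variants. `forecastLabel` is a hypothetical name, and treating every width below 768px as hidden is an assumption, since the list only names 320px.

```typescript
// Subset of CostForecast needed for the label.
interface LabelForecast {
  low_usd: number;
  high_usd: number;
  confidence: "high" | "medium" | "low";
  data_points: number;
}

// Returns null when the forecast column is hidden at this width.
function forecastLabel(f: LabelForecast, widthPx: number): string | null {
  const range = `$${Math.round(f.low_usd)}–$${Math.round(f.high_usd)}`;
  if (widthPx < 768) return null; // assumed: hidden below the 768px breakpoint
  if (widthPx < 1024) {
    // Compact badge: confidence abbreviated to its initial.
    return `${range} (${f.confidence.charAt(0).toUpperCase()})`;
  }
  // 1024px and wider, including 1440px+: full text.
  return `Est: ${range} (${f.confidence} confidence, ${f.data_points} runs)`;
}
```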
