Pipeline Design 206

ezigus edited this page Mar 20, 2026 · 1 revision

ADR written to .claude/pipeline-artifacts/design.md.

Key findings from the codebase review that shaped the design:

Existing cmd_trace() is broken for current events — it matches pipeline_start/stage_start but the actual schema uses pipeline.started/stage.started. The new cmd_trace_export() is a parallel function rather than a refactor to avoid breaking unknown consumers.
Events already carry ts_epoch (integer seconds), so nanosecond conversion reduces to multiplying by 10^9, which avoids all cross-platform date parsing issues.
The dispatch pattern at sw-pipeline.sh:3178 and sw-otel.sh:582 uses a clean case statement — adding new subcommands is mechanical.
Auto-export hooks into pipeline_cleanup_worktree() at line 2071, which is the actual cleanup function in the codebase, rather than at the plan's suggested line 2727.
Added otel.export_failed event type alongside otel.trace_exported — the plan only had the success event, but failure observability matters for the auto-export path where errors are intentionally swallowed.
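The ts_epoch finding can be illustrated with a minimal sketch. The helper name epoch_to_nanos is hypothetical and not taken from sw-otel.sh; it assumes ts_epoch is always a non-negative integer, as described above.

```bash
# Hypothetical helper (not from sw-otel.sh): epoch seconds -> nanosecond string.
epoch_to_nanos() {
  case "$1" in
    ''|*[!0-9]*)
      echo "error: ts_epoch must be a non-negative integer: $1" >&2
      return 1
      ;;
  esac
  # Appending nine zeros is the same as multiplying by 10^9, and it avoids
  # shell arithmetic quirks such as leading zeros being read as octal.
  printf '%s000000000\n' "$1"
}
```

Returning a string rather than a number fits the OTLP JSON encoding, where 64-bit integers such as startTimeUnixNano are carried as strings.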

Constraints:

Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
Must use jq --arg for JSON construction (never string interpolation)
Pure bash + jq — no new dependencies
Non-blocking: auto-export must never fail the pipeline
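As a rough illustration of the jq --arg constraint, the sketch below builds one typed OTLP attribute without interpolating shell strings into JSON. The function name otlp_attr and its argument order are assumptions made for this example, not existing helpers.

```bash
# Hypothetical helper: emit one OTLP attribute as compact JSON.
# $1 = key, $2 = type (string|int|double), $3 = value
otlp_attr() {
  case "$2" in
    string)
      jq -cn --arg k "$1" --arg v "$3" '{key: $k, value: {stringValue: $v}}'
      ;;
    int)
      # proto3 JSON carries 64-bit integers as strings, so --arg is correct here.
      jq -cn --arg k "$1" --arg v "$3" '{key: $k, value: {intValue: $v}}'
      ;;
    double)
      # --argjson parses the value as a JSON number.
      jq -cn --arg k "$1" --argjson v "$3" '{key: $k, value: {doubleValue: $v}}'
      ;;
    *)
      echo "error: unknown attribute type: $2" >&2
      return 1
      ;;
  esac
}
```

Because jq does the quoting, a value containing quotes or backslashes cannot break the resulting JSON.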

Decision

Add a new cmd_trace_export() function to scripts/sw-otel.sh that produces spec-compliant OTLP/HTTP JSON. This is a new function alongside the existing cmd_trace() — the existing function is left intact to avoid breaking current consumers.
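For orientation, the outer shape of an OTLP/HTTP JSON trace payload can be sketched as follows. The function name wrap_spans and the scope name are assumptions for illustration; the resourceSpans, scopeSpans, and spans field names come from the OTLP JSON encoding.

```bash
# Hypothetical helper: wrap a JSON array of spans (read from stdin)
# in a single-resource OTLP envelope.
wrap_spans() {
  jq -c --arg service "shipwright" '{
    resourceSpans: [{
      resource: {
        attributes: [{key: "service.name", value: {stringValue: $service}}]
      },
      scopeSpans: [{scope: {name: $service}, spans: .}]
    }]
  }'
}
```

Calling `echo '[]' | wrap_spans` yields an envelope with one resourceSpans entry and an empty span list, which is a convenient starting point for structure tests.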

Core design choices:

New function, not a refactor of cmd_trace(): The existing function has unknown consumers. cmd_trace_export() produces correct OTLP; a future cleanup can deprecate cmd_trace().
Deterministic span/trace IDs via sha256: traceId = sha256(run-id) | head -c 32, spanId = sha256(run-id + stage) | head -c 16. This makes exports idempotent — re-exporting the same run produces identical output, enabling safe retries and diffing.
Run-id matching against both job_id and issue fields: Pipeline events carry job_id; some carry issue number. Grep pre-filters events.jsonl before piping to jq, bounding I/O for large files.
Nanosecond timestamps from ts_epoch: Events carry ts_epoch (integer seconds). Multiply by 1000000000 to get nanoseconds. Events have no sub-second precision, so the conversion is exact for our data. No date parsing is needed; use the ts_epoch field directly.
OTLP attribute encoding: All attributes use the array-of-{key, value} format per the OTLP spec. Values are typed: stringValue for strings, intValue for integers (as strings per proto3 JSON), doubleValue for floats.
Root span from pipeline.started/pipeline.completed; child spans from stage.* events: Each stage span's parentSpanId references the root pipeline span. Skipped stages get SPAN_KIND_INTERNAL with status UNSET. Failed stages get status code 2 (ERROR) with the error message.
Auto-export fires in pipeline_cleanup_worktree() (sw-pipeline.sh:2071): After a successful pipeline completion, if OTEL_EXPORTER_OTLP_ENDPOINT is set, spawn sw-otel.sh trace-export <id> --send with stderr redirected and || true to ensure it never blocks cleanup.
pipeline export subcommand: Thin delegation — parses --format otel (default and only format), forwards remaining args to sw-otel.sh trace-export.
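The deterministic ID scheme from the list above might look like the sketch below. The function names are hypothetical, and two details are assumptions: the run-id and stage are joined with no separator, and no trailing newline is hashed.

```bash
# Portable sha256 of a string: sha256sum on Linux, shasum -a 256 on macOS.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# traceId: first 32 hex characters (16 bytes) of sha256(run-id).
trace_id_for() {
  sha256_hex "$1" | head -c 32
}

# spanId: first 16 hex characters (8 bytes) of sha256(run-id + stage).
span_id_for() {
  sha256_hex "$1$2" | head -c 16
}
```

Since the IDs depend only on the run-id and stage name, exporting the same run twice yields the same traceId and spanId values.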

Data flow:

User: shipwright pipeline export --format otel <run-id>
 → sw-pipeline.sh dispatches to sw-otel.sh trace-export <run-id>
 → grep filters events.jsonl by run-id (job_id or issue)
 → jq builds root span from pipeline.started/completed pair
 → jq builds child spans from stage.started → stage.completed/failed/skipped pairs
 → jq assembles OTLP resourceSpans envelope
 → stdout (or --output file, or --send POST to OTLP endpoint)
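The child-span step of this flow could be sketched as below. It is an illustration under assumptions, not the actual cmd_trace_export() body: the event field names type, stage, and error are guesses at the schema (only ts_epoch, job_id, and issue are confirmed above), and the span ID is passed in because jq has no sha256 function.

```bash
# Hypothetical helper: build one stage span from already-filtered JSONL events.
# $1 = traceId, $2 = root spanId, $3 = this span's id, $4 = stage name
stage_span() {
  jq -cs --arg trace "$1" --arg root "$2" --arg span "$3" --arg stage "$4" '
    map(select(.stage == $stage)) as $ev
    | ($ev | map(select(.type == "stage.started")) | first) as $s
    | ($ev | map(select(.type == "stage.completed"
                        or .type == "stage.failed"
                        or .type == "stage.skipped")) | first) as $e
    # Emit nothing unless both halves of the pair are present.
    | select($s != null and $e != null)
    | {
        traceId: $trace,
        spanId: $span,
        parentSpanId: $root,
        name: $stage,
        kind: "SPAN_KIND_INTERNAL",
        # Appending nine zeros converts integer seconds to nanoseconds.
        startTimeUnixNano: "\($s.ts_epoch)000000000",
        endTimeUnixNano: "\($e.ts_epoch)000000000",
        status: (
          if $e.type == "stage.failed" then {code: 2, message: ($e.error // "")}
          elif $e.type == "stage.completed" then {code: 1}
          else {code: 0}
          end
        )
      }'
}
```

A stage with a stage.started event but no terminal event produces no span in this sketch; whether such stages should be exported as open spans is left undecided here.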

Error boundaries:

Boundary	Behavior
Malformed event lines	`jq` returns empty — line skipped, warning to stderr
No matching events for run-id	`error()` + exit 1
Missing `jq`	`error()` with install instructions + exit 1
`--send` fails (curl error)	`error()` + emit `otel.export_failed` event + exit 1
Auto-export path failure	Swallowed by `2>/dev/null || true`
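The malformed-line boundary could be handled roughly as follows. The function name parse_events is hypothetical, and spawning jq once per line is a simplification that is only reasonable after the grep pre-filter has reduced the input.

```bash
# Hypothetical helper: read JSONL on stdin, print valid lines as compact JSON,
# and warn on stderr for each line that jq cannot parse.
parse_events() {
  local line n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    [ -z "$line" ] && continue
    if ! printf '%s\n' "$line" | jq -c . 2>/dev/null; then
      echo "warning: skipping malformed event on line $n" >&2
    fi
  done
}
```

Valid events pass through unchanged in order, so downstream span building never sees a partial record.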

Alternatives Considered

Refactor existing cmd_trace() in-place — Pros: single function, no duplication / Cons: breaks unknown consumers of current output format, riskier change. The current function uses legacy event type names (pipeline_start vs pipeline.started) and non-standard OTLP structure. Migrating it would be a breaking change with unclear blast radius.
New standalone sw-pipeline-export.sh script — Pros: clean separation, independent lifecycle / Cons: duplicates event-reading patterns, EVENTS_FILE path management, emit_event() helpers already in sw-otel.sh. Inconsistent with the existing pattern where all OTel concerns live in sw-otel.sh.
Node.js implementation using @opentelemetry/sdk-trace-base — Pros: official SDK, guaranteed spec compliance / Cons: new dependency, heavier runtime, breaks the pure-bash pattern of the scripts directory. The project's shell scripts intentionally avoid Node dependencies for portability.

Implementation Plan

Files to modify

File	Lines affected	Change
`scripts/sw-otel.sh`	+~150 lines after line 319	New `cmd_trace_export()` function
`scripts/sw-otel.sh`	lines 540-577 (help)	Add `trace-export` to help text
`scripts/sw-otel.sh`	lines 582-611 (dispatch)	Add `trace-export` case
`scripts/sw-pipeline.sh`	lines 3178-3196 (dispatch)	Add `export` case
`scripts/sw-pipeline.sh`	lines 340-379 (help)	Add `export` to help text
`scripts/sw-pipeline.sh`	lines 2071-2105 (cleanup)	Add auto-export hook
`config/event-schema.json`	end of `event_types`	Add `otel.trace_exported` and `otel.export_failed` types
`scripts/sw-otel-test.sh`	+~120 lines	10 new test cases for `trace-export`

Files to create

File	Purpose
`docs/observability.md`	Jaeger/Honeycomb integration guide with Docker setup example

Dependencies

No new dependencies. Uses existing jq, curl, sha256sum/shasum, date, grep.
Cross-platform sha256: sha256sum (Linux) or shasum -a 256 (macOS) — add a sha256_hex() helper in the function using the same pattern as compat.sh.

Risk areas

Risk	Severity	Mitigation
OTLP JSON doesn't validate against Jaeger	Medium	Test against OTLP proto3 JSON spec field names; include a test that validates structure with `jq` schema checks
Nanosecond timestamp precision	Low	Events carry `ts_epoch` as integer seconds; multiply by 10^9 — no cross-platform `date` issues
Large events.jsonl (>100K lines) causes slowness	Low	Pre-filter with `grep` for run-id before `jq` parsing; stays under 5s for 100K lines
`pipeline_cleanup_worktree()` modifying shared cleanup path	Low	Auto-export is appended after existing logic, guarded by env var check, non-blocking
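A minimal sketch of the guarded, non-blocking hook is shown below. The function name maybe_auto_export and its second parameter are hypothetical; the parameter exists only so the exporter command can be swapped out in tests.

```bash
# Hypothetical hook body: $1 = run id, $2 = exporter command (defaults to sw-otel.sh).
maybe_auto_export() {
  local run_id="$1" exporter="${2:-sw-otel.sh}"
  # Guard: do nothing unless an OTLP endpoint is configured.
  [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ] || return 0
  # Errors are silenced and ignored so cleanup always continues.
  "$exporter" trace-export "$run_id" --send 2>/dev/null || true
}
```

The function returns 0 whether or not the exporter succeeds, so appending it after the existing cleanup logic cannot change the pipeline exit code.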

Validation Criteria

shipwright otel trace-export <run-id> outputs valid JSON parseable by jq
Output has exactly one resourceSpans entry with service.name = "shipwright"
Root pipeline span has empty parentSpanId, correct traceId, and nanosecond timestamps
Each stage span has parentSpanId equal to root span's spanId
Failed stage spans have status.code = 2 (ERROR) with message
Completed stage spans have status.code = 1 (OK)
Skipped stage spans have status.code = 0 (UNSET)
Attributes use OTLP array-of-key-value format with typed values
--output <file> writes to file instead of stdout
--send POSTs to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces with correct Content-Type
Auto-export triggers only when OTEL_EXPORTER_OTLP_ENDPOINT is set
Auto-export failure does not affect pipeline exit code
Re-exporting same run-id produces identical output (deterministic IDs)
Run-id matches both job_id field and issue field
Missing run-id returns exit 1 with descriptive error
All 10 new tests in sw-otel-test.sh pass
All existing tests pass (npm test)
No Bash 3.2 incompatibilities (no associative arrays, no readarray)
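Several of the structural criteria above can be checked with jq alone. The sketch below covers only a subset (one resourceSpans entry, one root span, ID lengths, and parent links), and the function name validate_otlp is an assumption for illustration.

```bash
# Hypothetical check: reads OTLP JSON on stdin, exits 0 only if the structure holds.
validate_otlp() {
  jq -e '
    (.resourceSpans | length) == 1
    and (
      [.resourceSpans[0].scopeSpans[].spans[]] as $spans
      | [$spans[] | select((.parentSpanId // "") == "")] as $roots
      # Exactly one root span, identified by an empty parentSpanId.
      | ($roots | length) == 1
        # traceId is 32 hex characters, spanId is 16.
        and ($spans | all((.traceId | length) == 32 and (.spanId | length) == 16))
        # Every non-root span points at the root span.
        and ([$spans[] | select((.parentSpanId // "") != "")]
             | all(.parentSpanId == $roots[0].spanId))
    )
  ' >/dev/null
}
```

jq -e maps a false result to a non-zero exit status, so the function can be used directly in a test assertion.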

Endpoint Specification

CLI endpoint: shipwright pipeline export [--format otel] <run-id>

Input: run-id — string matching job_id or issue number in events
Output: OTLP JSON to stdout (exit 0), or error to stderr (exit 1)
Flags: --format otel (default, only format), --output <file>, --send
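Run-id matching against job_id or issue might be sketched as below. The function name events_for_run is hypothetical, and the sketch assumes every line that grep lets through is well-formed JSON. grep only narrows the input; the exact comparison happens in jq, so run-1 does not also match run-10.

```bash
# Hypothetical helper: $1 = run id, $2 = path to events.jsonl
events_for_run() {
  local run_id="$1" file="$2" matches
  # grep -F is a cheap substring pre-filter; jq then applies the exact match.
  matches=$(grep -F -- "$run_id" "$file" | jq -c --arg id "$run_id" '
    select(((.job_id // "") | tostring) == $id
        or ((.issue // "") | tostring) == $id)')
  if [ -z "$matches" ]; then
    echo "error: no events found for run-id: $run_id" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}
```

The tostring calls let an issue stored as a JSON number match a run-id given as a string on the command line.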

CLI endpoint: shipwright otel trace-export <run-id> [--output <file>] [--send]

Same behavior (pipeline export delegates here)
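The thin delegation could be sketched along these lines. The function name cmd_export and the SW_OTEL_CMD override are assumptions introduced so the example can run without the real script; everything other than --format is forwarded untouched.

```bash
# Hypothetical delegation: strip --format, forward the rest to sw-otel.sh trace-export.
cmd_export() {
  local format="otel" remaining=$#
  while [ "$remaining" -gt 0 ]; do
    case "$1" in
      --format)
        if [ "$remaining" -lt 2 ]; then
          echo "error: --format requires a value" >&2
          return 1
        fi
        format="$2"
        shift 2
        remaining=$((remaining - 2))
        ;;
      *)
        # Rotate the argument to the end so it is forwarded in its original order.
        set -- "$@" "$1"
        shift
        remaining=$((remaining - 1))
        ;;
    esac
  done
  if [ "$format" != "otel" ]; then
    echo "error: unsupported format: $format" >&2
    return 1
  fi
  ${SW_OTEL_CMD:-sw-otel.sh} trace-export "$@"
}
```

Rotating the positional parameters avoids arrays entirely, which keeps the parsing loop within the Bash 3.2 constraint.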

Error codes:

Exit 0: Successful export
Exit 1: Missing argument, no matching events, jq unavailable, or --send failure

Rate Limiting: N/A (CLI tool, not a service).
Versioning: N/A (internal CLI with no external API contract).

Monitoring Checklist

Not applicable — this is a local CLI export tool, not a deployed service.

Self-monitoring is handled by:

otel.trace_exported event emitted on successful export (captures run_id, endpoint, spans_count)
otel.export_failed event emitted on --send failure (captures run_id, error)
Both events are queryable via existing shipwright otel metrics
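The payloads of the two events might be assembled as in the sketch below. The function name export_event_json and the type field name are assumptions; the existing emit_event() helper in sw-otel.sh may take a different shape, so this only illustrates the captured fields.

```bash
# Hypothetical helper: print one event as compact JSON.
# otel.trace_exported: $2 = run id, $3 = endpoint, $4 = span count
# otel.export_failed:  $2 = run id, $3 = error text
export_event_json() {
  case "$1" in
    otel.trace_exported)
      jq -cn --arg type "$1" --arg run_id "$2" --arg endpoint "$3" --argjson spans "$4" \
        '{type: $type, run_id: $run_id, endpoint: $endpoint, spans_count: $spans}'
      ;;
    otel.export_failed)
      jq -cn --arg type "$1" --arg run_id "$2" --arg error "$3" \
        '{type: $type, run_id: $run_id, error: $error}'
      ;;
    *)
      echo "error: unknown event type: $1" >&2
      return 1
      ;;
  esac
}
```

Keeping spans_count numeric via --argjson means it can be summed or compared without conversion when the events are queried later.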

Anomaly Detection / Log Analysis / Auto-Rollback

Not applicable — additive CLI feature with no deployed runtime component. Failures are local and immediately visible to the user. The auto-export path is explicitly non-blocking (|| true), so there is no production blast radius to monitor.
