Pipeline Design 206
ADR written to `.claude/pipeline-artifacts/design.md`.
Key findings from the codebase review that shaped the design:
- Existing `cmd_trace()` is broken for current events: it matches `pipeline_start`/`stage_start`, but the actual schema uses `pipeline.started`/`stage.started`. The new `cmd_trace_export()` is a parallel function rather than a refactor, to avoid breaking unknown consumers.
- Events already carry `ts_epoch` (integer seconds). This reduces nanosecond conversion to a multiply by 10^9 and avoids all cross-platform `date` parsing issues.
- The dispatch pattern at `sw-pipeline.sh:3178` and `sw-otel.sh:582` uses a clean `case` statement, so adding new subcommands is mechanical.
- Auto-export hooks into `pipeline_cleanup_worktree()` at line 2071, the actual cleanup function in the codebase, rather than at the plan's suggested line 2727.
- Added an `otel.export_failed` event type alongside `otel.trace_exported`. The plan only had the success event, but failure observability matters for the auto-export path, where errors are intentionally swallowed.
Constraints:
- Bash 3.2 compatible (no associative arrays, no `readarray`, no `${var,,}`)
- Must use `jq --arg` for JSON construction (never string interpolation)
- Pure bash + jq, no new dependencies
- Non-blocking: auto-export must never fail the pipeline
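
The `jq --arg` constraint can be illustrated with a minimal sketch. The helper names `otlp_string_attr` and `otlp_int_attr` are hypothetical and are not taken from `sw-otel.sh`; they only show why passing values as jq variables is safer than interpolating them into a JSON string.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from sw-otel.sh): build single OTLP attributes.
# --arg hands each value to jq as a string variable, so quotes and
# backslashes in the input are escaped by jq instead of corrupting the JSON.

otlp_string_attr() {
  jq -cn --arg key "$1" --arg value "$2" \
    '{key: $key, value: {stringValue: $value}}'
}

# --arg always yields a string, which matches the proto3 JSON rule that
# 64-bit integers (intValue) are encoded as strings.
otlp_int_attr() {
  jq -cn --arg key "$1" --arg value "$2" \
    '{key: $key, value: {intValue: $value}}'
}
```

Calling `otlp_string_attr "error.message" 'exit 1: "make test" failed'` prints one JSON object whose `stringValue` has the inner double quotes escaped, which a hand-built string would get wrong.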
Add a new `cmd_trace_export()` function to `scripts/sw-otel.sh` that produces spec-compliant OTLP/HTTP JSON. This is a new function alongside the existing `cmd_trace()`; the existing function is left intact to avoid breaking current consumers.
- New function, not a refactor of `cmd_trace()`: The existing function has unknown consumers. `cmd_trace_export()` produces correct OTLP; a future cleanup can deprecate `cmd_trace()`.
- Deterministic span/trace IDs via sha256: `traceId = sha256(run-id) | head -c 32`, `spanId = sha256(run-id + stage) | head -c 16`. This makes exports idempotent: re-exporting the same run produces identical output, enabling safe retries and diffing.
- Run-id matching against both `job_id` and `issue` fields: Pipeline events carry `job_id`; some carry an `issue` number. Grep pre-filters `events.jsonl` before piping to jq, bounding I/O for large files.
- Nanosecond timestamps from `ts_epoch`: Events carry `ts_epoch` (integer seconds). Multiply by `1000000000` to get nanoseconds. Sub-second precision is unavailable in events, so this is exact for our data. No `date` parsing is needed; use the `ts_epoch` field directly.
- OTLP attribute encoding: All attributes use the array-of-`{key, value}` format per the OTLP spec. Values are typed: `stringValue` for strings, `intValue` for integers (as strings per proto3 JSON), `doubleValue` for floats.
- Root span from `pipeline.started`/`pipeline.completed`; child spans from `stage.*` events: Each stage span's `parentSpanId` references the root pipeline span. Skipped stages get `SPAN_KIND_INTERNAL` with status `UNSET`. Failed stages get status code `2` (ERROR) with the error message.
- Auto-export fires in `pipeline_cleanup_worktree()` (`sw-pipeline.sh:2071`): After a successful pipeline completion, if `OTEL_EXPORTER_OTLP_ENDPOINT` is set, spawn `sw-otel.sh trace-export <id> --send` with stderr redirected and `|| true` to ensure it never blocks cleanup.
- `pipeline export` subcommand: Thin delegation. It parses `--format otel` (default and only format) and forwards the remaining args to `sw-otel.sh trace-export`.
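
The deterministic IDs and the timestamp conversion described above could look roughly like the following sketch. The helper names are illustrative rather than the ones used in `sw-otel.sh`, the run-id `run-42` is made up, and joining run-id and stage by plain concatenation is an assumption, since the ADR does not say whether a separator is used.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not taken from sw-otel.sh.

# Portable sha256 of a string: sha256sum on Linux, shasum -a 256 on macOS.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# traceId: first 32 hex chars (16 bytes) of sha256(run-id).
trace_id_for() { sha256_hex "$1" | head -c 32; }

# spanId: first 16 hex chars (8 bytes) of sha256(run-id + stage).
# Assumption: run-id and stage are concatenated without a separator.
span_id_for() { sha256_hex "$1$2" | head -c 16; }

# ts_epoch (integer seconds) to nanoseconds; fits in bash's 64-bit integers.
epoch_to_nanos() { echo $(( $1 * 1000000000 )); }

run_id="run-42"   # hypothetical run-id
echo "traceId=$(trace_id_for "$run_id")"
echo "spanId=$(span_id_for "$run_id" "build")"
echo "startTimeUnixNano=$(epoch_to_nanos 1700000000)"
```

Because every value is derived only from the run-id, the stage name, and `ts_epoch`, running the sketch twice prints the same three lines, which is the idempotency property the design relies on.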
User: shipwright pipeline export --format otel <run-id>
→ sw-pipeline.sh dispatches to sw-otel.sh trace-export <run-id>
→ grep filters events.jsonl by run-id (job_id or issue)
→ jq builds root span from pipeline.started/completed pair
→ jq builds child spans from stage.started → stage.completed/failed/skipped pairs
→ jq assembles OTLP resourceSpans envelope
→ stdout (or --output file, or --send POST to OTLP endpoint)
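
The first two steps of this flow (grep pre-filter, then exact matching in jq) might be sketched as below. `filter_run_events` is a hypothetical name; the `job_id` and `issue` field names come from the ADR. Note one simplification: this sketch skips malformed lines silently, whereas the design calls for a warning on stderr.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pre-filter stage; not the code in sw-otel.sh.
# $1 = path to events.jsonl, $2 = run-id (a job_id or an issue number)
filter_run_events() {
  local events_file="$1" run_id="$2"
  # grep -F is a cheap substring filter that bounds how much jq must parse.
  # jq then matches exactly: -R reads raw lines, fromjson? drops lines that
  # are not valid JSON, and objects drops any non-object values.
  grep -F -- "$run_id" "$events_file" \
    | jq -cR --arg id "$run_id" \
        'fromjson? | objects
         | select(((.job_id // "") | tostring) == $id
               or ((.issue // "") | tostring) == $id)'
}
```

The exact comparison in jq matters because the substring filter alone would also let through near misses, for example `run-420` when the requested run-id is `run-42`.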
| Boundary | Behavior |
|---|---|
| Malformed event lines | `jq` returns empty; line skipped, warning to stderr |
| No matching events for run-id | `error()` + exit 1 |
| Missing `jq` | `error()` with install instructions + exit 1 |
| `--send` fails (curl error) | `error()` + emit `otel.export_failed` event + exit 1 |
| Auto-export path failure | Swallowed by `2>/dev/null \|\| true` |
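
The last row, where the auto-export path swallows failures, can be sketched as follows. The function name `auto_export_trace` and the `EXPORT_CMD` variable are assumptions made so the sketch runs anywhere; in the real hook the command would be `sw-otel.sh trace-export`.

```shell
#!/usr/bin/env bash
# Sketch of a non-blocking hook; names are illustrative, not from the codebase.
# EXPORT_CMD stands in for "sw-otel.sh trace-export". It defaults to `false`
# here so that the failure path is the one being exercised.
EXPORT_CMD="${EXPORT_CMD:-false}"

auto_export_trace() {
  local run_id="$1"
  # Guard: do nothing unless an OTLP endpoint is configured.
  [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ] || return 0
  # Discard stderr and force a zero status so cleanup always continues.
  $EXPORT_CMD "$run_id" --send 2>/dev/null || true
}

OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"   # example endpoint
auto_export_trace "run-42"
echo "cleanup continues, exit=$?"
```

Even though the stand-in export command fails, the function returns 0, so a caller such as a cleanup routine never sees the error.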
- Refactor existing `cmd_trace()` in-place. Pros: single function, no duplication / Cons: breaks unknown consumers of current output format, riskier change. The current function uses legacy event type names (`pipeline_start` vs `pipeline.started`) and non-standard OTLP structure. Migrating it would be a breaking change with unclear blast radius.
- New standalone `sw-pipeline-export.sh` script. Pros: clean separation, independent lifecycle / Cons: duplicates event-reading patterns, `EVENTS_FILE` path management, and `emit_event()` helpers already in `sw-otel.sh`. Inconsistent with the existing pattern where all OTel concerns live in `sw-otel.sh`.
- Node.js implementation using `@opentelemetry/sdk-trace-base`. Pros: official SDK, guaranteed spec compliance / Cons: new dependency, heavier runtime, breaks the pure-bash pattern of the scripts directory. The project's shell scripts intentionally avoid Node dependencies for portability.
| File | Lines affected | Change |
|---|---|---|
| `scripts/sw-otel.sh` | +~150 lines after line 319 | New `cmd_trace_export()` function |
| `scripts/sw-otel.sh` | lines 540-577 (help) | Add `trace-export` to help text |
| `scripts/sw-otel.sh` | lines 582-611 (dispatch) | Add `trace-export` case |
| `scripts/sw-pipeline.sh` | lines 3178-3196 (dispatch) | Add `export` case |
| `scripts/sw-pipeline.sh` | lines 340-379 (help) | Add `export` to help text |
| `scripts/sw-pipeline.sh` | lines 2071-2105 (cleanup) | Add auto-export hook |
| `config/event-schema.json` | end of `event_types` | Add `otel.trace_exported` and `otel.export_failed` types |
| `scripts/sw-otel-test.sh` | +~120 lines | 10 new test cases for `trace-export` |

| File | Purpose |
|---|---|
| `docs/observability.md` | Jaeger/Honeycomb integration guide with Docker setup example |
- No new dependencies. Uses existing `jq`, `curl`, `sha256sum`/`shasum`, `date`, `grep`.
- Cross-platform sha256: `sha256sum` (Linux) or `shasum -a 256` (macOS). Add a `sha256_hex()` helper in the function using the same pattern as `compat.sh`.
| Risk | Severity | Mitigation |
|---|---|---|
| OTLP JSON doesn't validate against Jaeger | Medium | Test against OTLP proto3 JSON spec field names; include a test that validates structure with `jq` schema checks |
| Nanosecond timestamp precision | Low | Events carry `ts_epoch` as integer seconds; multiply by 10^9, so there are no cross-platform `date` issues |
| Large `events.jsonl` (>100K lines) causes slowness | Low | Pre-filter with `grep` for run-id before `jq` parsing; stays under 5s for 100K lines |
| `pipeline_cleanup_worktree()` modifying shared cleanup path | Low | Auto-export is appended after existing logic, guarded by env var check, non-blocking |
- `shipwright otel trace-export <run-id>` outputs valid JSON parseable by `jq`
- Output has exactly one `resourceSpans` entry with `service.name = "shipwright"`
- Root pipeline span has empty `parentSpanId`, correct `traceId`, nano timestamps
- Each stage span has `parentSpanId` equal to root span's `spanId`
- Failed stage spans have `status.code = 2` (ERROR) with message
- Completed stage spans have `status.code = 1` (OK)
- Skipped stage spans have `status.code = 0` (UNSET)
- Attributes use OTLP array-of-key-value format with typed values
- `--output <file>` writes to file instead of stdout
- `--send` POSTs to `OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces` with correct Content-Type
- Auto-export triggers only when `OTEL_EXPORTER_OTLP_ENDPOINT` is set
- Auto-export failure does not affect pipeline exit code
- Re-exporting same run-id produces identical output (deterministic IDs)
- Run-id matches both `job_id` field and `issue` field
- Missing run-id returns exit 1 with descriptive error
- All 10 new tests in `sw-otel-test.sh` pass
- All existing tests pass (`npm test`)
- No Bash 3.2 incompatibilities (no associative arrays, no `readarray`)
CLI endpoint: `shipwright pipeline export [--format otel] <run-id>`
- Input: `run-id`, a string matching the `job_id` or `issue` number in events
- Output: OTLP JSON to stdout (exit 0), or error to stderr (exit 1)
- Flags: `--format otel` (default, only format), `--output <file>`, `--send`

CLI endpoint: `shipwright otel trace-export <run-id> [--output <file>] [--send]`
- Same behavior (`pipeline export` delegates here)

Error codes:
- Exit 0: Successful export
- Exit 1: Missing argument, no matching events, `jq` unavailable, or `--send` failure
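
As a rough illustration of the thin delegation behind `pipeline export`, the sketch below parses `--format` and forwards everything else. `pipeline_export` and `OTEL_CMD` are hypothetical names; `OTEL_CMD` defaults to an `echo` so the forwarded command line is printed instead of executed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the `pipeline export` delegation (Bash 3.2 safe).
# OTEL_CMD stands in for the real sw-otel.sh invocation.
OTEL_CMD="${OTEL_CMD:-echo sw-otel.sh}"

pipeline_export() {
  local format="otel"
  local rest
  rest=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --format)
        if [ $# -lt 2 ]; then
          echo "error: --format requires a value" >&2
          return 1
        fi
        format="$2"; shift 2 ;;
      *)
        # Everything else (run-id, --output, --send) is forwarded untouched.
        rest+=("$1"); shift ;;
    esac
  done
  if [ "$format" != "otel" ]; then
    echo "error: unsupported format: $format" >&2
    return 1
  fi
  if [ ${#rest[@]} -eq 0 ]; then
    echo "error: missing <run-id>" >&2
    return 1
  fi
  $OTEL_CMD trace-export "${rest[@]}"
}

pipeline_export --format otel run-42 --send
```

With the default stand-in, the final line prints the command that would be delegated, and an unsupported format or a missing run-id returns exit status 1, matching the error codes listed above.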
Rate Limiting: N/A — CLI tool, not a service. Versioning: N/A — internal CLI with no external API contract.
Not applicable — this is a local CLI export tool, not a deployed service.
Self-monitoring is handled by:
- `otel.trace_exported` event emitted on successful export (captures `run_id`, `endpoint`, `spans_count`)
- `otel.export_failed` event emitted on `--send` failure (captures `run_id`, `error`)
- Both events are queryable via existing `shipwright otel metrics`
Not applicable: this is an additive CLI feature with no deployed runtime component. Failures are local and immediately visible to the user. The auto-export path is explicitly non-blocking (`|| true`), so there is no production blast radius to monitor.