Pipeline Design 206

ezigus edited this page Mar 20, 2026 · 1 revision

ADR written to .claude/pipeline-artifacts/design.md.

Key findings from the codebase review that shaped the design:

  1. Existing cmd_trace() is broken for current events — it matches pipeline_start/stage_start but the actual schema uses pipeline.started/stage.started. The new cmd_trace_export() is a parallel function rather than a refactor to avoid breaking unknown consumers.

  2. Events already carry ts_epoch (integer seconds) — this simplifies nanosecond conversion to a simple multiply-by-10^9, avoiding all cross-platform date parsing issues.

  3. The dispatch pattern at sw-pipeline.sh:3178 and sw-otel.sh:582 uses a clean case statement — adding new subcommands is mechanical.

  4. Auto-export hooks into pipeline_cleanup_worktree() at line 2071, which is the actual cleanup function in the codebase, rather than the line 2727 suggested by the plan.

  5. Added otel.export_failed event type alongside otel.trace_exported — the plan only had the success event, but failure observability matters for the auto-export path where errors are intentionally swallowed.

Constraints:

  • Bash 3.2 compatible (no associative arrays, no readarray, no ${var,,})
  • Must use jq --arg for JSON construction (never string interpolation)
  • Pure bash + jq — no new dependencies
  • Non-blocking: auto-export must never fail the pipeline
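The jq --arg constraint can be illustrated with a minimal sketch. build_attr is a hypothetical helper name used only for this example, not a function from the codebase; it builds one string attribute in the array-of-{key, value} shape described later in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the codebase): build one OTLP string attribute.
# Values are passed through --arg, so jq escapes quotes and backslashes itself
# instead of the shell splicing raw text into the JSON.
build_attr() {
  jq -cn --arg key "$1" --arg val "$2" \
    '{key: $key, value: {stringValue: $val}}'
}

# A value containing double quotes, which would corrupt interpolated JSON.
build_attr "error.message" 'exit "1" in stage build'
```

With string interpolation the embedded quotes would end the JSON string early; with --arg the output stays parseable whatever the value contains.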

Decision

Add a new cmd_trace_export() function to scripts/sw-otel.sh that produces spec-compliant OTLP/HTTP JSON. This is a new function alongside the existing cmd_trace() — the existing function is left intact to avoid breaking current consumers.

Core design choices:

  1. New function, not a refactor of cmd_trace(): The existing function has unknown consumers. cmd_trace_export() produces correct OTLP; a future cleanup can deprecate cmd_trace().

  2. Deterministic span/trace IDs via sha256: traceId = sha256(run-id) | head -c 32, spanId = sha256(run-id + stage) | head -c 16. This makes exports idempotent — re-exporting the same run produces identical output, enabling safe retries and diffing.

  3. Run-id matching against both job_id and issue fields: Pipeline events carry job_id; some carry an issue number. grep pre-filters events.jsonl before piping to jq, bounding the number of lines jq has to parse for large files.

  4. Nanosecond timestamps from ts_epoch: Events carry ts_epoch (integer seconds). Multiply by 1000000000 to get nanoseconds. Sub-second precision is unavailable in events, so the result is exact for our data. No ISO date parsing is needed; use the ts_epoch field directly.

  5. OTLP attribute encoding: All attributes use the array-of-{key, value} format per the OTLP spec. Values are typed: stringValue for strings, intValue for integers (as strings per proto3 JSON), doubleValue for floats.

  6. Root span from pipeline.started/pipeline.completed; child spans from stage.* events: Each stage span's parentSpanId references the root pipeline span. Skipped stages get SPAN_KIND_INTERNAL with status UNSET. Failed stages get status code 2 (ERROR) with the error message.

  7. Auto-export fires in pipeline_cleanup_worktree() (sw-pipeline.sh:2071): After a successful pipeline completion, if OTEL_EXPORTER_OTLP_ENDPOINT is set, spawn sw-otel.sh trace-export <id> --send with stderr redirected and || true to ensure it never blocks cleanup.

  8. pipeline export subcommand: Thin delegation — parses --format otel (default and only format), forwards remaining args to sw-otel.sh trace-export.
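The timestamp rule in design choice 4 can be sketched in pure Bash. epoch_to_nanos is a hypothetical name; the sketch assumes ts_epoch is a non-negative integer and that Bash arithmetic is 64-bit, as it is on typical platforms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert integer epoch seconds to nanoseconds by
# multiplying by 10^9. The 10# prefix forces base 10 so that a leading zero
# is not interpreted as octal.
epoch_to_nanos() {
  case "$1" in
    ''|*[!0-9]*) echo "error: not an integer epoch: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "$(( 10#$1 * 1000000000 ))"
}

epoch_to_nanos 1742428800
# prints: 1742428800000000000
```

In the OTLP JSON itself the value would be emitted as a string, since proto3 JSON encodes 64-bit integers as strings.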

Data flow:

User: shipwright pipeline export --format otel <run-id>
 → sw-pipeline.sh dispatches to sw-otel.sh trace-export <run-id>
 → grep filters events.jsonl by run-id (job_id or issue)
 → jq builds root span from pipeline.started/completed pair
 → jq builds child spans from stage.started → stage.completed/failed/skipped pairs
 → jq assembles OTLP resourceSpans envelope
 → stdout (or --output file, or --send POST to OTLP endpoint)
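The data flow above can be sketched end to end on a small sample file. This is an illustration under assumptions, not the planned implementation: the "type" and "stage" field names are guesses about the event schema (job_id, issue and ts_epoch are named in this document), sketch_trace_export is a hypothetical name, and trace/span IDs and status are omitted to keep it short.

```shell
#!/usr/bin/env bash
# Sample events. "type" and "stage" are assumed field names.
events_file=$(mktemp)
cat > "$events_file" <<'EOF'
{"type":"pipeline.started","job_id":"run-42","ts_epoch":1742428800}
{"type":"stage.started","job_id":"run-42","stage":"build","ts_epoch":1742428801}
{"type":"stage.completed","job_id":"run-42","stage":"build","ts_epoch":1742428805}
{"type":"pipeline.started","job_id":"run-99","ts_epoch":1742428900}
{"type":"pipeline.completed","job_id":"run-42","ts_epoch":1742428806}
EOF

# grep is only a coarse pre-filter; jq does the exact job_id/issue match and
# assembles the resourceSpans envelope. Assumes every start has a matching end.
sketch_trace_export() {
  local run_id="$1" file="$2"
  grep -F -- "$run_id" "$file" | jq -s --arg run "$run_id" '
    def nanos: tostring + "000000000";
    def is_stage_end:
      .type == "stage.completed" or .type == "stage.failed" or .type == "stage.skipped";
    map(select(.job_id == $run or ((.issue // "") | tostring) == $run)) as $ev
    | ($ev | map(select(.type == "pipeline.started"))   | .[0]) as $p0
    | ($ev | map(select(.type == "pipeline.completed")) | .[0]) as $p1
    | {resourceSpans: [{
        resource: {attributes: [
          {key: "service.name", value: {stringValue: "shipwright"}}]},
        scopeSpans: [{spans: (
          [{name: "pipeline",
            startTimeUnixNano: ($p0.ts_epoch | nanos),
            endTimeUnixNano:   ($p1.ts_epoch | nanos)}]
          + [ $ev[] | select(.type == "stage.started") | . as $s
              | ($ev | map(select(.stage == $s.stage and is_stage_end)) | .[0]) as $e
              | {name: $s.stage,
                 startTimeUnixNano: ($s.ts_epoch | nanos),
                 endTimeUnixNano:   ($e.ts_epoch | nanos)} ]
        )}]
      }]}
  '
}

sketch_trace_export "run-42" "$events_file"
```

The sketch appends nine zeros as a string inside jq rather than multiplying, because jq numbers are IEEE 754 doubles and the timestamps end up as JSON strings anyway.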

Error boundaries:

| Boundary | Behavior |
|---|---|
| Malformed event lines | jq returns empty; line skipped, warning to stderr |
| No matching events for run-id | error() + exit 1 |
| Missing jq | error() with install instructions + exit 1 |
| --send fails (curl error) | error() + emit otel.export_failed event + exit 1 |
| Auto-export path failure | Swallowed: stderr goes to 2>/dev/null and the exit status is ignored, so cleanup is never blocked |
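The auto-export boundary can be sketched as follows. fake_trace_export is a hypothetical stand-in for the real sw-otel.sh trace-export call and always fails, so the guard can be observed; auto_export_hook is likewise an illustrative name, not the code that will land in pipeline_cleanup_worktree().

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `sw-otel.sh trace-export <id> --send`.
# It always fails so the non-blocking behavior is visible.
fake_trace_export() {
  echo "export failed" >&2
  return 1
}

# Sketch of the hook: runs only when the endpoint is configured, discards
# stderr, and ignores the exit status so the caller always sees success.
auto_export_hook() {
  local run_id="$1"
  if [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ]; then
    fake_trace_export "$run_id" 2>/dev/null || true
  fi
  return 0
}

OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" auto_export_hook "run-42"
echo "hook exit status: $?"
# prints: hook exit status: 0
```

The trade-off is that failures on this path are silent, which is why the design adds the otel.export_failed event for the explicit --send path.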

Alternatives Considered

  1. Refactor existing cmd_trace() in-place — Pros: single function, no duplication / Cons: breaks unknown consumers of current output format, riskier change. The current function uses legacy event type names (pipeline_start vs pipeline.started) and non-standard OTLP structure. Migrating it would be a breaking change with unclear blast radius.

  2. New standalone sw-pipeline-export.sh script — Pros: clean separation, independent lifecycle / Cons: duplicates event-reading patterns, EVENTS_FILE path management, emit_event() helpers already in sw-otel.sh. Inconsistent with the existing pattern where all OTel concerns live in sw-otel.sh.

  3. Node.js implementation using @opentelemetry/sdk-trace-base — Pros: official SDK, guaranteed spec compliance / Cons: new dependency, heavier runtime, breaks the pure-bash pattern of the scripts directory. The project's shell scripts intentionally avoid Node dependencies for portability.

Implementation Plan

Files to modify

| File | Lines affected | Change |
|---|---|---|
| scripts/sw-otel.sh | +~150 lines after line 319 | New cmd_trace_export() function |
| scripts/sw-otel.sh | lines 540-577 (help) | Add trace-export to help text |
| scripts/sw-otel.sh | lines 582-611 (dispatch) | Add trace-export case |
| scripts/sw-pipeline.sh | lines 3178-3196 (dispatch) | Add export case |
| scripts/sw-pipeline.sh | lines 340-379 (help) | Add export to help text |
| scripts/sw-pipeline.sh | lines 2071-2105 (cleanup) | Add auto-export hook |
| config/event-schema.json | end of event_types | Add otel.trace_exported and otel.export_failed types |
| scripts/sw-otel-test.sh | +~120 lines | 10 new test cases for trace-export |

Files to create

| File | Purpose |
|---|---|
| docs/observability.md | Jaeger/Honeycomb integration guide with Docker setup example |

Dependencies

  • No new dependencies. Uses existing jq, curl, sha256sum/shasum, date, grep.
  • Cross-platform sha256: sha256sum (Linux) or shasum -a 256 (macOS) — add a sha256_hex() helper in the function using the same pattern as compat.sh.
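A minimal sketch of the sha256_hex() helper, combined with the ID truncation rule from the Decision section. The compat.sh pattern itself is not reproduced here, so the tool detection below is an assumption about its shape; trace_id_for and span_id_for are hypothetical names, and joining run-id and stage by plain concatenation is also an assumption.

```shell
#!/usr/bin/env bash
# Portable sha256: prefer sha256sum (Linux), fall back to shasum -a 256 (macOS).
# Prints only the 64-character hex digest of the first argument.
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# 32 hex characters (16 bytes) for the trace ID, 16 (8 bytes) for the span ID.
# Hypothetical helper names; the run-id/stage join is an assumption.
trace_id_for() { sha256_hex "$1" | head -c 32; }
span_id_for()  { sha256_hex "$1$2" | head -c 16; }

trace_id_for "run-42"; echo
span_id_for "run-42" "build"; echo
```

Because the IDs depend only on the run-id and stage name, running the export twice yields the same IDs, which is what makes re-exports diffable.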

Risk areas

| Risk | Severity | Mitigation |
|---|---|---|
| OTLP JSON doesn't validate against Jaeger | Medium | Test against OTLP proto3 JSON spec field names; include a test that validates structure with jq schema checks |
| Nanosecond timestamp precision | Low | Events carry ts_epoch as integer seconds; multiply by 10^9, so there are no cross-platform date issues |
| Large events.jsonl (>100K lines) causes slowness | Low | Pre-filter with grep for run-id before jq parsing; stays under 5s for 100K lines |
| pipeline_cleanup_worktree() modifying shared cleanup path | Low | Auto-export is appended after existing logic, guarded by env var check, non-blocking |

Validation Criteria

  • shipwright otel trace-export <run-id> outputs valid JSON parseable by jq
  • Output has exactly one resourceSpans entry with service.name = "shipwright"
  • Root pipeline span has empty parentSpanId, correct traceId, nano timestamps
  • Each stage span has parentSpanId equal to root span's spanId
  • Failed stage spans have status.code = 2 (ERROR) with message
  • Completed stage spans have status.code = 1 (OK)
  • Skipped stage spans have status.code = 0 (UNSET)
  • Attributes use OTLP array-of-key-value format with typed values
  • --output <file> writes to file instead of stdout
  • --send POSTs to OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces with correct Content-Type
  • Auto-export triggers only when OTEL_EXPORTER_OTLP_ENDPOINT is set
  • Auto-export failure does not affect pipeline exit code
  • Re-exporting same run-id produces identical output (deterministic IDs)
  • Run-id matches both job_id field and issue field
  • Missing run-id returns exit 1 with descriptive error
  • All 10 new tests in sw-otel-test.sh pass
  • All existing tests pass (npm test)
  • No Bash 3.2 incompatibilities (no associative arrays, no readarray)
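Several of the structural criteria above can be checked with a single jq -e expression. The sketch below is one possible shape for such a test, not the contents of sw-otel-test.sh; check_otlp_structure is a hypothetical name and the sample document uses dummy IDs.

```shell
#!/usr/bin/env bash
# Hand-written sample shaped like the expected export; IDs are dummy values.
sample='{"resourceSpans":[{"resource":{"attributes":[
  {"key":"service.name","value":{"stringValue":"shipwright"}}]},
 "scopeSpans":[{"spans":[
  {"traceId":"0123456789abcdef0123456789abcdef","spanId":"aaaaaaaaaaaaaaaa",
   "parentSpanId":"","name":"pipeline","status":{"code":1}},
  {"traceId":"0123456789abcdef0123456789abcdef","spanId":"bbbbbbbbbbbbbbbb",
   "parentSpanId":"aaaaaaaaaaaaaaaa","name":"build",
   "status":{"code":2,"message":"tests failed"}}
 ]}]}]}'

# Reads OTLP JSON on stdin. Exits 0 only if there is exactly one resourceSpans
# entry named "shipwright", exactly one root span, every other span points at
# that root, and all IDs have the expected hex length.
check_otlp_structure() {
  jq -e '
    (.resourceSpans | length) == 1
    and (.resourceSpans[0].resource.attributes
         | any(.key == "service.name" and .value.stringValue == "shipwright"))
    and ([.resourceSpans[0].scopeSpans[].spans[]] as $spans
         | ($spans | map(select(.parentSpanId == ""))) as $roots
         | ($roots | length) == 1
           and ($spans | all(.parentSpanId == "" or .parentSpanId == $roots[0].spanId))
           and ($spans | all((.traceId | length) == 32 and (.spanId | length) == 16)))
  ' >/dev/null
}

printf '%s' "$sample" | check_otlp_structure && echo "structure ok"
# prints: structure ok
```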

Endpoint Specification

CLI endpoint: shipwright pipeline export [--format otel] <run-id>

  • Input: run-id — string matching job_id or issue number in events
  • Output: OTLP JSON to stdout (exit 0), or error to stderr (exit 1)
  • Flags: --format otel (default, only format), --output <file>, --send

CLI endpoint: shipwright otel trace-export <run-id> [--output <file>] [--send]

  • Same behavior (pipeline export delegates here)
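The thin delegation from pipeline export to trace-export can be sketched as below. pipeline_export_args is a hypothetical name, and the sketch prints the command it would run instead of invoking sw-otel.sh. It uses only indexed arrays, in line with the Bash 3.2 constraint.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: consume --format (default and only value: otel) and
# forward every other argument unchanged to trace-export.
pipeline_export_args() {
  local format="otel"
  local rest
  rest=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --format)
        [ $# -ge 2 ] || { echo "error: --format needs a value" >&2; return 1; }
        format="$2"; shift 2 ;;
      *)
        rest[${#rest[@]}]="$1"; shift ;;
    esac
  done
  if [ "$format" != "otel" ]; then
    echo "error: unsupported format: $format" >&2
    return 1
  fi
  # A real implementation would exec the script here instead of echoing.
  echo "sw-otel.sh trace-export ${rest[*]}"
}

pipeline_export_args --format otel run-42 --send
# prints: sw-otel.sh trace-export run-42 --send
```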

Error codes:

  • Exit 0: Successful export
  • Exit 1: Missing argument, no matching events, jq unavailable, or --send failure

Rate Limiting: N/A (CLI tool, not a service).

Versioning: N/A (internal CLI with no external API contract).

Monitoring Checklist

Not applicable — this is a local CLI export tool, not a deployed service.

Self-monitoring is handled by:

  • otel.trace_exported event emitted on successful export (captures run_id, endpoint, spans_count)
  • otel.export_failed event emitted on --send failure (captures run_id, error)
  • Both events are queryable via existing shipwright otel metrics

Anomaly Detection / Log Analysis / Auto-Rollback

Not applicable — additive CLI feature with no deployed runtime component. Failures are local and immediately visible to the user. The auto-export path is explicitly non-blocking (|| true), so there is no production blast radius to monitor.
