Pipeline Design 327

ezigus edited this page Apr 19, 2026 · 1 revision


Design: feat(ruflo): store test stage results in ruflo and recall historical flakiness patterns

Context

The stage_test() function in scripts/lib/pipeline-stages-build.sh:596–726 currently has two gaps:

  1. No historical recall. Unlike stage_test_first (lines 20–31), which calls ruflo_recall_similar_outcomes before generating tests, stage_test runs blind: it never queries ruflo for past flakiness patterns that could inform the developer reading the log.

  2. Store only fires on pass. The ruflo_store at line 719 sits after the pass-path GitHub comment. The failure branch exits with return 1 at line 665, so failed test runs are never persisted. This means the single most valuable data point for flakiness tracking, the failures themselves, is dropped.

The stage_test_first integration (lines 20–31, 137–160) provides a proven template: declare -f guards, ruflo_available check, printf '%.2000s' truncation, || true fail-open, and learning-<repo_hash> namespace for semantic recall. Four prior memory entries document hard-won conventions: plain-text recall (not JSON), namespace matching, and declare -f guards.
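
The printf '%.2000s' truncation mentioned above can be illustrated with a minimal sketch. The variable names are hypothetical and the 3000-character input is only a stand-in for an oversized value; this is not code taken from stage_test_first.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a 3000-character value to stand in for an oversized string.
long_value="$(head -c 3000 /dev/zero | tr '\0' 'x')"

# The precision on %s caps the output at 2000 characters.
truncated="$(printf '%.2000s' "$long_value")"

echo "before=${#long_value} after=${#truncated}"
```

Because the cap is applied by printf itself, no external tool is needed and the approach works on Bash 3.2.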

Constraints

  • Bash 3.2 compatible (no associative arrays, no ${var,,})
  • All ruflo calls must be fail-open — test execution must never be blocked by ruflo failures
  • set -euo pipefail is active; unguarded failures will abort the pipeline
  • Test file (sw-ruflo-adapter-test.sh, 2531 lines) uses a mock-stub pattern: override functions, log args to temp files, assert against those files

Decision

Data Flow

stage_test() entry
 │
 ├─ 1. RECALL: ruflo_recall("test flakiness patterns failures",
 │                          "pipeline-<PIPELINE_ID>")
 │       → log result via info() for human visibility
 │       → does NOT gate test execution
 │
 ├─ 2. RUN TESTS: bash -c "$test_cmd"
 │
 ├─ 3a. ON FAIL (before return 1 at line 665):
 │       ruflo_store("stage-test-result",
 │                   "Tests FAILED (exit $test_exit). Count: <N>. Cmd: <cmd>. Coverage: 0%. Time: <ISO8601>.",
 │                   "pipeline-<PIPELINE_ID>",
 │                   "test,stage_test,failed")
 │
 └─ 3b. ON PASS (replacing lines 718–723):
         ruflo_store("stage-test-result",
                     "Tests PASSED. Count: <N>. Cmd: <cmd>. Coverage: <pct>%. Time: <ISO8601>.",
                     "pipeline-<PIPELINE_ID>",
                     "test,stage_test,passed")
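
The flow above could be sketched roughly as follows. This is an illustration under assumptions, not the actual stage_test() body: the ruflo_* and info functions are local stubs whose argument order is taken from the diagram, stage_test_sketch and STORE_LOG are hypothetical names, and the count, coverage, and timestamp fields are left out for brevity.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stubs standing in for the real adapter (assumed signatures, from the diagram).
STORE_LOG="$(mktemp)"
trap 'rm -f "$STORE_LOG"' EXIT
ruflo_available() { return 0; }
ruflo_recall()    { echo "prior run: 2 failures in auth tests"; }
ruflo_store()     { printf '%s|%s|%s|%s\n' "$1" "$2" "$3" "$4" >> "$STORE_LOG"; }
info()            { echo "INFO: $*"; }

# Hypothetical condensed version of the stage: recall, run, store.
stage_test_sketch() {
  local test_cmd="$1"
  local ns="pipeline-${SHIPWRIGHT_PIPELINE_ID:-unknown}"
  local history="" test_exit=0 outcome summary

  # 1. RECALL: informational only, never gates execution.
  if declare -f ruflo_recall >/dev/null 2>&1 && ruflo_available; then
    history="$(ruflo_recall "test flakiness patterns failures" "$ns" 2>/dev/null || true)"
    if [[ -n "$history" ]]; then
      info "Ruflo recall: historical test patterns found: $history"
    fi
  fi

  # 2. RUN TESTS: capture the exit code without tripping set -e.
  bash -c "$test_cmd" >/dev/null 2>&1 || test_exit=$?

  # 3. STORE on both paths, so failed runs are persisted too.
  if [[ "$test_exit" -eq 0 ]]; then
    outcome="passed"; summary="Tests PASSED. Cmd: $test_cmd."
  else
    outcome="failed"; summary="Tests FAILED (exit $test_exit). Cmd: $test_cmd."
  fi
  if declare -f ruflo_store >/dev/null 2>&1 && ruflo_available; then
    ruflo_store "stage-test-result" "$summary" "$ns" "test,stage_test,$outcome" 2>/dev/null || true
  fi
  return "$test_exit"
}

SHIPWRIGHT_PIPELINE_ID="327"
stage_test_sketch "exit 3" || true   # failing run
stage_test_sketch "true"             # passing run
cat "$STORE_LOG"
```

The point of the sketch is the ordering: the store happens before the function returns on either path, so a failing run leaves a record just as a passing one does.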

Key Design Choices

  • ruflo_recall (not ruflo_recall_similar_outcomes): The recall here is a simple keyword query against the pipeline namespace for prior test results — not a semantic TDD outcome lookup. ruflo_recall is the correct function because we're searching a fixed namespace with a keyword query, not doing cross-namespace semantic matching.

  • Namespace: pipeline-<PIPELINE_ID>: Stores and recalls against the same pipeline namespace. This differs from stage_test_first which uses learning-<repo_hash> for cross-pipeline TDD recall. Here, the goal is run-to-run flakiness tracking within the pipeline's history, so pipeline-* is correct.

  • Tags for classification: test,stage_test,passed or test,stage_test,failed enable future queries to filter by outcome. Tags are comma-separated strings matching the existing convention (see stage_test_first at line 157: "tdd,test_first,${TASK_TYPE:-feature}").

  • Test count via grep: grep -cE 'PASS|FAIL|✓|✗|ok [0-9]' against the test log provides a coarse count. This is imprecise but consistent with how the pipeline already parses coverage (grep -oE '[0-9]+%' at line 711). Not worth a parsing overhaul for a display-only field.

  • Fail-open at every layer: declare -f + ruflo_available + || true + empty-string fallback. Four layers of protection, matching the established pattern.
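
One way to picture the layered fail-open guard is a small wrapper like the one below. ruflo_safe is a hypothetical helper name introduced only for this sketch (the design itself inlines the guards at each call site), and the ruflo_* functions are stubs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: run a ruflo function fail-open.
# Layers: (1) target function is defined, (2) ruflo_available succeeds,
# (3) errors are swallowed via || true, (4) callers treat empty output
# as "nothing recalled".
ruflo_safe() {
  local fn="$1"; shift
  declare -f "$fn" >/dev/null 2>&1 || return 0
  declare -f ruflo_available >/dev/null 2>&1 || return 0
  ruflo_available 2>/dev/null || return 0
  "$fn" "$@" 2>/dev/null || true
}

# Stubs for demonstration only.
ruflo_available() { return 0; }
ruflo_recall()    { echo "history for $2"; }
ruflo_store()     { return 7; }   # simulates a failing backend

recalled="$(ruflo_safe ruflo_recall "test flakiness patterns failures" "pipeline-327")"
ruflo_safe ruflo_store "stage-test-result" "Tests PASSED." "pipeline-327" "test,stage_test,passed"
store_status=$?
missing="$(ruflo_safe ruflo_not_defined "x")"

echo "recalled=[$recalled] store_status=$store_status missing=[$missing]"
```

Even with set -euo pipefail active, neither the failing store nor the undefined function aborts the script, which is the property the design relies on.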

Error Handling

Failure Mode                             Behavior
ruflo_recall returns empty               No info log emitted, test proceeds normally
ruflo_recall errors                      Caught by || true
ruflo_store errors on fail path          Caught by || true
ruflo_store errors on pass path          Caught by || true
ruflo_available returns false            Entire block skipped
ruflo_recall/ruflo_store not defined     declare -f guard skips block
SHIPWRIGHT_PIPELINE_ID unset             Falls back to pipeline-unknown namespace
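
Two of these behaviors, the namespace fallback and the silent handling of an empty recall, can be sketched as below. pipeline_namespace and log_recall are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the namespace, tolerating an unset or empty ID.
# The :- expansion keeps this safe under set -u.
pipeline_namespace() {
  echo "pipeline-${SHIPWRIGHT_PIPELINE_ID:-unknown}"
}

# Hypothetical helper: log only when recall actually returned something.
log_recall() {
  if [[ -n "$1" ]]; then
    echo "Ruflo recall: historical test patterns found"
  fi
}

unset SHIPWRIGHT_PIPELINE_ID
ns_unset="$(pipeline_namespace)"

SHIPWRIGHT_PIPELINE_ID="327"
ns_set="$(pipeline_namespace)"

empty_log="$(log_recall "")"
found_log="$(log_recall "flaky: auth suite")"

echo "unset -> $ns_unset, set -> $ns_set"
```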

Alternatives Considered

  1. Store in learning-<repo_hash> namespace (like stage_test_first) — Pros: cross-pipeline recall; semantic matching works better with learning namespace. Cons: test results are pipeline-specific artifacts, not reusable TDD patterns; mixing them with TDD outcomes pollutes that namespace; the recall query would need different semantics. The pipeline namespace is the right scope for run-to-run flakiness data.

  2. Structured JSON storage instead of plain text — Pros: machine-parseable, enables richer future queries. Cons: ruflo_recall returns plain text (per feedback_ruflo_recall_plain_text.md), so JSON storage would require a separate retrieval+parse path; adds complexity for no current consumer; breaks the convention established by stage_build stores (lines 474, 498, 588) which all use plain text.

  3. Gate test execution on flakiness recall (skip known-flaky tests) — Pros: could reduce CI noise from known-flaky tests. Cons: far beyond scope; dangerous without human confirmation; test-skipping logic would need its own ADR; the recall is purely informational and that's the right starting point.

  4. Store test log content in ruflo — Pros: richer data for future recall. Cons: test logs can be large (thousands of lines); ruflo values should be concise summaries; the full log is already persisted at $ARTIFACTS_DIR/test-results.log.

Implementation Plan

  • Files to create: None
  • Files to modify:
    • scripts/lib/pipeline-stages-build.sh, stage_test() function: add recall block after line 618, add store in failure path before line 665, replace store in pass path at lines 718–723
    • scripts/sw-ruflo-adapter-test.sh — add 4 test sections before print_test_results (line 2530)
  • Dependencies: None — ruflo_recall, ruflo_store, ruflo_available are already sourced via ruflo-adapter.sh
  • Risk areas:
    • The failure path insertion point (before return 1 at line 665) is inside a nested if [[ -n "$ISSUE_NUMBER" ]] block for GitHub commenting. The ruflo store must be placed after that block but before return 1, so that GitHub comments still fire first.
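    • grep -cE exits with status 1 when the count is 0, which would abort the pipeline under set -e and pipefail if left unguarded. It still prints 0 before exiting, so a || echo "0" fallback inside a command substitution yields 0\n0 (the known pitfall documented in CLAUDE.md). Prefer || true, which suppresses the exit status and keeps grep's own 0.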
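
The grep -c pitfall can be reproduced with a short sketch. The log contents and variable names are made up for the demonstration, and the pattern is reduced to its ASCII alternatives to keep the example locale-independent.

```shell
#!/usr/bin/env bash
set -euo pipefail

log_file="$(mktemp)"
trap 'rm -f "$log_file"' EXIT
printf 'compiling...\nlinking...\n' > "$log_file"   # no test-result lines

# Pitfall: grep -c prints "0" AND exits 1, so the fallback echo
# appends a second "0" to the captured value.
bad="$(grep -cE 'PASS|FAIL|ok [0-9]' "$log_file" || echo "0")"

# Safer: keep grep's own output and only neutralise the exit status.
good="$(grep -cE 'PASS|FAIL|ok [0-9]' "$log_file" || true)"
good="${good:-0}"   # covers an unreadable file, where grep prints nothing

printf 'bad=%q\ngood=%q\n' "$bad" "$good"
```

The ${good:-0} default is what provides the "0" when grep produces no output at all, so the fallback never stacks on top of a count grep already printed.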

Validation Criteria

  • ruflo_recall is invoked before bash -c "$test_cmd" with query containing "test flakiness patterns failures" and namespace pipeline-<PIPELINE_ID>
  • Recall output is logged via info but does not prevent test execution (even on error)
  • ruflo_store fires on the fail path with tag failed, exit code, test count, cmd, and ISO8601 timestamp
  • ruflo_store fires on the pass path with tag passed, coverage %, test count, cmd, and timestamp
  • All ruflo calls guarded: declare -f ruflo_store, declare -f ruflo_available, ruflo_available, || true
  • When SHIPWRIGHT_PIPELINE_ID is unset, namespace defaults to pipeline-unknown
  • When ruflo_available returns false, no ruflo calls are made and test execution is unaffected
  • 4 new tests added to sw-ruflo-adapter-test.sh covering recall-happy, recall-logged, store-on-pass, store-on-fail
  • npm test passes with zero regressions

Test Pyramid Breakdown

  • Unit tests (4 new): Mock ruflo_recall/ruflo_store/ruflo_available as stub functions logging args to temp files. Assertions check logged arguments for correct namespace, tags, and data fields. Pattern mirrors existing stage_test_first tests at lines 2409–2528.
  • Integration tests (0): The ruflo adapter integration paths are already covered by existing adapter tests earlier in the file. No new integration boundary is introduced.
  • E2E tests (0): Pipeline-level E2E is a separate concern and not in scope.
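
The mock-stub pattern described for the unit tests might look like the following sketch. store_failed_result and assert_logged are hypothetical stand-ins written for this example; the real suite in sw-ruflo-adapter-test.sh has its own helpers and exercises stage_test() itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

CALLS="$(mktemp)"
trap 'rm -f "$CALLS"' EXIT

# Stubs override the adapter functions and log their arguments to a temp file.
ruflo_available() { return 0; }
ruflo_store()     { printf 'store|%s|%s|%s|%s\n' "$1" "$2" "$3" "$4" >> "$CALLS"; }

# Hypothetical stand-in for the store-on-fail branch of stage_test().
store_failed_result() {
  local test_exit="$1"
  local ns="pipeline-${SHIPWRIGHT_PIPELINE_ID:-unknown}"
  if declare -f ruflo_store >/dev/null 2>&1 && ruflo_available; then
    ruflo_store "stage-test-result" "Tests FAILED (exit $test_exit)." "$ns" "test,stage_test,failed" || true
  fi
}

# Minimal assertion helper (hypothetical): check the logged arguments.
assert_logged() {
  if grep -qF -- "$1" "$CALLS"; then
    echo "PASS: $2"
  else
    echo "FAIL: $2"
    return 1
  fi
}

unset SHIPWRIGHT_PIPELINE_ID
store_failed_result 2

assert_logged "test,stage_test,failed" "store-on-fail uses the failed tag"
assert_logged "pipeline-unknown"       "namespace falls back when ID is unset"
assert_logged "exit 2"                 "exit code is recorded"
```

Asserting against the logged arguments rather than against ruflo itself keeps the tests independent of whether a ruflo backend is installed.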

Coverage Targets

  • 100% branch coverage of new code: recall-available, recall-unavailable, store-on-pass, store-on-fail
  • The "ruflo unavailable" branch is implicitly covered by the declare -f/ruflo_available guard pattern already tested at lines 2440–2453

Critical Paths to Test

  • Happy path: Ruflo available → recall returns history → tests pass → store called with passed tag, coverage, count, cmd, timestamp
  • Error case 1: Tests fail → store called with failed tag, exit code, count, cmd, 0% coverage
  • Error case 2: Ruflo unavailable → recall and store both skipped, test execution unaffected
  • Edge case 1: SHIPWRIGHT_PIPELINE_ID unset → namespace pipeline-unknown
  • Edge case 2: Recall returns empty string → no info log emitted, execution proceeds

Observability

Monitoring Checklist: Not applicable — this change adds informational logging to an existing pipeline stage. No new services, endpoints, or deployments are introduced. The recall and store results are visible in the existing pipeline log output via info calls.

Anomaly Detection Triggers: Not applicable — ruflo calls are fail-open with || true. A ruflo outage produces no pipeline-visible error, only a missing info line. No alerting threshold exists for "recall returned empty."

Log Analysis: After deployment, verify by searching pipeline logs for:

  • "Ruflo recall: historical test patterns found" — confirms recall is working
  • "stage-test-result" in ruflo memory — confirms store is persisting (check via ruflo_recall "stage-test-result" "pipeline-*")

Auto-Rollback Decision Criteria: Not applicable — this is a shell script change deployed via git merge, not a service deployment. Rollback = revert the commit. The fail-open design means a broken ruflo has zero impact on test execution.
