Pipeline Design 441

ezigus edited this page Apr 27, 2026 · 1 revision

Architecture Decision Record

Design: fix(harness): ruflo memory calls leak child Node processes


Context

Problem: The ruflo_with_timeout function in scripts/lib/ruflo-adapter.sh (lines 373–437) leaks child Node processes when timeouts fire. Each loop iteration leaks ~1.5 processes (385 orphaned processes × ~4 MB ≈ 1.5 GB cumulative over ~260 iterations).

Root Cause: When ruflo_with_timeout invokes a shell function that spawns Node, the Node process becomes a child of the wrapper subshell. Node then spawns grandchildren (agentdb workers, LLM daemon processes). The current cleanup uses pkill -TERM -P "$bg_pid" (line 416), which signals only the direct children of $bg_pid. The grandchildren keep running with Node as their parent; once Node is killed they are orphaned, adopted by init (PID 1), and continue to hold memory.
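The orphaning described above can be reproduced without Node. This is a minimal sketch that assumes only a POSIX shell and sleep; the subshell stands in for Node and the background sleep for one of its workers (both are stand-ins, not the real ruflo processes).

```shell
# Stand-ins (hypothetical): the subshell plays "Node", the inner sleep plays
# a worker process that Node spawned.
pid_file=$(mktemp)
( sleep 30 & echo "$!" >"$pid_file"; wait ) &
parent_pid=$!
sleep 1                                   # let the subshell record the worker PID
worker_pid=$(cat "$pid_file")

kill -TERM "$parent_pid"                  # signal the parent only
wait "$parent_pid" 2>/dev/null || true    # reap the parent

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$worker_pid" 2>/dev/null; then
  worker_survived=yes                     # the worker is now an orphan
else
  worker_survived=no
fi
echo "worker survived parent: $worker_survived"

kill -KILL "$worker_pid" 2>/dev/null || true   # tidy up the demo orphan
rm -f "$pid_file"
```

Signalling a single PID never reaches that process's own children, which is why the fix targets the process group instead.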

Constraints:

  • Must support bash 3.2 (macOS default) — no associative arrays, no modern shell features
  • Must be fail-open: timeout failures never block the pipeline
  • Must preserve temp file cleanup on all code paths
  • Must handle systems without pkill binary
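The last constraint implies a capability check before choosing a cleanup strategy. A small sketch of such a probe follows, using only constructs available in bash 3.2; the helper name _have_cmd and the HAVE_* flags are hypothetical and do not appear in ruflo-adapter.sh.

```shell
# Hypothetical helper: succeeds if the name resolves to something runnable
# (binary on PATH, builtin, or function). "command -v" is POSIX.
_have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical flags recording which cleanup tools are present.
if _have_cmd setsid; then HAVE_SETSID=1; else HAVE_SETSID=0; fi
if _have_cmd pkill;  then HAVE_PKILL=1;  else HAVE_PKILL=0;  fi
echo "setsid=$HAVE_SETSID pkill=$HAVE_PKILL"
```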

Component Diagram

ruflo_with_timeout()  (wrapper function responsible for timeout + cleanup)
│
├─ Process Group Manager (setsid isolation)
│   ├─ Creates isolated process group at spawn (line 404)
│   └─ All descendants inherit group membership
│
├─ spawns → Shell Subshell (bg_pid)
│   │   • Creates new process group via setsid
│   │
│   └─ execs → Ruflo/Node Process
│       │   • Same process group
│       │   • Parent = subshell
│       │
│       └─ spawns → Grandchildren (agentdb, LLM daemon, etc.)
│               • Same process group
│               • Parent = Node
│
├─ Signal Delivery Layer
│   ├─ Timeout detected: kill -TERM -<bg_pid>
│   │   (kills ENTIRE process group, not just direct children)
│   ├─ Grace period: sleep 1
│   └─ Force kill: kill -KILL -<bg_pid>
│       (ensures all descendants terminate)
│
└─ Resource Cleanup (guaranteed on all paths)
    ├─ rm -f "$_rft_tmp" (temp file)
    └─ wait $bg_pid (reap zombie)

Interface Contracts

Public API: ruflo_with_timeout

// Runs a command with a wall-clock timeout and graceful process group termination.
// STDOUT is captured in a temp file and printed on success.
// On timeout or failure, the circuit breaker increments RUFLO_FAILURE_COUNT.
function ruflo_with_timeout(
 timeout_seconds: number, // Wall-clock limit (1–3600 seconds)
 ...command: string[] // Command + args to run (shell function or binary)
): {
 exit_code: number; // 0 = success, 124 = timeout, other non-zero = command failure
 stdout: string; // Output from command (printed to stdout)
 side_effects: {
 RUFLO_FAILURE_COUNT: number; // Incremented on failure, exported
 RUFLO_AVAILABLE: boolean; // May flip to false if failure_count >= 5
 };
}
// Error contracts:
// - Returns 1 (non-zero) immediately if ruflo is disabled (fail-open)
// - Catches mktemp failures → returns 1, doesn't run command
// - Catches timeout → sends SIGTERM then SIGKILL to process group
// - Catches wait failures → returns exit code from child
// - ALWAYS cleans up temp file, even on error

Internal Helper: _kill_process_group (NEW)

// Send a signal to an entire process group (every process in the group led by the given PID).
// POSIX standard: negative PID = process group ID.
// Falls back to pkill -P (direct children only) when kill -s <sig> -<pid> fails.
function _kill_process_group(
 signal: string, // "TERM" or "KILL" (uppercase, no SIG prefix)
 pid: number // Process ID (will be negated to form group ID)
): {
 exit_code: number; // 0 = signal sent, 1 = signal failed/not supported
 fallback_used: boolean; // true if kill -s failed and pkill was used
}

Invariants

  • Process Group Isolation: Subshell runs with setsid ⟹ all children inherit group ID
  • Signal Propagation: kill -s <signal> -<pid> sends signal to entire group, not just direct children
  • Cleanup Guarantee: Temp file removed regardless of code path (success, timeout, error)
  • Circuit Breaker: Failure counter increments atomically; disabled once counter ≥ RUFLO_MAX_FAILURES (default 5)

Data Flow

Success Path

1. ruflo_with_timeout(30, "shell_fn", "arg1")
 │
 ├─ mktemp → _rft_tmp created
 │
 ├─ ( setsid shell_fn "arg1" ) >_rft_tmp &
 │ └─ Subshell spawned, running in new process group
 │ ├─ Inherits setsid group isolation
 │ ├─ Spawns Node → inherited group
 │ └─ Node spawns grandchildren → inherited group
 │
 ├─ Poll: while kill -0 $bg_pid && waited < 30
 │ ├─ kill -0 checks process still alive (no signal sent)
 │ └─ sleep 1 per iteration
 │
 ├─ Process completes
 │ └─ wait $bg_pid → exit 0
 │
 ├─ cat _rft_tmp → print output
 │
 └─ rm -f _rft_tmp → cleanup, return 0

Timeout Path

1. Wait loop reaches timeout_s seconds
 │
 ├─ kill -0 $bg_pid → still alive after timeout
 │
 ├─ Send SIGTERM to entire process group:
 │ └─ kill -TERM -- -$bg_pid (a negative PID targets the whole group)
 │
 ├─ Grace period: sleep 1
 │ ├─ Node gets SIGTERM, starts cleanup
 │ └─ Grandchildren get SIGTERM (same signal)
 │
 ├─ If still alive, send SIGKILL:
 │ └─ kill -KILL -- -$bg_pid
 │ ├─ Ungraceful termination
 │ └─ All descendants killed immediately
 │
 ├─ wait $bg_pid → reap zombie
 │
 ├─ rm -f _rft_tmp → cleanup
 │
 ├─ Increment RUFLO_FAILURE_COUNT (exported)
 │
 └─ return 124 (timeout exit code)
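The success and timeout paths above can be combined into one function. What follows is a minimal sketch, not the actual ruflo_with_timeout: the name run_with_group_timeout is hypothetical, it assumes setsid(1) is installed and that the command is an executable rather than a shell function, and it leaves out the circuit breaker and the pkill fallback.

```shell
# Hypothetical, simplified stand-in for ruflo_with_timeout.
# Usage: run_with_group_timeout <seconds> <executable> [args...]
run_with_group_timeout() {
  local timeout_s="$1" waited=0 rc=0 bg_pid out_tmp
  shift
  out_tmp=$(mktemp) || return 1

  # setsid(1) makes the command the leader of a new session and process
  # group, so its PID ($bg_pid) doubles as the process group ID.
  setsid "$@" >"$out_tmp" &
  bg_pid=$!

  # Poll once per second; kill -0 only tests that the process exists.
  while kill -0 "$bg_pid" 2>/dev/null && [ "$waited" -lt "$timeout_s" ]; do
    sleep 1
    waited=$((waited + 1))
  done

  if kill -0 "$bg_pid" 2>/dev/null; then
    kill -s TERM -- -"$bg_pid" 2>/dev/null || true   # whole group, graceful
    sleep 1                                          # grace period
    kill -s KILL -- -"$bg_pid" 2>/dev/null || true   # whole group, forced
    wait "$bg_pid" 2>/dev/null || true               # reap the leader
    rm -f "$out_tmp"
    return 124
  fi

  wait "$bg_pid" || rc=$?        # collect the command's own exit status
  [ "$rc" -eq 0 ] && cat "$out_tmp"
  rm -f "$out_tmp"
  return "$rc"
}
```

A call such as run_with_group_timeout 2 sleep 10 would return 124 after roughly three seconds: two seconds of polling plus the one-second grace period.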

Fallback Path (process-group kill unavailable)

1. If kill -TERM -- -$bg_pid fails (for example, setsid was unavailable so no group exists):
 │
 ├─ Call _kill_process_group with fallback:
 │ └─ pkill -TERM -P $bg_pid
 │ ├─ Kills direct children only (suboptimal)
 │ ├─ Grandchildren may survive
 │ └─ Better than nothing on systems without process group support
 │
 ├─ Emit warning: "Process group kill failed, using pkill fallback"
 │
 └─ Continue with grace period + SIGKILL

Error Boundaries

| Component | Responsibility | Error Handling |
| --- | --- | --- |
| Process Group Creator (setsid) | Isolate all descendants | If setsid unavailable: continue without group isolation; fallback to pkill -P |
| Timeout Watcher | Detect wall-clock expiry | Always succeeds (polling loop has failsafe bound) |
| Signal Sender | Kill process group with SIGTERM/SIGKILL | Errors caught; fallback to pkill; always proceeds with grace period |
| Process Reaper (wait) | Collect exit status, prevent zombies | Errors suppressed (process may already be dead); captured as exit code |
| Temp File Manager | Clean up FD resources | Always runs; errors suppressed to prevent abort-on-cleanup (fail-open) |
| Circuit Breaker | Disable ruflo after N failures | Increment atomic counter; exported immediately; threshold checked at next call |
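The circuit-breaker behaviour can be sketched as two small functions. The variable names RUFLO_FAILURE_COUNT, RUFLO_AVAILABLE, and RUFLO_MAX_FAILURES come from this design; the function names and the "true"/"false" string encoding are assumptions made for illustration.

```shell
RUFLO_MAX_FAILURES="${RUFLO_MAX_FAILURES:-5}"
RUFLO_FAILURE_COUNT="${RUFLO_FAILURE_COUNT:-0}"
RUFLO_AVAILABLE="${RUFLO_AVAILABLE:-true}"   # assumed encoding: "true"/"false"

# Hypothetical: called after every timeout or command failure.
_ruflo_record_failure() {
  RUFLO_FAILURE_COUNT=$((RUFLO_FAILURE_COUNT + 1))
  export RUFLO_FAILURE_COUNT
}

# Hypothetical: called at the start of the next ruflo call. Trips the
# breaker once the threshold is reached, then reports availability.
_ruflo_check_available() {
  if [ "$RUFLO_FAILURE_COUNT" -ge "$RUFLO_MAX_FAILURES" ]; then
    RUFLO_AVAILABLE=false
    export RUFLO_AVAILABLE
  fi
  [ "$RUFLO_AVAILABLE" = "true" ]
}
```

Both functions have to run in the calling shell, not inside a subshell or command substitution, or the updated counter is lost when that subshell exits.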

Error Propagation

Caller (e.g., ruflo_hive_init)
 │
 └─ ruflo_with_timeout(30, "some_cmd")
 │
 ├─ Any error inside → exit_code = 1 or 124
 │ └─ Caller receives non-zero exit
 │ └─ RUFLO_FAILURE_COUNT incremented (exported)
 │ └─ If counter ≥ 5, RUFLO_AVAILABLE set to false
 │
 └─ Caller decides: retry, skip, or disable ruflo
 (never aborts pipeline due to fail-open semantics)
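From the caller's side, fail-open means mapping every non-zero result to "log and continue". The sketch below uses a stub in place of the real wrapper so that it runs on its own; ruflo_step_or_skip is a hypothetical caller, not a function in the codebase.

```shell
# Stub standing in for the real wrapper (assumption): always times out.
ruflo_with_timeout() { return 124; }

# Hypothetical caller: reports the failure and carries on.
ruflo_step_or_skip() {
  local rc=0
  ruflo_with_timeout "$@" || rc=$?
  case "$rc" in
    0)   return 0 ;;
    124) echo "ruflo step timed out; skipping" >&2 ;;
    *)   echo "ruflo step failed (exit $rc); skipping" >&2 ;;
  esac
  return 0   # never propagate the failure to the pipeline
}

ruflo_step_or_skip 30 some_cmd
echo "pipeline continues (exit status $?)"
```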

Decision

Chosen Approach: Use POSIX process groups (setsid + negative PID kill) with explicit TERM→KILL sequence, bash 3.2 compatible.

Why This Design:

  1. Correctness: Process groups capture ALL descendants (Node + grandchildren), not just direct children
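  2. Portability: Signalling a negative PID is POSIX behaviour and works on Linux, macOS, and BSD. setsid(1) is a separate utility, not part of POSIX or bash; stock macOS does not ship it, so the pkill -P fallback applies there
  3. Compatibility: Negative PIDs in kill work in bash 3.2, so no modern shell features are required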
  4. Reliability: Two-phase kill (SIGTERM + grace + SIGKILL) ensures termination even if processes ignore SIGTERM
  5. Safety: Fallback to pkill -P when process groups unavailable — degrades gracefully rather than leaking

Alternatives Considered

| Alternative | Pros | Cons | Verdict |
| --- | --- | --- | --- |
| Process groups + TERM→KILL (selected) | ✓ Captures all descendants ✓ POSIX standard ✓ Bash 3.2 compatible | Slightly more code | Solves root cause, most reliable |
| Kill by name (pkill -f "ruflo") | Simple, one line | Kills unrelated processes globally; huge blast radius | Too risky; no isolation |
| System timeout binary + --kill-after | GNU standard tool | macOS timeout(1) doesn't support --kill-after; requires fallback anyway | Forces fallback complexity; we already have one |
| EXIT trap cleanup only | Single cleanup point | Doesn't clean mid-loop; 1.5 GB leak persists until pipeline end | Doesn't solve the core issue |
| Resource limits (ulimit -u) | Prevents spawn | Doesn't clean existing orphans; breaks legitimate multi-process pipelines | Incomplete solution |
| Cgroup isolation (Linux only) | Perfect isolation | Not available on macOS; breaks portability of test suite | Lost macOS support |

Implementation Plan

Phase 1: Core Fix

File: scripts/lib/ruflo-adapter.sh

  1. Add _kill_process_group helper (lines 366–372, before ruflo_with_timeout)

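    _kill_process_group() {
      local _sig="${1:-TERM}" _pid="${2:-}"
      [[ -z "$_pid" ]] && return 1
      # A negative PID targets the whole process group; "--" ends option parsing.
      if kill -s "$_sig" -- -"$_pid" 2>/dev/null; then
        return 0
      fi
      # Fallback: signal direct children only. command -v resolves pkill on PATH
      # (declare -f only matches shell functions). The signal name is passed
      # as-is because ${var,,} requires bash 4+.
      if command -v pkill >/dev/null 2>&1; then
        pkill -"$_sig" -P "$_pid" 2>/dev/null || true
      fi
      return 1
    }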
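  2. Add setsid to subprocess spawn (line 404)

    • Change: ( "$@" ) >"$_rft_tmp" &
    • To: ( exec setsid "$@" ) >"$_rft_tmp" &
    • Note: exec replaces the subshell with setsid, so $bg_pid becomes the leader of the new process group. setsid is an external program and can only run executables, so a shell function passed as "$@" needs a wrapper.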
  3. Replace direct pkill -P with process group kill (line 416)

    • Change: pkill -TERM -P "$bg_pid" 2>/dev/null || true
    • To: _kill_process_group "TERM" "$bg_pid"
  4. Add explicit TERM→KILL grace period (after line 416)

    sleep 1
    _kill_process_group "KILL" "$bg_pid"

Phase 2: Test Coverage

File: scripts/sw-ruflo-timeout-test.sh

  1. Add Test 8: Run 10 timeout iterations, spawn grandchild per iteration, verify zero orphans
  2. Update header: Note both issues #426 and #441 fixed
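One possible shape for Test 8 is sketched below. It is not the test from sw-ruflo-timeout-test.sh: the TERM→KILL sequence is inlined where the real test would call ruflo_with_timeout, three iterations stand in for ten, and survivors are detected through a heartbeat file, not pgrep, so the sketch does not depend on procps. It assumes setsid(1) is installed.

```shell
iterations=3                      # the plan calls for 10; kept small here
work_dir=$(mktemp -d)
i=0
while [ "$i" -lt "$iterations" ]; do
  hb="$work_dir/hb.$i"
  : >"$hb"
  # Leader (sh) starts a grandchild loop that appends a heartbeat every second.
  setsid sh -c 'while :; do echo tick >>"$0"; sleep 1; done & wait' "$hb" &
  leader=$!
  sleep 1                                           # simulated timeout expiry
  kill -s TERM -- -"$leader" 2>/dev/null || true    # whole group, graceful
  sleep 1                                           # grace period
  kill -s KILL -- -"$leader" 2>/dev/null || true    # whole group, forced
  wait "$leader" 2>/dev/null || true
  i=$((i + 1))
done

# Any heartbeat file that still grows belongs to an orphaned grandchild.
before=$(cat "$work_dir"/hb.* | wc -l)
sleep 2
after=$(cat "$work_dir"/hb.* | wc -l)
growth=$((after - before))
rm -rf "$work_dir"

if [ "$growth" -eq 0 ]; then
  echo "PASS: no orphaned grandchildren"
else
  echo "FAIL: orphaned grandchildren still writing heartbeats" >&2
fi
```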

Phase 3: Integration Validation

File: scripts/sw-e2e-smoke-test.sh

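  1. Baseline measurement: Before first test, capture the Node process count with pgrep -f "node" | wc -l (pgrep -c is procps-only and unavailable in the BSD pgrep shipped with macOS)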
  2. Delta check: After final test, measure final count, assert delta ≤ 3 (allow variance)
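The baseline and delta check could look like the following sketch. The helper name count_procs is hypothetical, and the placement of the smoke tests between the two measurements is indicated only by a comment.

```shell
# Hypothetical helper: count processes whose command line matches a pattern.
# "pgrep -f | wc -l" is used because BSD/macOS pgrep has no -c option.
count_procs() {
  pgrep -f "$1" 2>/dev/null | wc -l | tr -d ' '
}

baseline=$(count_procs "node")

# ... the smoke tests would run here ...

final=$(count_procs "node")
delta=$((final - baseline))
if [ "$delta" -le 3 ]; then
  echo "process delta OK ($delta)"
else
  echo "process leak suspected: delta=$delta" >&2
fi
```

Matching on "node" counts every Node process on the machine, so unrelated activity can shift the number; the tolerance of 3 in the plan absorbs some of that.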

Validation Criteria

  • Zero orphaned processes after timeout (verified by pgrep post-test)
  • All 8 tests pass in sw-ruflo-timeout-test.sh (including new Test 8)
  • No FD regression (Tests 1–7 still pass — issue #426 fixed)
  • Process delta ≤ 3 in smoke test (allowing natural variance)
  • Manual 10-iteration test: ≤1 proc leak per iteration (vs. ~1.48 before fix)
  • Bash 3.2 compatible — no modern shell constructs
  • Fail-open semantics preserved — timeouts never block pipeline
  • Temp file always cleaned — no FD leaks on any code path
  • Circuit breaker still functional — failure count increments, ruflo disabled after threshold

Risk Areas

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| setsid unavailable on some systems | Low | Fallback: pkill -P still works (suboptimal but functional) |
| Negative PID unsupported | Very low | POSIX standard since 1988; all major systems support it |
| Grace period too short (1s) | Very low | Node responds <100ms; 1s is conservative |
| Grace period blocks pipeline | Low | Worst-case +260s for full test run (~4 min); acceptable |
| Trap re-entrancy | None | Only called once per ruflo_with_timeout invocation |
| Flock contention | None | Process group kill ensures all locks released |
| Silent failure (no error signal) | Low | Emit warning if kill -s fails; continue anyway |

Key Files

  • Modified: scripts/lib/ruflo-adapter.sh (lines 404, 416, +grace period, +helper)
  • Modified: scripts/sw-ruflo-timeout-test.sh (add Test 8)
  • Modified: scripts/sw-e2e-smoke-test.sh (baseline + delta check)
  • Related issues fixed: #426 (FD hang), #441 (process leak)
