Pipeline Design 441

ezigus edited this page Apr 27, 2026 · 1 revision

Architecture Decision Record

Design: fix(harness): ruflo memory calls leak child Node processes

Context

Problem: The ruflo_with_timeout function in scripts/lib/ruflo-adapter.sh (lines 373–437) leaks child Node processes when timeouts fire. Each loop iteration leaks ~1.5 processes (385 orphaned procs ×ばつ ~4 MB = 1.5 GB cumulative over ~260 iterations).

Root Cause: When ruflo_with_timeout invokes a shell function that spawns Node, the Node process becomes a child of the wrapper subshell. Node then spawns grandchildren (agentdb workers, LLM daemon processes). The current cleanup uses pkill -TERM -P "$bg_pid" (line 416), which kills direct children only. Grandchildren remain alive with Node as parent; when Node is killed, grandchildren become orphans, are adopted by init(1), and accumulate in memory.

Constraints:

Must support bash 3.2 (macOS default) — no associative arrays, no modern shell features
Must be fail-open: timeout failures never block the pipeline
Must preserve temp file cleanup on all code paths
Must handle systems without pkill binary

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│ ruflo_with_timeout() │
│ (wrapper function responsible for timeout + cleanup) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Process Group Manager (setsid isolation) │ │
│ │ ├─ Creates isolated process group at spawn (line 404) │ │
│ │ └─ All descendants inherit group membership │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ├─ spawns ─────────────────────────┐ │
│ │ │
│ ┌──────────────────────────────────┐ │ │
│ │ Shell Subshell (bg_pid) │ │ │
│ │ • Creates new process group │ │ │
│ │ via setsid │ │ │
│ └──────────────────────────────────┘ │ │
│ │ │ │
│ ├─ execs ─────────────┐ │ │
│ │ │ │
│ ┌──────────────────────────────┐│ │ │
│ │ Ruflo/Node Process ││ │ │
│ │ • Same process group ││ │ │
│ │ • Parent = subshell ││ │ │
│ └──────────────────────────────┘│ │ │
│ │ │ │ │
│ ├─ spawns ──────────┐ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────┐┌─┘ │ │ │
│ │ Grandchildren (agentdb, ││ │ │ │
│ │ LLM daemon, etc.) ││ │ │ │
│ │ • Same process group ││ │ │ │
│ │ • Parent = Node ││ │ │ │
│ └──────────────────────────────┘│ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────────────────────────┴───┴──────┐│
│ │ Signal Delivery Layer ││
│ │ ├─ Timeout detected: kill -TERM -<negative_bg_pid> ││
│ │ │ (kills ENTIRE process group, not just direct children) ││
│ │ ├─ Grace period: sleep 1 ││
│ │ └─ Force kill: kill -KILL -<negative_bg_pid> ││
│ │ (ensures all descendants terminate) ││
│ └───────────────────────────────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Resource Cleanup (guaranteed on all paths) │ │
│ │ ├─ rm -f "$_rft_tmp" (temp file) │ │
│ │ └─ wait $bg_pid (reap zombie) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Interface Contracts

Public API: `ruflo_with_timeout`

// Runs a command with a wall-clock timeout and graceful process group termination.
// STDIN/STDOUT/STDERR are captured in a temp file and printed on success.
// On timeout or failure, circuit-breaker increments RUFLO_FAILURE_COUNT.
function ruflo_with_timeout(
 timeout_seconds: number, // Wall-clock limit (1–3600 seconds)
 ...command: string[] // Command + args to run (shell function or binary)
): {
 exit_code: number; // 0 = success, 124 = timeout, >0 = command failure
 stdout: string; // Output from command (printed to stdout)
 side_effects: {
 RUFLO_FAILURE_COUNT: number; // Incremented on failure, exported
 RUFLO_AVAILABLE: boolean; // May flip to false if failure_count >= 5
 };
}
// Error contracts:
// - Returns 1 (non-zero) immediately if ruflo is disabled (fail-open)
// - Catches mktemp failures → returns 1, doesn't run command
// - Catches timeout → sends SIGTERM then SIGKILL to process group
// - Catches wait failures → returns exit code from child
// - ALWAYS cleans up temp file, even on error

Internal Helper: `_kill_process_group` (NEW)

// Send a signal to an entire process group (all descendants of a parent PID).
// POSIX standard: negative PID = process group ID.
// Fallback to pkill when kill -s <sig> -<pid> fails.
function _kill_process_group(
 signal: string, // "TERM" or "KILL" (uppercase, no SIG prefix)
 pid: number // Process ID (will be negated to form group ID)
): {
 exit_code: number; // 0 = signal sent, 1 = signal failed/not supported
 fallback_used: boolean; // true if kill -s failed and pkill was used
}

Invariants

Process Group Isolation: Subshell runs with setsid ⟹ all children inherit group ID
Signal Propagation: kill -s <signal> -<pid> sends signal to entire group, not just direct children
Cleanup Guarantee: Temp file removed regardless of code path (success, timeout, error)
Circuit Breaker: Failure counter increments atomically; disabled once counter ≥ RUFLO_MAX_FAILURES (default 5)

Data Flow

Success Path

1. ruflo_with_timeout(30, "shell_fn", "arg1")
 │
 ├─ mktemp → _rft_tmp created
 │
 ├─ ( setsid shell_fn "arg1" ) >_rft_tmp &
 │ └─ Subshell spawned, running in new process group
 │ ├─ Inherits setsid group isolation
 │ ├─ Spawns Node → inherited group
 │ └─ Node spawns grandchildren → inherited group
 │
 ├─ Poll: while kill -0 $bg_pid && waited < 30
 │ ├─ kill -0 checks process still alive (no signal sent)
 │ └─ sleep 1 per iteration
 │
 ├─ Process completes
 │ └─ wait $bg_pid → exit 0
 │
 ├─ cat _rft_tmp → print output
 │
 └─ rm -f _rft_tmp → cleanup, return 0

Timeout Path

1. Wait loop reaches timeout_s seconds
 │
 ├─ kill -0 $bg_pid → still alive after timeout
 │
 ├─ Send SIGTERM to entire process group:
 │ └─ kill -TERM -<negative_bg_pid> (negation sends to group)
 │
 ├─ Grace period: sleep 1
 │ ├─ Node gets SIGTERM, starts cleanup
 │ └─ Grandchildren get SIGTERM (same signal)
 │
 ├─ If still alive, send SIGKILL:
 │ └─ kill -KILL -<negative_bg_pid>
 │ ├─ Ungraceful termination
 │ └─ All descendants killed immediately
 │
 ├─ wait $bg_pid → reap zombie
 │
 ├─ rm -f _rft_tmp → cleanup
 │
 ├─ Increment RUFLO_FAILURE_COUNT (exported)
 │
 └─ return 124 (timeout exit code)

Fallback Path (no pkill/setsid)

1. If kill -TERM -<negative_pid> fails (not supported):
 │
 ├─ Call _kill_process_group with fallback:
 │ └─ pkill -TERM -P $bg_pid
 │ ├─ Kills direct children only (suboptimal)
 │ ├─ Grandchildren may survive
 │ └─ Better than nothing on systems without process group support
 │
 ├─ Emit warning: "Process group kill failed, using pkill fallback"
 │
 └─ Continue with grace period + SIGKILL

Error Boundaries

Component	Responsibility	Error Handling
Process Group Creator (setsid)	Isolate all descendants	If setsid unavailable: continue without group isolation; fallback to pkill -P
Timeout Watcher	Detect wall-clock expiry	Always succeeds (polling loop has failsafe bound)
Signal Sender	Kill process group with SIGTERM/SIGKILL	Errors caught; fallback to pkill; always proceeds with grace period
Process Reaper (wait)	Collect exit status, prevent zombies	Errors suppressed (process may already be dead); captured as exit code
Temp File Manager	Clean up FD resources	Always runs; errors suppressed to prevent abort-on-cleanup (fail-open)
Circuit Breaker	Disable ruflo after N failures	Increment atomic counter; exported immediately; threshold checked at next call

Error Propagation

Caller (e.g., ruflo_hive_init)
 │
 └─ ruflo_with_timeout(30, "some_cmd")
 │
 ├─ Any error inside → exit_code = 1 or 124
 │ └─ Caller receives non-zero exit
 │ └─ RUFLO_FAILURE_COUNT incremented (exported)
 │ └─ If counter ≥ 5, RUFLO_AVAILABLE set to false
 │
 └─ Caller decides: retry, skip, or disable ruflo
 (never aborts pipeline due to fail-open semantics)

Decision

Chosen Approach: Use POSIX process groups (setsid + negative PID kill) with explicit TERM→KILL sequence, bash 3.2 compatible.

Why This Design:

Correctness: Process groups capture ALL descendants (Node + grandchildren), not just direct children
Portability: POSIX standard; works on Linux, macOS, BSD — no GNU-only dependencies
Compatibility: Bash 3.2 supports setsid and negative PID in kill — no modern shell features required
Reliability: Two-phase kill (SIGTERM + grace + SIGKILL) ensures termination even if processes ignore SIGTERM
Safety: Fallback to pkill -P when process groups unavailable — degrades gracefully rather than leaking

Alternatives Considered

Alternative	Pros	Cons	Why Not Selected
Process groups + TERM→KILL (selected)	✓ Captures all descendants ✓ POSIX standard ✓ Bash 3.2 compatible	Slightly more code	Solves root cause, most reliable
Kill by name (`pkill -f "ruflo"`)	Simple, one line	Kills unrelated processes globally; huge blast radius	Too risky; no isolation
System `timeout` binary + `--kill-after`	GNU standard tool	macOS timeout(1) doesn't support --kill-after; requires fallback anyway	Forces fallback complexity; we already have one
EXIT trap cleanup only	Single cleanup point	Doesn't clean mid-loop; 1.5 GB leak persists until pipeline end	Doesn't solve the core issue
Resource limits (ulimit -p)	Prevents spawn	Doesn't clean existing orphans; breaks legitimate multi-process pipelines	Incomplete solution
Cgroup isolation (Linux only)	Perfect isolation	Not available on macOS; breaks portability of test suite	Lost macOS support

Implementation Plan

Phase 1: Core Fix

File: scripts/lib/ruflo-adapter.sh

Add _kill_process_group helper (lines 366–372, before ruflo_with_timeout)

_kill_process_group() {
 local _sig="${1:-TERM}" _pid="${2:-}"
 [[ -z "$_pid" ]] && return 1
 if kill -s "$_sig" -"$_pid" 2>/dev/null; then
 return 0
 elif declare -f pkill >/dev/null 2>&1; then
 pkill -"${_sig,,}" -P "$_pid" 2>/dev/null || true
 fi
 return 1
}

Add setsid to subprocess spawn (line 404)
- Change: ( "$@" ) >"$_rft_tmp" &
- To: ( setsid "$@" ) >"$_rft_tmp" &
Replace direct pkill -P with process group kill (line 416)
- Change: pkill -TERM -P "$bg_pid" 2>/dev/null || true
- To: _kill_process_group "TERM" "$bg_pid"
Add explicit TERM→KILL grace period (after line 416)
```
sleep 1
_kill_process_group "KILL" "$bg_pid"
```

Phase 2: Test Coverage

File: scripts/sw-ruflo-timeout-test.sh

Add Test 8: Run 10 timeout iterations, spawn grandchild per iteration, verify zero orphans
Update header: Note both issues #426 and #441 fixed

Phase 3: Integration Validation

File: scripts/sw-e2e-smoke-test.sh

Baseline measurement: Before first test, capture pgrep -c -f "node" count
Delta check: After final test, measure final count, assert delta ≤ 3 (allow variance)

Validation Criteria

Zero orphaned processes after timeout (verified by pgrep post-test)
All 8 tests pass in sw-ruflo-timeout-test.sh (including new Test 8)
No FD regression (Tests 1–7 still pass — issue #426 fixed)
Process delta ≤ 3 in smoke test (allowing natural variance)
Manual 10-iteration test: ≤1 proc leak per iteration (vs. ~1.48 before fix)
Bash 3.2 compatible — no modern shell constructs
Fail-open semantics preserved — timeouts never block pipeline
Temp file always cleaned — no FD leaks on any code path
Circuit breaker still functional — failure count increments, ruflo disabled after threshold

Risk Areas

Risk	Likelihood	Mitigation
`setsid` unavailable on some system	Low	Fallback: `pkill -P` still works (suboptimal but functional)
Negative PID unsupported	Very low	POSIX standard since 1988; all major systems support
Grace period too short (1s)	Very low	Node responds <100ms; 1s is conservative
Grace period blocks pipeline	Low	Worst-case +260s for full test run (~4 min); acceptable
Trap re-entrancy	None	Only called once per `ruflo_with_timeout` invocation
Flock contention	None	Process group kill ensures all locks released
Silent failure (no error signal)	Low	Emit warning if kill -s fails; continue anyway

Key Files

Modified: scripts/lib/ruflo-adapter.sh (lines 404, 416, +grace period, +helper)
Modified: scripts/sw-ruflo-timeout-test.sh (add Test 8)
Modified: scripts/sw-e2e-smoke-test.sh (baseline + delta check)
Related issues fixed: #426 (FD hang), #441 (process leak)

Pipeline Design 441

Architecture Decision Record

Design: fix(harness): ruflo memory calls leak child Node processes

Context

Component Diagram

Interface Contracts

Public API: ruflo_with_timeout

Internal Helper: _kill_process_group (NEW)

Invariants

Data Flow

Success Path

Timeout Path

Fallback Path (no pkill/setsid)

Error Boundaries

Error Propagation

Decision

Alternatives Considered

Implementation Plan

Phase 1: Core Fix

Phase 2: Test Coverage

Phase 3: Integration Validation

Validation Criteria

Risk Areas

Key Files

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Public API: `ruflo_with_timeout`

Internal Helper: `_kill_process_group` (NEW)