Pipeline Plan 154

ezigus edited this page Mar 15, 2026 · 1 revision

Implementation Plan: Build Loop Context Exhaustion Prevention

Alternatives Considered

Approach A: Token-based threshold monitoring with proactive summarization (CHOSEN)

Monitor accumulated LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS per iteration against a configurable threshold (70% of model context window)
When threshold is crossed, generate a compressed state summary and trigger a session restart with that summary injected
Pros: Uses existing token tracking infrastructure, leverages existing session restart mechanism, minimal new code
Blast radius: Small — adds a new check in the main loop, a new summarization function, and a new module file
Trade-offs: Token counts from Claude CLI are per-iteration (not cumulative conversation context), so we estimate cumulative context usage

Approach B: Character-based prompt size tracking only

Only track prompt character growth across iterations
Pros: Simpler, no new dependency on token math
Cons: Doesn't account for Claude's internal conversation context accumulation; prompt size alone doesn't reflect total context usage
Rejected: Insufficient — the real exhaustion happens in Claude's conversation context, not just our injected prompt

Approach C: Claude CLI `--max-tokens` monitoring via stderr parsing

Parse Claude CLI stderr for context limit warnings
Pros: Most accurate
Cons: Fragile (depends on CLI output format), reactive not proactive
Rejected: We want proactive prevention, not reactive recovery

Design Decision

Approach A — Track cumulative token usage across iterations and proactively trigger summarization + session restart at 70% of the context window. This builds on:

Existing accumulate_loop_tokens() in sw-loop.sh (already tracks per-iteration tokens)
Existing run_loop_with_restarts() session restart mechanism
Existing emit_event telemetry system

The key insight: each Claude CLI invocation receives the full conversation context, so total tokens grow as iterations accumulate. We track cumulative input and output tokens (LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS) as a proxy for context window usage and trigger preemptive summarization before hitting the limit.
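As a rough illustration of the arithmetic behind this decision, the sketch below accumulates per-iteration token counts against a 200k window and reports the first iteration at which usage reaches 70%. The per-iteration numbers are invented for the example and are not measurements from the build loop.

```shell
#!/usr/bin/env bash
# Illustrative numbers only: the per-iteration token counts are made up.
window=200000      # context window, matching the plan's default
threshold_pct=70   # trigger point, matching the plan's default

cumulative=0
trigger_iteration=0
i=0
for tokens in 18000 22000 25000 30000 28000 26000 24000; do
  i=$((i + 1))
  cumulative=$((cumulative + tokens))
  # Multiply before dividing: shell arithmetic is integer-only.
  pct=$((cumulative * 100 / window))
  echo "iteration $i: cumulative=$cumulative usage=${pct}%"
  if [ "$pct" -ge "$threshold_pct" ] && [ "$trigger_iteration" -eq 0 ]; then
    trigger_iteration=$i
  fi
done
echo "threshold first reached at iteration $trigger_iteration"
```

With these inputs the 140,000-token mark (70% of 200,000) is passed during the sixth iteration, which is where the proactive restart would be requested.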

Files to Modify

| File | Action | Purpose |
| --- | --- | --- |
| `scripts/lib/loop-context-monitor.sh` | CREATE | New module: context threshold monitoring + state summarization |
| `scripts/sw-loop.sh` | MODIFY | Source new module, add context check in main loop, wire summarization into restart |
| `scripts/lib/loop-iteration.sh` | MODIFY | Add cumulative token tracking variable, emit enhanced context metrics |
| `scripts/sw-loop-test.sh` | MODIFY | Add test cases for context monitoring and summarization |

Implementation Steps

Step 1: Create `scripts/lib/loop-context-monitor.sh`

New module with these functions:

# Module guard
_LOOP_CONTEXT_MONITOR_LOADED=1

# Constants
CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}            # Default: 200k (Opus/Sonnet)
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}  # Trigger at 70%
CONTEXT_SUMMARIZATION_TRIGGERED=false

# check_context_exhaustion()
#   - Computes cumulative_tokens = LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS
#   - Computes usage_pct = cumulative_tokens * 100 / CONTEXT_WINDOW_TOKENS
#     (multiply before dividing: shell arithmetic is integer-only)
#   - If usage_pct >= CONTEXT_EXHAUSTION_THRESHOLD:
#       - Emits loop.context_exhaustion_warning event
#       - Returns 0 (threshold crossed)
#   - Else returns 1 (safe)

# summarize_loop_state()
#   - Writes compressed state to $LOG_DIR/context-summary.md:
#       - Goal (original, not accumulated)
#       - Iteration count and test status
#       - Files modified (from git diff --name-only LOOP_START_COMMIT..HEAD)
#       - Last error summary (from error-summary.json)
#       - Key fixes attempted (from log entries, last 5)
#       - Test results status
#   - Returns path to summary file

# get_context_usage_pct()
#   - Returns current context usage as integer percentage
#   - Used by telemetry and logging
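The two threshold functions outlined above could look roughly like the following. This is a minimal sketch, not the module itself: the emit_event stand-in and its argument format are assumptions, since the project's real emit_event signature is not shown in this plan.

```shell
#!/usr/bin/env bash
# Sketch of the threshold half of scripts/lib/loop-context-monitor.sh.

# Module guard: do nothing if the file has already been sourced.
if [ -n "${_LOOP_CONTEXT_MONITOR_LOADED:-}" ]; then
  return 0 2>/dev/null || true
fi
_LOOP_CONTEXT_MONITOR_LOADED=1

CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}
LOOP_INPUT_TOKENS=${LOOP_INPUT_TOKENS:-0}
LOOP_OUTPUT_TOKENS=${LOOP_OUTPUT_TOKENS:-0}

# Assumption: stand-in used only when the real emit_event is not loaded.
if ! command -v emit_event >/dev/null 2>&1; then
  emit_event() { echo "event: $*" >&2; }
fi

get_context_usage_pct() {
  local window="${CONTEXT_WINDOW_TOKENS:-0}"
  local used=$(( ${LOOP_INPUT_TOKENS:-0} + ${LOOP_OUTPUT_TOKENS:-0} ))
  # Division-by-zero guard: report 0% when the window is not positive.
  if [ "$window" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( used * 100 / window ))
}

check_context_exhaustion() {
  local pct
  pct=$(get_context_usage_pct)
  if [ "$pct" -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]; then
    emit_event "loop.context_exhaustion_warning" \
      "usage_pct=$pct" "threshold=$CONTEXT_EXHAUSTION_THRESHOLD"
    return 0   # threshold crossed
  fi
  return 1     # safe
}
```

Reporting 0% for a non-positive window means a misconfigured CONTEXT_WINDOW_TOKENS never triggers a restart; failing loudly instead would be an equally reasonable choice.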

Step 2: Wire into main loop (`sw-loop.sh`)

Source the new module near the top (after other lib sources)

After accumulate_loop_tokens() call in the main loop (~line 2166), add:

# Context exhaustion prevention
if check_context_exhaustion; then
  warn "Context usage at $(get_context_usage_pct)%, triggering proactive summarization"
  summarize_loop_state
  STATUS="context_exhaustion"
  write_state
  write_progress
  break  # Exit to restart wrapper
fi

In run_loop_with_restarts(), handle STATUS="context_exhaustion" as a restart-worthy condition (alongside "stuck_restart")
When restarting after context exhaustion, inject the summary from context-summary.md into the restart context

Step 3: Enhanced token tracking in `loop-iteration.sh`

Add to run_claude_iteration() after the accumulate_loop_tokens call:

Emit loop.context_usage event with: cumulative_input, cumulative_output, usage_pct, threshold
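One possible shape for this per-iteration event is sketched below. The emit_event here is a hypothetical stand-in that appends one JSON object per line; the project's real emit_event, the EVENTS_FILE variable, and the event schema may all differ.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for emit_event: appends one JSON object per line.
EVENTS_FILE="${EVENTS_FILE:-./events.jsonl}"
emit_event() {
  local type="$1"; shift
  local fields="" kv
  for kv in "$@"; do
    # Values in this sketch are integers, so they are written unquoted.
    fields="$fields,\"${kv%%=*}\":${kv#*=}"
  done
  printf '{"type":"%s"%s}\n' "$type" "$fields" >> "$EVENTS_FILE"
}

# emit_context_usage is an illustrative name, not one taken from the plan.
emit_context_usage() {
  local input="${LOOP_INPUT_TOKENS:-0}" output="${LOOP_OUTPUT_TOKENS:-0}"
  local window="${CONTEXT_WINDOW_TOKENS:-200000}" pct=0
  if [ "$window" -gt 0 ]; then
    pct=$(( (input + output) * 100 / window ))
  fi
  emit_event "loop.context_usage" \
    "cumulative_input=$input" "cumulative_output=$output" \
    "usage_pct=$pct" "threshold=${CONTEXT_EXHAUSTION_THRESHOLD:-70}"
}
```

Calling emit_context_usage right after the accumulate_loop_tokens call would record one line per iteration, which is what the monitoring checks later in this plan rely on.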

Step 4: Summarization state preservation

In summarize_loop_state():

Read ORIGINAL_GOAL (already preserved in sw-loop.sh)
Read the list of modified files via git diff --name-only LOOP_START_COMMIT..HEAD
Extract error patterns from error-summary.json
Extract last 5 log entries from LOG_ENTRIES
Write to $LOG_DIR/context-summary.md in structured format
The restart mechanism already copies error-summary.json and reads progress.md — we add context-summary.md to the restart context injection
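The steps above might be sketched as follows. Several details are assumptions made for illustration: the variable names ITERATION and TEST_PASSED, LOG_ENTRIES being a newline-separated string, and applying the 2000-character cap (mentioned under the debugging notes) with head -c.

```shell
#!/usr/bin/env bash
# Sketch only. ORIGINAL_GOAL, LOOP_START_COMMIT, LOG_DIR and LOG_ENTRIES are
# expected from sw-loop.sh; ITERATION and TEST_PASSED are assumed names.
summarize_loop_state() {
  local log_dir="${LOG_DIR:-.}"
  local out="$log_dir/context-summary.md"
  local files="" last_error="none recorded"

  # Files touched since the loop started; tolerate a missing repo or commit.
  if [ -n "${LOOP_START_COMMIT:-}" ]; then
    files=$(git diff --name-only "$LOOP_START_COMMIT"..HEAD 2>/dev/null || true)
  fi
  if [ -f "$log_dir/error-summary.json" ]; then
    last_error=$(cat "$log_dir/error-summary.json")
  fi

  {
    printf 'Goal: %s\n\n' "${ORIGINAL_GOAL:-unknown}"
    printf 'Iteration: %s\nTests passing: %s\n\n' \
      "${ITERATION:-0}" "${TEST_PASSED:-unknown}"
    printf 'Files modified:\n%s\n\n' "${files:-none}"
    printf 'Last error summary:\n%s\n\n' "$last_error"
    printf 'Recent log entries:\n'
    # Assumes LOG_ENTRIES holds one entry per line; keep only the last five.
    printf '%s\n' "${LOG_ENTRIES:-}" | tail -n 5
  } | head -c 2000 > "$out"   # cap the summary so it cannot bloat the restart

  echo "$out"
}
```

Truncating at a fixed byte count can cut the log entries mid-line; a real implementation might prefer to trim whole sections instead.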

Step 5: Restart integration

In run_loop_with_restarts() (~line 2389):

Add context_exhaustion to the list of restartable statuses
When restarting after context_exhaustion:
- Inject context-summary.md content into GOAL as "## Previous Session Context (Summarized)"
- Reset token counters
- Emit loop.context_exhaustion_restart event
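A minimal sketch of this restart branch is shown below. The real run_loop_with_restarts() is not reproduced in this plan, so run_single_agent_loop and emit_event are stubbed here purely to make the example runnable; the stub's behavior (exhaust once, then complete) is invented.

```shell
#!/usr/bin/env bash
MAX_RESTARTS=${MAX_RESTARTS:-3}

emit_event() { echo "event: $*" >&2; }   # stand-in for the project's helper

# Stub: the first session ends on context exhaustion, the second completes.
run_single_agent_loop() {
  SESSION_COUNT=$(( ${SESSION_COUNT:-0} + 1 ))
  if [ "$SESSION_COUNT" -eq 1 ]; then
    LOOP_INPUT_TOKENS=150000
    STATUS="context_exhaustion"
  else
    STATUS="complete"
  fi
}

run_loop_with_restarts() {
  local restarts=0
  local summary_file="${LOG_DIR:-.}/context-summary.md"
  while :; do
    run_single_agent_loop
    case "$STATUS" in
      stuck_restart|context_exhaustion) ;;   # restartable statuses
      *) return 0 ;;
    esac
    if [ "$restarts" -ge "$MAX_RESTARTS" ]; then
      return 1   # give up rather than restart forever
    fi
    restarts=$((restarts + 1))
    if [ "$STATUS" = "context_exhaustion" ]; then
      if [ -f "$summary_file" ]; then
        # Rebuild GOAL from ORIGINAL_GOAL so summaries do not stack up.
        GOAL="$ORIGINAL_GOAL

## Previous Session Context (Summarized)
$(cat "$summary_file")"
      fi
      # The new session starts with an empty context, so counters restart.
      LOOP_INPUT_TOKENS=0
      LOOP_OUTPUT_TOKENS=0
      emit_event "loop.context_exhaustion_restart" "restart=$restarts"
    fi
  done
}
```

Rebuilding GOAL from ORIGINAL_GOAL on every restart, rather than appending to the current GOAL, keeps repeated restarts from growing the prompt with stale summaries.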

Step 6: Test coverage

Add to sw-loop-test.sh:

Unit test: threshold calculation. Set LOOP_INPUT_TOKENS and LOOP_OUTPUT_TOKENS to known values, verify check_context_exhaustion returns correctly at <70%, =70%, >70%
Unit test: summarization output — Create mock state (git, error-summary.json, log entries), run summarize_loop_state, verify output contains essential fields
Unit test: context window sizing — Verify CONTEXT_WINDOW_TOKENS defaults and override via env
Integration test: restart trigger — Simulate tokens exceeding threshold, verify loop breaks with context_exhaustion status and emits correct event
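The boundary cases for the threshold unit test could be exercised along these lines. The assert_status helper is hypothetical (sw-loop-test.sh may already have its own assertion functions), and a compact copy of the function under test is inlined so the sketch runs on its own.

```shell
#!/usr/bin/env bash
CONTEXT_WINDOW_TOKENS=200000
CONTEXT_EXHAUSTION_THRESHOLD=70

# Compact inline version of the function under test (event emission omitted).
check_context_exhaustion() {
  [ "$CONTEXT_WINDOW_TOKENS" -gt 0 ] || return 1
  local pct=$(( (LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS) * 100 / CONTEXT_WINDOW_TOKENS ))
  [ "$pct" -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]
}

PASS=0
FAIL=0
# Hypothetical helper: assert_status <label> <expected exit code> <input> <output>
assert_status() {
  local label="$1" expected="$2" actual=0
  LOOP_INPUT_TOKENS="$3"
  LOOP_OUTPUT_TOKENS="$4"
  check_context_exhaustion || actual=$?
  if [ "$actual" -eq "$expected" ]; then
    PASS=$((PASS + 1))
  else
    FAIL=$((FAIL + 1))
    echo "FAIL: $label (expected $expected, got $actual)"
  fi
}

assert_status "below threshold (69%)" 1 100000 38000
assert_status "at threshold (70%)"    0 100000 40000
assert_status "above threshold (85%)" 0 120000 50000
assert_status "zero tokens reported"  1 0 0
echo "passed=$PASS failed=$FAIL"
```

The zero-token case doubles as the "no false positive when jq is unavailable" check listed under the critical paths.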

Task Checklist

Task 1: Create scripts/lib/loop-context-monitor.sh with module guard, constants, check_context_exhaustion(), summarize_loop_state(), get_context_usage_pct()
Task 2: Source the new module in sw-loop.sh (near line 28 with other lib sources)
Task 3: Add context exhaustion check in main loop after accumulate_loop_tokens call (~line 2166 in run_single_agent_loop)
Task 4: Handle context_exhaustion status in run_loop_with_restarts() — allow restart with summary injection
Task 5: Add loop.context_exhaustion_warning and loop.context_exhaustion_restart event emissions
Task 6: Emit loop.context_usage event per iteration with cumulative token usage percentage
Task 7: Add threshold calculation unit tests to sw-loop-test.sh
Task 8: Add summarization output unit tests to sw-loop-test.sh
Task 9: Add restart trigger integration test to sw-loop-test.sh
Task 10: Verify existing tests still pass after changes

Testing Approach

Test Pyramid Breakdown

Unit tests (7): Threshold math at boundary values (<70%, =70%, >70%), summarization output validation (4 field checks), context window default/override
Integration tests (2): Full loop restart on context exhaustion, event emission verification
E2E tests (1): Existing sw-loop-test.sh regression (no breakage)

Coverage Targets

100% branch coverage on check_context_exhaustion() (3 branches: under/at/over threshold)
100% coverage on summarize_loop_state() output fields
Existing test suite remains green

Critical Paths to Test

Happy path: Loop runs under threshold, no summarization triggered
Error case 1: Tokens exceed 70% threshold mid-loop — summarization fires, loop breaks gracefully
Error case 2: Tokens exceed threshold on first iteration (huge prompt) — handled without crash
Edge case 1: Zero tokens reported (jq unavailable) — no false positive trigger
Edge case 2: CONTEXT_WINDOW_TOKENS set to 0 — division by zero protection

Risk Analysis

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Token counts are per-iteration, not cumulative conversation context | Underestimates true usage | Accumulate across iterations; use conservative 70% threshold |
| False positive triggers (threshold too aggressive) | Unnecessary restarts | Make threshold configurable via env/config; default 70% is conservative |
| Summary too lossy, critical context dropped | Regression after restart | Include: goal, files modified, error patterns, test status, last 5 log entries |
| Division by zero if CONTEXT_WINDOW_TOKENS=0 | Script crash | Guard with `[[ "$window" -gt 0 ]]` check |

Definition of Done

check_context_exhaustion() correctly identifies when cumulative tokens reach or exceed the configured threshold (default: 70% of the context window)
summarize_loop_state() produces compressed state with: goal, iteration count, modified files, error patterns, test status
Loop continues seamlessly after summarization-triggered restart without losing critical context
loop.context_exhaustion_warning event emitted when threshold crossed (observable in events.jsonl)
loop.context_exhaustion_restart event emitted when restart occurs
Per-iteration loop.context_usage event includes cumulative token percentage
All new code has test coverage (threshold boundaries, summarization output, restart trigger)
Existing test suite passes without regression
Bash 3.2 compatible (no associative arrays, no ${var,,})
Uses set -euo pipefail and module guard pattern

Threat Model (STRIDE)

Spoofing/Tampering/Repudiation/Elevation: Not applicable — this is internal shell logic, no auth or external input
Information Disclosure: context-summary.md contains goal/error text — same sensitivity as existing progress.md. No secrets involved.
Denial of Service: False positive summarization could cause unnecessary restarts. Mitigated by configurable threshold and conservative default.

Auth Flow

Not applicable — no authentication involved in this feature.

Input Validation Points

CONTEXT_WINDOW_TOKENS from env — validated as integer >0
CONTEXT_EXHAUSTION_THRESHOLD from env — validated as integer 1-99
Token values from accumulate_loop_tokens — already validated in existing code
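These checks could be implemented with a small helper such as the one below. The helper name, the fall-back-to-default behavior, and the upper bound used for the window size are all assumptions; the plan only states the accepted ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate_int_in_range <value> <min> <max> <fallback>
# Prints the value if it is a plain integer within [min, max], else the fallback.
validate_int_in_range() {
  local value="$1" min="$2" max="$3" fallback="$4"
  case "$value" in
    ''|*[!0-9]*)          # empty, negative, or non-numeric input
      echo "$fallback"
      return 0
      ;;
  esac
  if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
    echo "$fallback"
  else
    echo "$value"
  fi
}

# The upper bound of 100000000 is an arbitrary sanity limit for this sketch.
CONTEXT_WINDOW_TOKENS=$(validate_int_in_range "${CONTEXT_WINDOW_TOKENS:-}" 1 100000000 200000)
CONTEXT_EXHAUSTION_THRESHOLD=$(validate_int_in_range "${CONTEXT_EXHAUSTION_THRESHOLD:-}" 1 99 70)
```

Falling back silently keeps the loop running on a bad value; logging a warning when the fallback is used would make misconfiguration easier to spot.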

Security Checklist

No secrets in code
No external input from users (internal orchestration only)
No network calls added
No file path injection risk (all paths derived from existing LOG_DIR/PROJECT_ROOT)

Monitoring Checklist

P0 — Immediate

loop.context_exhaustion_warning event fires when expected (threshold crossed)
Loop does not crash or hang when summarization triggers

P1 — Short-term

loop.context_usage events show monotonically increasing token percentages within a session (counters reset after a context_exhaustion restart)
Restart after context exhaustion produces working sessions (not stuck loops)

Anomaly Detection Triggers

context_exhaustion_warning firing on iteration 1 = prompt too large or threshold misconfigured
Multiple consecutive context_exhaustion_restart events = possible infinite restart loop (guarded by MAX_RESTARTS)

Log Analysis

Search for "context_exhaustion" in events.jsonl
Verify context-summary.md written before each exhaustion restart
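A quick way to run this search is sketched below. It assumes events.jsonl holds one JSON object per line with the event name appearing verbatim in the line; the exact schema is not given in this plan, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# count_events <events file> <event name>: number of lines mentioning the event.
count_events() {
  if [ ! -f "$1" ]; then
    echo 0
    return 0
  fi
  # -F treats the name literally; grep -c exits non-zero on zero matches.
  grep -F -c "$2" "$1" || true
}

# report_context_exhaustion <events file>
report_context_exhaustion() {
  local warnings restarts
  warnings=$(count_events "$1" "loop.context_exhaustion_warning")
  restarts=$(count_events "$1" "loop.context_exhaustion_restart")
  echo "warnings=$warnings restarts=$restarts"
  # Every exhaustion restart should have been preceded by a warning.
  if [ "$restarts" -gt "$warnings" ]; then
    echo "anomaly: more restarts than warnings"
    return 1
  fi
}
```

A restart count close to MAX_RESTARTS in a single run is the signal to look for when investigating the restart-loop anomaly described above.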

Auto-Rollback Decision Criteria

Not applicable — this is build infrastructure, not a deployed service.

Systematic Debugging Notes

Root Cause Hypothesis (for potential failures)

Token accumulation undercount — Claude CLI may not report all tokens. Likelihood: medium. Evidence: compare loop-tokens.json totals vs Claude dashboard.
Threshold too aggressive — 70% may trigger too early for small tasks. Likelihood: low. Evidence: check if context_exhaustion events fire on simple 2-iteration loops.
Summary injection bloats restart context — If summary is too large, it defeats the purpose. Likelihood: low. Mitigation: cap summary at 2000 chars.

Evidence to Gather

Token accumulation values across iterations (from existing loop.context_efficiency events)
Actual Claude context window sizes for different models

Fix Strategy

This is a new feature, not a retry of a failed approach. Building on proven patterns: module guard, emit_event, session restart.

Verification Plan

Run sw-loop-test.sh — all tests pass including new ones
Run npm test — full suite green
Manual verification: set CONTEXT_WINDOW_TOKENS=1000, run loop, confirm early summarization trigger
