Pipeline Plan 154

ezigus edited this page Mar 15, 2026 · 1 revision

Implementation Plan: Build Loop Context Exhaustion Prevention

Alternatives Considered

Approach A: Token-based threshold monitoring with proactive summarization (CHOSEN)

  • Monitor accumulated LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS per iteration against a configurable threshold (70% of model context window)
  • When threshold is crossed, generate a compressed state summary and trigger a session restart with that summary injected
  • Pros: Uses existing token tracking infrastructure, leverages existing session restart mechanism, minimal new code
  • Blast radius: Small — adds a new check in the main loop, a new summarization function, and a new module file
  • Trade-offs: Token counts from Claude CLI are reported per iteration (not as cumulative conversation context), so cumulative context usage is an estimate obtained by summing them across iterations

Approach B: Character-based prompt size tracking only

  • Only track prompt character growth across iterations
  • Pros: Simpler, no new dependency on token math
  • Cons: Doesn't account for Claude's internal conversation context accumulation; prompt size alone doesn't reflect total context usage
  • Rejected: Insufficient — the real exhaustion happens in Claude's conversation context, not just our injected prompt

Approach C: Claude CLI --max-tokens monitoring via stderr parsing

  • Parse Claude CLI stderr for context limit warnings
  • Pros: Most accurate
  • Cons: Fragile (depends on CLI output format), reactive not proactive
  • Rejected: We want proactive prevention, not reactive recovery

Design Decision

Approach A — Track cumulative token usage across iterations and proactively trigger summarization + session restart at 70% of the context window. This builds on:

  1. Existing accumulate_loop_tokens() in sw-loop.sh (already tracks per-iteration tokens)
  2. Existing run_loop_with_restarts() session restart mechanism
  3. Existing emit_event telemetry system

The key insight: each Claude CLI invocation receives the full conversation context, so the total token count grows as iterations accumulate. We track cumulative input and output tokens (LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS) as a proxy for context window usage and trigger summarization before the limit is reached.
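As a rough illustration of this accumulation, the sketch below sums per-iteration token counts and reports the first iteration at which the running total reaches 70% of a 200,000-token window. The per-iteration figures and the function name `first_iteration_over_threshold` are made up for illustration; only the 200k window and the 70% threshold come from this plan.

```shell
#!/usr/bin/env bash
# Sketch only: the per-iteration token counts passed in are hypothetical.
set -euo pipefail

# Print the 1-based iteration at which the running total first reaches
# threshold_pct percent of window; print 0 if it never does.
first_iteration_over_threshold() {
    local window="$1" threshold_pct="$2"
    shift 2
    local total=0 iteration=0 tokens
    for tokens in "$@"; do
        iteration=$((iteration + 1))
        total=$((total + tokens))
        # Multiply before dividing: bash arithmetic is integer-only.
        if [ $((total * 100 / window)) -ge "$threshold_pct" ]; then
            echo "$iteration"
            return 0
        fi
    done
    echo 0
}

# Five iterations of 30k, 35k, 40k, 45k, 50k combined input+output tokens.
# Running totals: 30k, 65k, 105k, 150k (75% of 200k), so this prints 4.
first_iteration_over_threshold 200000 70 30000 35000 40000 45000 50000
```

With these numbers the loop would restart after the fourth iteration, well before the window is full, which is the intended safety margin.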

Files to Modify

| File | Action | Purpose |
|------|--------|---------|
| scripts/lib/loop-context-monitor.sh | CREATE | New module: context threshold monitoring + state summarization |
| scripts/sw-loop.sh | MODIFY | Source new module, add context check in main loop, wire summarization into restart |
| scripts/lib/loop-iteration.sh | MODIFY | Add cumulative token tracking variable, emit enhanced context metrics |
| scripts/sw-loop-test.sh | MODIFY | Add test cases for context monitoring and summarization |

Implementation Steps

Step 1: Create scripts/lib/loop-context-monitor.sh

New module with these functions:

# Module guard (assumed form: skip re-sourcing if already loaded)
[[ -n "${_LOOP_CONTEXT_MONITOR_LOADED:-}" ]] && return 0
_LOOP_CONTEXT_MONITOR_LOADED=1

# Constants
CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}            # Default: 200k (Opus/Sonnet)
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}  # Trigger at 70%
CONTEXT_SUMMARIZATION_TRIGGERED=false

# check_context_exhaustion()
#   - Computes cumulative_tokens = LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS
#   - Computes usage_pct = cumulative_tokens * 100 / CONTEXT_WINDOW_TOKENS
#     (multiply first: bash arithmetic is integer-only, so dividing first yields 0)
#   - If usage_pct >= CONTEXT_EXHAUSTION_THRESHOLD:
#       - Emits loop.context_exhaustion_warning event
#       - Returns 0 (threshold crossed)
#   - Else returns 1 (safe)

# summarize_loop_state()
#   - Writes compressed state to $LOG_DIR/context-summary.md:
#       - Goal (original, not accumulated)
#       - Iteration count and test status
#       - Files modified (from git diff --name-only LOOP_START_COMMIT..HEAD)
#       - Last error summary (from error-summary.json)
#       - Key fixes attempted (from log entries, last 5)
#       - Test results status
#   - Returns path to summary file

# get_context_usage_pct()
#   - Returns current context usage as integer percentage
#   - Used by telemetry and logging
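
One possible shape for the two threshold helpers is sketched below. It is not the module's final code: `emit_event` is replaced by a stand-in because its real signature is not shown in this plan, and the `usage_pct=...` payload format is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of the threshold helpers described above (Bash 3.2 compatible).
set -euo pipefail

CONTEXT_WINDOW_TOKENS=${CONTEXT_WINDOW_TOKENS:-200000}
CONTEXT_EXHAUSTION_THRESHOLD=${CONTEXT_EXHAUSTION_THRESHOLD:-70}
LOOP_INPUT_TOKENS=${LOOP_INPUT_TOKENS:-0}
LOOP_OUTPUT_TOKENS=${LOOP_OUTPUT_TOKENS:-0}

# Hypothetical stand-in for the existing telemetry function.
emit_event() { echo "event: $*" >&2; }

# Print context usage as an integer percentage (0 when the window is invalid).
get_context_usage_pct() {
    local window="$CONTEXT_WINDOW_TOKENS"
    local cumulative=$((LOOP_INPUT_TOKENS + LOOP_OUTPUT_TOKENS))
    # A zero or non-numeric window falls through to the else branch,
    # which avoids a division by zero.
    if [ "$window" -gt 0 ] 2>/dev/null; then
        echo $((cumulative * 100 / window))
    else
        echo 0
    fi
}

# Return 0 when usage has reached the threshold, 1 otherwise.
check_context_exhaustion() {
    local usage_pct
    usage_pct=$(get_context_usage_pct)
    if [ "$usage_pct" -ge "$CONTEXT_EXHAUSTION_THRESHOLD" ]; then
        emit_event "loop.context_exhaustion_warning" "usage_pct=$usage_pct"
        return 0
    fi
    return 1
}
```

Treating an invalid window as 0% usage means a misconfigured `CONTEXT_WINDOW_TOKENS` disables the check rather than crashing the loop; failing loudly instead would be an equally defensible choice.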

Step 2: Wire into main loop (sw-loop.sh)

  1. Source the new module near the top (after other lib sources)
  2. After accumulate_loop_tokens() call in the main loop (~line 2166), add:
    # Context exhaustion prevention
    if check_context_exhaustion; then
        warn "Context usage at $(get_context_usage_pct)%: triggering proactive summarization"
        summarize_loop_state >/dev/null  # prints the summary path; not needed here
        STATUS="context_exhaustion"
        write_state
        write_progress
        break  # Exit to restart wrapper
    fi
  3. In run_loop_with_restarts(), handle STATUS="context_exhaustion" as a restart-worthy condition (alongside "stuck_restart")
  4. When restarting after context exhaustion, inject the summary from context-summary.md into the restart context

Step 3: Enhanced token tracking in loop-iteration.sh

Add to run_claude_iteration() after the accumulate_loop_tokens call:

  • Emit loop.context_usage event with: cumulative_input, cumulative_output, usage_pct, threshold
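
A minimal sketch of how the four fields might be assembled before being handed to the telemetry layer. The JSON payload shape, the helper name `context_usage_payload`, and the two-argument `emit_event` stand-in are all assumptions; the real `emit_event` may take its fields differently.

```shell
#!/usr/bin/env bash
# Sketch only: payload shape and the emit_event stand-in are assumptions.
set -euo pipefail

# Hypothetical stand-in; the real emit_event lives in the existing codebase.
emit_event() { printf '%s %s\n' "$1" "$2"; }

# Build the four fields named in Step 3 as a single JSON object.
context_usage_payload() {
    local input="$1" output="$2" window="$3" threshold="$4"
    local usage_pct=0
    if [ "$window" -gt 0 ]; then
        usage_pct=$(( (input + output) * 100 / window ))
    fi
    printf '{"cumulative_input":%d,"cumulative_output":%d,"usage_pct":%d,"threshold":%d}' \
        "$input" "$output" "$usage_pct" "$threshold"
}

# 90k input + 20k output against a 200k window is 55% usage.
emit_event "loop.context_usage" "$(context_usage_payload 90000 20000 200000 70)"
```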

Step 4: Summarization state preservation

In summarize_loop_state():

  • Read ORIGINAL_GOAL (already preserved in sw-loop.sh)
  • Read the list of modified files with git diff --name-only LOOP_START_COMMIT..HEAD
  • Extract error patterns from error-summary.json
  • Extract last 5 log entries from LOG_ENTRIES
  • Write to $LOG_DIR/context-summary.md in structured format
  • The restart mechanism already copies error-summary.json and reads progress.md — we add context-summary.md to the restart context injection
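
The steps above could be sketched roughly as follows. The field labels, the `ITERATION` and `TEST_PASSED` variable names, the 500-character error excerpt, and the assumption that `LOG_ENTRIES` is newline-separated text are all illustrative; the 2000-character cap follows the mitigation mentioned under Systematic Debugging Notes. The error summary is copied as raw text because its JSON schema is not specified here.

```shell
#!/usr/bin/env bash
# Sketch only: layout and several variable names are assumptions.
set -euo pipefail

summarize_loop_state() {
    local summary="$LOG_DIR/context-summary.md"
    local draft="$summary.tmp"
    local files="(unavailable)" errors="(none recorded)"

    # Files touched since the loop started, if we are inside a git work tree.
    if [ -n "${LOOP_START_COMMIT:-}" ] && git rev-parse --git-dir >/dev/null 2>&1; then
        files=$(git diff --name-only "$LOOP_START_COMMIT"..HEAD 2>/dev/null) || files="(unavailable)"
    fi
    # Raw excerpt of the error summary (schema unknown, so no JSON parsing).
    if [ -f "$LOG_DIR/error-summary.json" ]; then
        errors=$(head -c 500 "$LOG_DIR/error-summary.json")
    fi

    {
        echo "Goal: ${ORIGINAL_GOAL:-unknown}"
        echo "Iteration: ${ITERATION:-0}"          # ITERATION is a hypothetical name
        echo "Test status: ${TEST_PASSED:-unknown}" # TEST_PASSED is a hypothetical name
        echo "Files modified:"
        echo "$files"
        echo "Last error summary:"
        echo "$errors"
        echo "Recent log entries:"
        printf '%s\n' "${LOG_ENTRIES:-}" | tail -n 5
    } > "$draft"

    # Cap the summary so it cannot bloat the restarted session's context.
    head -c 2000 "$draft" > "$summary"
    rm -f "$draft"
    echo "$summary"
}
```

Writing to a draft file and truncating afterwards avoids piping the whole block through `head`, which could terminate the writer early under `set -o pipefail`.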

Step 5: Restart integration

In run_loop_with_restarts() (~line 2389):

  • Add context_exhaustion to the list of restartable statuses
  • When restarting after context_exhaustion:
    • Inject context-summary.md content into GOAL as "## Previous Session Context (Summarized)"
    • Reset token counters
    • Emit loop.context_exhaustion_restart event
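
The restart handling might look roughly like the sketch below. It is a simplified stand-in for the real `run_loop_with_restarts()`: `run_single_agent_loop` is expected to exist in `sw-loop.sh` and to set `STATUS`, `emit_event` is stubbed, the helper names `is_restartable_status` and `prepare_context_exhaustion_restart` are hypothetical, and rebuilding `GOAL` from `ORIGINAL_GOAL` (so summaries do not stack across restarts) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: a reduced model of the restart wrapper, not the real one.
set -euo pipefail

MAX_RESTARTS=${MAX_RESTARTS:-3}
RESTART_COUNT=0
STATUS=""

# Hypothetical stand-in for the existing telemetry function.
emit_event() { :; }

# Return 0 if the given status should trigger a session restart.
is_restartable_status() {
    case "$1" in
        stuck_restart|context_exhaustion) return 0 ;;
        *) return 1 ;;
    esac
}

# Inject the saved summary into GOAL and zero the token counters.
prepare_context_exhaustion_restart() {
    local summary="$LOG_DIR/context-summary.md"
    if [ -f "$summary" ]; then
        GOAL="$ORIGINAL_GOAL

## Previous Session Context (Summarized)
$(cat "$summary")"
    fi
    LOOP_INPUT_TOKENS=0
    LOOP_OUTPUT_TOKENS=0
    emit_event "loop.context_exhaustion_restart" "restart=$RESTART_COUNT"
}

run_loop_with_restarts_sketch() {
    while :; do
        run_single_agent_loop   # provided elsewhere; sets STATUS before returning
        if ! is_restartable_status "$STATUS" || [ "$RESTART_COUNT" -ge "$MAX_RESTARTS" ]; then
            break
        fi
        RESTART_COUNT=$((RESTART_COUNT + 1))
        if [ "$STATUS" = "context_exhaustion" ]; then
            prepare_context_exhaustion_restart
        fi
    done
}
```

The `MAX_RESTARTS` check is what keeps repeated context-exhaustion restarts from turning into an infinite loop.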

Step 6: Test coverage

Add to sw-loop-test.sh:

  1. Unit test (threshold calculation): Set LOOP_INPUT_TOKENS and LOOP_OUTPUT_TOKENS to known values and verify that check_context_exhaustion returns 1 below 70% and 0 at or above 70%
  2. Unit test (summarization output): Create mock state (git, error-summary.json, log entries), run summarize_loop_state, and verify the output contains the essential fields
  3. Unit test (context window sizing): Verify the CONTEXT_WINDOW_TOKENS default and its override via env
  4. Integration test (restart trigger): Simulate tokens exceeding the threshold, then verify the loop breaks with context_exhaustion status and emits the correct event
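
A boundary test along these lines could look like the sketch below. `sw-loop-test.sh` has its own helpers that are not shown in this plan, so `assert_status` is a hypothetical stand-in, and a compact version of the function under test is inlined to keep the example self-contained.

```shell
#!/usr/bin/env bash
# Sketch only: assert_status is a hypothetical helper, not the suite's own.
set -euo pipefail

# Compact inline version of the function under test.
check_context_exhaustion() {
    local window="${CONTEXT_WINDOW_TOKENS:-200000}"
    local threshold="${CONTEXT_EXHAUSTION_THRESHOLD:-70}"
    local total=$(( ${LOOP_INPUT_TOKENS:-0} + ${LOOP_OUTPUT_TOKENS:-0} ))
    [ "$window" -gt 0 ] || return 1
    [ $((total * 100 / window)) -ge "$threshold" ]
}

PASS=0
FAIL=0

# assert_status <expected-exit-code> <description> <command...>
assert_status() {
    local expected="$1" desc="$2" actual=0
    shift 2
    "$@" || actual=$?
    if [ "$actual" -eq "$expected" ]; then
        PASS=$((PASS + 1))
    else
        FAIL=$((FAIL + 1))
        echo "FAIL: $desc (expected $expected, got $actual)"
    fi
}

# A small window keeps the boundary arithmetic easy to read.
CONTEXT_WINDOW_TOKENS=1000
CONTEXT_EXHAUSTION_THRESHOLD=70
LOOP_OUTPUT_TOKENS=0

LOOP_INPUT_TOKENS=699; assert_status 1 "below threshold" check_context_exhaustion
LOOP_INPUT_TOKENS=700; assert_status 0 "at threshold" check_context_exhaustion
LOOP_INPUT_TOKENS=701; assert_status 0 "above threshold" check_context_exhaustion
CONTEXT_WINDOW_TOKENS=0; assert_status 1 "zero window" check_context_exhaustion

echo "passed=$PASS failed=$FAIL"
```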

Task Checklist

  • Task 1: Create scripts/lib/loop-context-monitor.sh with module guard, constants, check_context_exhaustion(), summarize_loop_state(), get_context_usage_pct()
  • Task 2: Source the new module in sw-loop.sh (near line 28 with other lib sources)
  • Task 3: Add context exhaustion check in main loop after accumulate_loop_tokens call (~line 2166 in run_single_agent_loop)
  • Task 4: Handle context_exhaustion status in run_loop_with_restarts() — allow restart with summary injection
  • Task 5: Add loop.context_exhaustion_warning and loop.context_exhaustion_restart event emissions
  • Task 6: Emit loop.context_usage event per iteration with cumulative token usage percentage
  • Task 7: Add threshold calculation unit tests to sw-loop-test.sh
  • Task 8: Add summarization output unit tests to sw-loop-test.sh
  • Task 9: Add restart trigger integration test to sw-loop-test.sh
  • Task 10: Verify existing tests still pass after changes

Testing Approach

Test Pyramid Breakdown

  • Unit tests (7): Threshold math at boundary values (<70%, =70%, >70%), summarization output validation (4 field checks), context window default/override
  • Integration tests (2): Full loop restart on context exhaustion, event emission verification
  • E2E tests (1): Existing sw-loop-test.sh regression (no breakage)

Coverage Targets

  • 100% branch coverage on check_context_exhaustion() (exercised by 3 cases: under, at, and over threshold)
  • 100% coverage on summarize_loop_state() output fields
  • Existing test suite remains green

Critical Paths to Test

  • Happy path: Loop runs under threshold, no summarization triggered
  • Error case 1: Tokens exceed 70% threshold mid-loop — summarization fires, loop breaks gracefully
  • Error case 2: Tokens exceed threshold on first iteration (huge prompt) — handled without crash
  • Edge case 1: Zero tokens reported (jq unavailable) — no false positive trigger
  • Edge case 2: CONTEXT_WINDOW_TOKENS set to 0 — division by zero protection

Risk Analysis

| Risk | Impact | Mitigation |
|------|--------|------------|
| Token counts are per-iteration, not cumulative conversation context | Underestimates true usage | Accumulate across iterations; use conservative 70% threshold |
| False positive triggers (threshold too aggressive) | Unnecessary restarts | Make threshold configurable via env/config; default 70% is conservative |
| Summary too lossy: critical context dropped | Regression after restart | Include: goal, files modified, error patterns, test status, last 5 log entries |
| Division by zero if CONTEXT_WINDOW_TOKENS=0 | Script crash | Guard with [[ "$window" -gt 0 ]] check |

Definition of Done

  • check_context_exhaustion() correctly identifies when cumulative tokens reach or exceed the configured threshold (default 70% of the context window)
  • summarize_loop_state() produces compressed state with: goal, iteration count, modified files, error patterns, test status
  • Loop continues seamlessly after summarization-triggered restart without losing critical context
  • loop.context_exhaustion_warning event emitted when threshold crossed (observable in events.jsonl)
  • loop.context_exhaustion_restart event emitted when restart occurs
  • Per-iteration loop.context_usage event includes cumulative token percentage
  • All new code has test coverage (threshold boundaries, summarization output, restart trigger)
  • Existing test suite passes without regression
  • Bash 3.2 compatible (no associative arrays, no ${var,,})
  • Uses set -euo pipefail and module guard pattern

Threat Model (STRIDE)

  • Spoofing/Tampering/Repudiation/Elevation: Not applicable. This is internal shell logic with no auth and no external input beyond the environment variables listed under Input Validation Points
  • Information Disclosure: context-summary.md contains goal/error text — same sensitivity as existing progress.md. No secrets involved.
  • Denial of Service: False positive summarization could cause unnecessary restarts. Mitigated by configurable threshold and conservative default.

Auth Flow

Not applicable — no authentication involved in this feature.

Input Validation Points

  • CONTEXT_WINDOW_TOKENS from env — validated as integer >0
  • CONTEXT_EXHAUSTION_THRESHOLD from env — validated as integer 1-99
  • Token values from accumulate_loop_tokens — already validated in existing code
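
A minimal validation sketch for the two environment variables, written without Bash 4 features. Falling back to the default on invalid input and the upper bound used for `CONTEXT_WINDOW_TOKENS` are assumptions; the plan only states the accepted ranges. The helper names `is_uint` and `validated_int` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: fallback-to-default behavior is an assumption.
set -euo pipefail

# Return 0 if $1 is a non-empty string of digits (Bash 3.2 compatible).
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Echo $1 if it is an integer in [min, max]; otherwise echo the default.
validated_int() {
    local value="$1" min="$2" max="$3" default="$4"
    if is_uint "$value"; then
        value=$((10#$value))   # force base 10 so "070" is not read as octal
        if [ "$value" -ge "$min" ] && [ "$value" -le "$max" ]; then
            echo "$value"
            return 0
        fi
    fi
    echo "$default"
}

# The 2000000000 ceiling is an arbitrary illustrative bound.
CONTEXT_WINDOW_TOKENS=$(validated_int "${CONTEXT_WINDOW_TOKENS:-200000}" 1 2000000000 200000)
CONTEXT_EXHAUSTION_THRESHOLD=$(validated_int "${CONTEXT_EXHAUSTION_THRESHOLD:-70}" 1 99 70)
```

Validating once at module load means the later percentage arithmetic never sees a zero or non-numeric window.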

Security Checklist

  • No secrets in code
  • No external input from users (internal orchestration only)
  • No network calls added
  • No file path injection risk (all paths derived from existing LOG_DIR/PROJECT_ROOT)

Monitoring Checklist

P0 — Immediate

  • loop.context_exhaustion_warning event fires when expected (threshold crossed)
  • Loop does not crash or hang when summarization triggers

P1 — Short-term

  • loop.context_usage events show monotonically increasing token percentages within a session (counters reset after a context-exhaustion restart)
  • Restart after context exhaustion produces working sessions (not stuck loops)

Anomaly Detection Triggers

  • context_exhaustion_warning firing on iteration 1 = prompt too large or threshold misconfigured
  • Multiple consecutive context_exhaustion_restart events = possible infinite restart loop (guarded by MAX_RESTARTS)

Log Analysis

  • Search for "context_exhaustion" in events.jsonl
  • Verify context-summary.md written before each exhaustion restart
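
For the log search, a small counting helper might be enough. The sketch assumes events.jsonl holds one JSON object per line with the event name appearing verbatim in the line; the field layout is not specified in this plan, and `count_events` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: assumes one event per line with the name present verbatim.
set -euo pipefail

# Print how many lines in the given file mention the given event name.
count_events() {
    local file="$1" name="$2"
    # grep -c exits 1 when nothing matches, so tolerate that under set -e.
    grep -c -F "$name" "$file" || true
}

# Usage: pass the path to events.jsonl as the first argument.
if [ "$#" -ge 1 ] && [ -f "$1" ]; then
    echo "warnings: $(count_events "$1" "loop.context_exhaustion_warning")"
    echo "restarts: $(count_events "$1" "loop.context_exhaustion_restart")"
fi
```

A warning count noticeably higher than the restart count would suggest the threshold was crossed without a restart following, which is worth investigating.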

Auto-Rollback Decision Criteria

Not applicable — this is build infrastructure, not a deployed service.

Systematic Debugging Notes

Root Cause Hypothesis (for potential failures)

  1. Token accumulation undercount — Claude CLI may not report all tokens. Likelihood: medium. Evidence: compare loop-tokens.json totals vs Claude dashboard.
  2. Threshold too aggressive — 70% may trigger too early for small tasks. Likelihood: low. Evidence: check if context_exhaustion events fire on simple 2-iteration loops.
  3. Summary injection bloats restart context — If summary is too large, it defeats the purpose. Likelihood: low. Mitigation: cap summary at 2000 chars.

Evidence to Gather

  • Token accumulation values across iterations (from existing loop.context_efficiency events)
  • Actual Claude context window sizes for different models

Fix Strategy

This is a new feature, not a retry of a failed approach. Building on proven patterns: module guard, emit_event, session restart.

Verification Plan

  1. Run sw-loop-test.sh — all tests pass including new ones
  2. Run npm test — full suite green
  3. Manual verification: set CONTEXT_WINDOW_TOKENS=1000, run loop, confirm early summarization trigger
