Token reduction + routing reliability, provider/cost fixes, native ABI guard #73

@@ -445,6 +445,17 @@ TOON_MIN_BYTES=4096
		TOON_FAIL_OPEN=true
		TOON_LOG_STATS=true

	# Model price overrides: pin per-1M-token USD prices for models the pricing
	# registry doesn't know (otherwise their cost is recorded as null/unknown).
	# JSON object keyed by model name. Example:
	# MODEL_PRICE_OVERRIDES={"my-model":{"input":0.5,"output":1.5}}

	# Caveman terse-output injection (opt-in): append a brevity instruction to the
	# system prompt to reduce OUTPUT tokens. Off by default — changes model style.
	# Levels: lite | full | ultra
	CAVEMAN_ENABLED=false
	CAVEMAN_LEVEL=lite

		# ==============================================================================
		# Tiered Model Routing (REQUIRED)
		# ==============================================================================

60 changes: 46 additions & 14 deletions README.md

@@ -545,6 +545,28 @@ TOOL_INJECTION_ENABLED=false
		CODE_MODE_ENABLED=true
		```

	Always-on (no config): smart tool selection (server mode), **RTK tool-result
	compression (test/git/grep/lint/build/JSON output), MCP tool dedup** (drops
	built-in WebSearch/WebFetch when an Exa/Tavily MCP tool is present), and
	request bypass (Claude CLI Warmup / title-extraction calls are answered
	locally, never hitting a provider).
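
The MCP tool dedup rule described above could be sketched roughly as follows. This is an illustration only, not Lynkr's code: `dedupWebTools`, the `tools` array shape, and the name-matching regex are all assumptions (it assumes MCP tools are named `mcp__<server>__<tool>` and that the server segment contains "exa" or "tavily").

```javascript
// Illustrative sketch only; names and matching rule are assumptions.
const BUILT_IN_WEB_TOOLS = new Set(['WebSearch', 'WebFetch']);

// Assumption: MCP tools are named `mcp__<server>__<tool>` and the server
// segment contains "exa" or "tavily".
const MCP_SEARCH_TOOL = /^mcp__[^_]*(exa|tavily)/i;

// Drops the built-in web tools only when an equivalent MCP tool is present.
function dedupWebTools(tools) {
  const hasMcpSearch = tools.some((tool) => MCP_SEARCH_TOOL.test(tool.name));
  if (!hasMcpSearch) return tools; // nothing to dedup against
  return tools.filter((tool) => !BUILT_IN_WEB_TOOLS.has(tool.name));
}
```

When no Exa/Tavily tool is present the list is returned untouched, so the built-in tools stay available.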

	Optional terse-output mode to cut output tokens:
	```bash
	CAVEMAN_ENABLED=true # off by default — nudges the model to be concise
	CAVEMAN_LEVEL=lite # lite | full | ultra
	```

	### Cost tracking & model pricing
	Per-request cost is computed from a model-pricing registry (LiteLLM → models.dev,
	cached 24h) and recorded in telemetry. Models the registry doesn't know record
	`cost_usd=null` (logged once) rather than a fabricated price. Pin prices for
	unknown models:
	```bash
	# Per-1M-token USD prices, JSON keyed by model name
	MODEL_PRICE_OVERRIDES={"my-model":{"input":0.5,"output":1.5}}
	```
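
As a rough sketch of how per-1M-token prices turn into a per-request cost: the helper names, the `usage` field names, and the rule that an override takes precedence over the registry are assumptions for illustration, not taken from the Lynkr source.

```javascript
// Illustrative sketch only; function and field names are hypothetical.

// Parses the MODEL_PRICE_OVERRIDES env value; malformed JSON yields no overrides.
function parsePriceOverrides(raw) {
  if (!raw) return {};
  try {
    const parsed = JSON.parse(raw);
    const isPlainObject =
      parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed);
    return isPlainObject ? parsed : {};
  } catch {
    return {};
  }
}

// Prices are USD per 1M tokens. Returns null for models with no known price
// instead of inventing one.
function computeCostUsd(model, usage, registry, overrides) {
  const price = overrides[model] ?? registry[model]; // assumption: override wins
  if (!price) return null;
  const tokensPerUnit = 1_000_000;
  return (
    (usage.inputTokens * price.input + usage.outputTokens * price.output) /
    tokensPerUnit
  );
}
```

With the override from the example above, a request with 1M input and 2M output tokens would cost 0.5 + 3.0 = 3.5 USD.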

		### Memory System (Titans-inspired)
		```bash
		MEMORY_ENABLED=true
@@ -652,35 +674,45 @@ npm start

		## Benchmark Results

	Measured on real agentic coding workloads (Claude Code / Cursor sessions) with Ollama, Moonshot, and Azure OpenAI backends. Run with `node benchmark-tier-routing.js`.
	Head-to-head against LiteLLM on the same backends (Ollama `minimax-m2.5`, Moonshot, Azure OpenAI), 9 scenarios across 4 feature categories. Apples-to-apples comparison is Lynkr vs LiteLLM billed tokens on the same scenario. Run with `node benchmark-tier-routing.js`.

	### Token compression
	> _Run: June 5, 2026 · Lynkr v9.3.2 · LiteLLM v1.87.1 · macOS, Apple Silicon._

	| Scenario | Tokens without Lynkr | Tokens with Lynkr | Reduction |
	### Token reduction (vs LiteLLM, same model & prompt)

	| Mechanism | Lynkr | LiteLLM | Result |
		|---|---|---|---|
	| 14-tool request (read task) | 1,042 | 547 | 47% |
	| 14-tool request (write task) | 1,043 | 412 | 60% |
	| Large JSON grep result (60 items) | 3,458 | 427 | 87.6% |
	| Smart tool selection (14 tools) | 959 tokens · $0.0044 | 2,085 tokens · $0.0091 | 54% fewer tokens, 52% cheaper |
	| TOON compression (60-item grep JSON) | 427 tokens · $0.009 | 3,458 tokens · $0.018 | 87.6% fewer tokens, 50% cheaper |

	Lynkr strips irrelevant tool schemas before forwarding (smart tool selection) and binary-compresses large JSON tool results (TOON) — both happen in-process with no added latency.
	Lynkr strips irrelevant tool schemas (smart tool selection) and binary-compresses large JSON tool results (TOON) — both in-process, no added latency.

		### Semantic cache

		| | Tokens billed | Response time |
		|---|---|---|
		| First call (cold) | 2,857 | 1,891ms |
	| Second call — paraphrased, cache hit | 0 | 171ms |
	| Second call — paraphrased, cache hit | 0 (served from cache) | 171ms (~11× faster) |

	Near-identical prompts return cached responses in 171ms. Zero tokens billed on a cache hit.
	Near-identical prompts return cached responses in 171ms. Zero model tokens billed on a cache hit.

		### Tier routing

	| Request | Routed to |
	|---|---|
	| "What does git stash do?" | SIMPLE → local model (free) |
	| JWT vs cookies security analysis | COMPLEX → cloud model (correct) |
	| Request | Lynkr routes to | LiteLLM routes to |
	|---|---|---|
	| "What does git stash do?" | `minimax-m2.5` (local, free) | Ollama (local) |
	| JWT vs cookies security analysis | `moonshot` (cloud — correct) | Ollama (local — wrong call) |

	Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and escalates automatically. LiteLLM's `cost-based-routing` sends everything to the cheapest model regardless of complexity.

	### Cost projection (100,000 requests/month, same backend)

	| | Monthly cost | vs LiteLLM |
	|---|---|---|
	| LiteLLM | ~$818 | baseline |
	| Lynkr | ~$409 | ~50% cheaper |

	Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and routes automatically. No caller changes needed.
	_Based on a tool-heavy agentic session (TOON scenario). On equal footing — same provider, same model — Lynkr is cheaper due to token optimization._

		→ [Full benchmark report with methodology](BENCHMARK_REPORT.md)


4 changes: 2 additions & 2 deletions docs/index.html

@@ -34,7 +34,7 @@
		"description": "Self-hosted LLM gateway for Claude Code, Cursor, and Codex. Compresses tokens before they hit the model.",
		"url": "https://github.com/Fast-Editor/Lynkr",
		"downloadUrl": "https://www.npmjs.com/package/lynkr",
	"softwareVersion": "9.3.2",
	"softwareVersion": "9.4.6",
		"author": { "@type": "Person", "name": "Vishal Veera Reddy", "url": "https://github.com/vishalveerareddy123" },
		"offers": { "@type": "Offer", "price": "0", "priceCurrency": "USD" },
		"keywords": "LLM gateway, Claude Code, Cursor, Ollama, AWS Bedrock, AI coding, self-hosted"
@@ -72,7 +72,7 @@
		<div class="hero-grid">

		<div class="hero-left">
	<div class="hero-version">v9.3.2 — benchmarked in production</div>
	<div class="hero-version">v9.4.6 — benchmarked in production</div>

		<h1 class="hero-heading reveal">
		The LLM gateway<br>

4 changes: 2 additions & 2 deletions docs/index.md

@@ -50,7 +50,7 @@
		"description": "Self-hosted LLM gateway server that enables Claude Code, Cursor, and AI coding tools to work with any LLM provider with 60-80% cost reduction.",
		"url": "https://github.com/Fast-Editor/Lynkr",
		"downloadUrl": "https://www.npmjs.com/package/lynkr",
	"softwareVersion": "9.3.2",
	"softwareVersion": "9.4.6",
		"author": {
		"@type": "Person",
		"name": "Vishal Veera Reddy",
@@ -107,7 +107,7 @@
		<section class="hero">
		<div class="hero-badge">
		<span class="hero-badge-dot"></span>
	<span>v9.3.2 — Production Ready</span>
	<span>v9.4.6 — Production Ready</span>
		</div>

		<h1 class="hero-title">

57 changes: 55 additions & 2 deletions documentation/token-optimization.md

@@ -12,6 +12,7 @@ Lynkr reduces tokens sent to the model through multiple independent mechanisms.
		|---|---|---|
		| Smart tool selection | 47–60% | 14-tool request (read or write task) |
		| TOON JSON compression | 87.6% | Large grep/file-read tool result (60-item array) |
	| Tool-result compression (RTK) | up to 87.6% | grep/test/git/lint/build/log/JSON tool output |
		| Semantic cache | 100% on hit, 171ms | Paraphrased repeat query |
		| MCP Code Mode | 96% | 100+ MCP tool schemas → 4 meta-tools |
		| History compression | up to 80% | Long multi-turn sessions |
@@ -45,7 +46,7 @@ At 100,000 requests/month on a tool-heavy agentic workload, this translates to *

		---

	## 7 Optimization Phases
	## Optimization Phases

		### Phase 0: MCP Code Mode (96% reduction for MCP tools)

@@ -283,6 +284,58 @@ HISTORY_SUMMARIZE_OLDER=true # Summarize older turns (default: true)

		---

	### Phase 7: Tool-Result Compression (up to 87.6% on tool output)

	Problem: Tool results dominate agentic token usage. A single `grep`, test run, `git diff`, or JSON API response can be thousands of tokens — most of it boilerplate the model doesn't need to reason over.

	Lynkr compresses `tool_result` blocks in-process before forwarding (no added latency), via two complementary mechanisms.

	#### 7a. RTK pattern compression

	Detects the shape of a tool result and rewrites it to a compact, information-preserving summary. Each detector only fires when it recognizes the format; unrecognized text passes through unchanged.

	| Detector | What it compresses | Example outcome |
	|----------|--------------------|-----------------|
	| `test_output` | jest/vitest/pytest/cargo/go test logs | Keep the summary line + failures, drop passing-test noise |
	| `git_diff` | `git diff` | Per-file `+adds/-dels` with capped change lines |
	| `git_status` | `git status` | Branch + staged/modified/untracked lists |
	| `git_log` | `git log` | One line per commit (`<sha7> <subject> (author, date)`) |
	| `lint_output` | eslint/tsc/ruff/clippy/biome | Counts grouped by rule, not every occurrence |
	| `build_output` | npm/cargo/webpack | Errors + capped warnings + success line |
	| `container_output` | docker/kubectl tables | Header + first N rows + "+M more" |
	| `json_response` | large JSON objects | Structural skeleton (search/fetch results preserved) |
	| `grep_output` | `grep`/`rg` (`file:line:content`) | Grouped by file, capped at 10 matches/file |
	| `directory_listing` | `ls`/`find`/`tree` | Grouped by directory with counts |
	| `large_file` | long source files | Imports + signatures skeleton |
	| `dedup_log` | repetitive logs | Collapses consecutive duplicate lines |
	| `smart_truncate` | very long unmatched output | Keeps head + tail, drops the middle |
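
To make the detector idea concrete, here is a minimal sketch of what a `grep_output`-style detector might look like. The function name, the output layout, and the "+N more" suffix are assumptions for illustration; only the `file:line:content` input shape, the per-file grouping, and the cap of 10 matches per file come from the table above.

```javascript
// Illustrative sketch only; not Lynkr's actual detector.
function compressGrepOutput(text, maxPerFile = 10) {
  const inputLines = text.split('\n').filter((line) => line.length > 0);
  const groups = new Map(); // file path -> array of "lineNo: content"

  for (const line of inputLines) {
    const match = /^([^:]+):(\d+):(.*)$/.exec(line);
    // The detector only fires when every line has the grep shape;
    // anything else passes through unchanged.
    if (!match) return text;
    const [, file, lineNo, content] = match;
    if (!groups.has(file)) groups.set(file, []);
    groups.get(file).push(`${lineNo}: ${content.trim()}`);
  }
  if (groups.size === 0) return text;

  const out = [];
  for (const [file, matches] of groups) {
    out.push(`${file} (${matches.length} matches)`);
    for (const entry of matches.slice(0, maxPerFile)) out.push(`  ${entry}`);
    if (matches.length > maxPerFile) {
      out.push(`  +${matches.length - maxPerFile} more`);
    }
  }
  return out.join('\n');
}
```

The file path is emitted once per group instead of once per match, which is where most of the savings come from on large result sets.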

	Tier-aware thresholds — compression only kicks in above a size that scales with the routing tier, so cheap models get aggressive compression and reasoning models get the full picture:

	| Tier | Compress if result exceeds |
	|------|----------------------------|
	| SIMPLE | 300 chars |
	| MEDIUM | 800 chars |
	| COMPLEX | 2,000 chars |
	| REASONING | never |
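
The threshold table can be read as a simple lookup. The sketch below assumes a hypothetical `shouldCompress` helper and treats an unrecognized tier as "do not compress"; neither detail is confirmed by the source.

```javascript
// Illustrative sketch only; thresholds mirror the table above.
const COMPRESS_THRESHOLD_CHARS = {
  SIMPLE: 300,
  MEDIUM: 800,
  COMPLEX: 2000,
  REASONING: Infinity, // never compress for reasoning models
};

function shouldCompress(tier, toolResultText) {
  const threshold = COMPRESS_THRESHOLD_CHARS[tier];
  // Assumption: an unknown tier leaves the result untouched.
  if (threshold === undefined) return false;
  return toolResultText.length > threshold;
}
```

A 500-character result would therefore be compressed for a SIMPLE request but forwarded as-is for MEDIUM, COMPLEX, and REASONING requests.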

	Lossless recovery (tee): the full original is stashed for 5 minutes and a pointer (`[full: tee_...]`) is appended to the compressed result. The model — or you — can fetch the original via `GET /tee/:id` if the detail is actually needed.
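
A tee store of this kind could be sketched as a small in-memory map with a time-to-live. Everything here is hypothetical (`createTeeStore`, `withTeePointer`, and the UUID-based id); the source only states the 5-minute retention and the `[full: tee_...]` pointer shape.

```javascript
// Illustrative sketch only; the id format after "tee_" is an assumption.
const crypto = require('node:crypto');

function createTeeStore({ ttlMs = 5 * 60 * 1000, now = Date.now } = {}) {
  const entries = new Map(); // id -> { original, expiresAt }
  return {
    // Stores the uncompressed tool result and returns its id.
    stash(original) {
      const id = `tee_${crypto.randomUUID()}`;
      entries.set(id, { original, expiresAt: now() + ttlMs });
      return id;
    },
    // Returns the original, or undefined once the entry has expired.
    get(id) {
      const entry = entries.get(id);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(id);
        return undefined;
      }
      return entry.original;
    },
  };
}

// Appends the recovery pointer to the compressed result.
function withTeePointer(compressed, id) {
  return `${compressed}\n[full: ${id}]`;
}
```

Injecting `now` keeps the expiry logic testable without waiting five minutes; a handler for `GET /tee/:id` would simply call `get(id)` and return 404 on `undefined`.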

	Always on (no configuration). Metrics: `GET /metrics/tool-compression`.

	#### 7b. TOON compression (binary JSON encoding)

	For large JSON tool results (arrays of objects, API payloads), TOON re-encodes the structure into a far denser representation than pretty-printed JSON — 87.6% reduction on a 60-item grep array in benchmarks. Plain text and small payloads are left untouched.

	```bash
	TOON_ENABLED=true # opt-in (default: false)
	TOON_MIN_BYTES=4096 # only compress payloads larger than this
	TOON_FAIL_OPEN=true # on any encode error, forward the original (default: true)
	TOON_LOG_STATS=true # log per-call compression stats
	```
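
How these four settings interact might look like the sketch below. The `maybeToonEncode` wrapper and its injected `encode` function are hypothetical stand-ins for whatever encoder Lynkr actually calls; the defaults follow the comments in the config block above.

```javascript
// Illustrative sketch only; `encode` is a caller-supplied stand-in encoder.
function maybeToonEncode(payload, encode, env = process.env) {
  if (env.TOON_ENABLED !== 'true') return payload; // opt-in, default off

  const parsedMin = Number.parseInt(env.TOON_MIN_BYTES ?? '4096', 10);
  const minBytes = Number.isFinite(parsedMin) ? parsedMin : 4096;
  if (Buffer.byteLength(payload, 'utf8') <= minBytes) return payload; // too small

  let data;
  try {
    data = JSON.parse(payload);
  } catch {
    return payload; // plain text is left untouched
  }

  const failOpen = env.TOON_FAIL_OPEN !== 'false'; // default: true
  try {
    const encoded = encode(data);
    if (env.TOON_LOG_STATS === 'true') {
      console.log(`toon: ${payload.length} -> ${encoded.length} chars`);
    }
    return encoded;
  } catch (err) {
    if (failOpen) return payload; // forward the original on any encode error
    throw err;
  }
}
```

Failing open means a bug in the encoder degrades to "no compression" rather than a failed request, which is why it is the sensible default.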

	---

		### Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

		Problem: Even with all other optimizations, large requests can still exceed context limits.
@@ -308,7 +361,7 @@ HEADROOM_ENABLED=true

		## Combined Savings

	When all 8 phases work together:
	When all phases work together:

		Example Request Flow:


26 changes: 21 additions & 5 deletions install.sh

@@ -108,8 +108,24 @@ clone_or_update() {
		install_dependencies() {
		print_info "Installing dependencies..."
		cd "$INSTALL_DIR"
	npm install --production
	# --omit=dev keeps optionalDependencies (better-sqlite3, hnswlib-node,
	# tree-sitter) which back telemetry, the memory store and routing ML.
	# The postinstall hook (scripts/check-native.js) verifies the native ABI
	# and rebuilds if Node was upgraded — best-effort, never fails the install.
	npm install --omit=dev
		print_success "Dependencies installed"

	# Native optional modules need a C/C++ toolchain only if no prebuilt binary
	# is available for this platform. They degrade gracefully if absent.
	if ! node -e "const D=require('better-sqlite3'); new D(':memory:').close()" >/dev/null 2>&1; then
	print_warning "Native module 'better-sqlite3' is not loadable."
	echo " Telemetry, the memory store and sessions need it. To enable:"
	echo " - Ensure a build toolchain is present (Xcode CLT on macOS, build-essential + python3 on Linux), then:"
	echo " - ${BLUE}cd $INSTALL_DIR && npm run rebuild-native${NC}"
	echo " Lynkr still runs without it (those features stay disabled)."
	else
	print_success "Native modules OK (telemetry, memory, sessions enabled)"
	fi
		}

		# Create default .env file
@@ -131,7 +147,7 @@ create_env_file() {
		MODEL_PROVIDER=ollama

		# Server Configuration
	PORT=8080
	PORT=8081

		# Ollama Configuration (default for local development)
		OLLAMA_MODEL=qwen2.5-coder:7b
@@ -161,7 +177,7 @@ EOF
		print_info "📝 Configuration ready! Key settings:"
		echo " • Default provider: Ollama (local, offline)"
		echo " • Memory system: Enabled (learns from conversations)"
	echo " • Port: 8080"
	echo " • Port: 8081"
		echo ""
		print_warning "To use cloud providers (Databricks/OpenAI/Azure):"
		echo " Edit: ${BLUE}nano $INSTALL_DIR/.env${NC}"
@@ -220,7 +236,7 @@ print_next_steps() {
		echo " ${BLUE}lynkr${NC}"
		echo ""
		echo " 3. Configure Claude Code CLI:"
	echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8080${NC}"
	echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
		echo " ${BLUE}claude${NC}"
		echo ""
		echo " ${YELLOW}Option B: Use Cloud Providers (Databricks/OpenAI/Azure)${NC}"
@@ -238,7 +254,7 @@ print_next_steps() {
		echo " ${BLUE}lynkr${NC}"
		echo ""
		echo " 3. Configure Claude Code CLI:"
	echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8080${NC}"
	echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
		echo " ${BLUE}export ANTHROPIC_API_KEY=any-non-empty-value${NC} ${GREEN}← Placeholder${NC}"
		echo " ${BLUE}claude${NC}"
		echo ""

4 changes: 3 additions & 1 deletion package.json

@@ -8,13 +8,15 @@
		"lynkr-setup": "scripts/setup.js"
		},
		"scripts": {
	"postinstall": "node scripts/check-native.js",
	"rebuild-native": "node scripts/check-native.js",
		"prestart": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom up -d --build headroom 2>/dev/null \|\| echo 'Headroom skipped (disabled or Docker not running)'",
		"start": "node index.js 2>&1 \| npx pino-pretty --sync",
		"stop": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom down \|\| echo 'Headroom skipped (disabled or Docker not running)'",
		"dev": "nodemon index.js",
		"lint": "eslint src index.js",
		"test": "npm run test:unit && npm run test:performance",
	"test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js",
	"test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js test/token-reduction.test.js test/session-affinity.test.js test/model-registry-cost.test.js",
		"test:memory": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js",
		"test:new-features": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js",
		"test:performance": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/hybrid-routing-performance.test.js && DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/performance-tests.js",

14 changes: 13 additions & 1 deletion public/dashboard.html

@@ -244,17 +244,29 @@
		const t = d.today;
		const s = d.stats;

	const tierLabel = t => t === 'default' ? 'default' : String(t).toLowerCase();
		const providerCards = d.providers.length === 0
		? `<p class="text-slate-500 text-sm">No providers configured</p>`
		: d.providers.map(p => `
		<div class="flex items-center justify-between bg-slate-700/50 rounded-lg px-4 py-3">
		<div class="flex items-center gap-2">
		<span class="status-dot ${providerDot(p.type)}"></span>
		<span class="text-sm font-medium text-slate-200">${p.name}</span>
	${(p.tiers || []).map(t => `<span class="badge bg-slate-600/60 text-slate-300">${tierLabel(t)}</span>`).join('')}
		</div>
		<span class="text-xs ${p.type === 'local' ? 'text-green-400' : 'text-blue-400'}">${p.type}</span>
		</div>`).join('');

	const providerWarnings = (d.providerWarnings || []).map(w => `
	<div class="flex items-center justify-between bg-amber-500/10 border border-amber-500/30 rounded-lg px-4 py-3">
	<div class="flex items-center gap-2">
	<span class="text-amber-400 text-sm">⚠</span>
	<span class="text-sm font-medium text-amber-200">${w.name}</span>
	${(w.tiers || []).map(t => `<span class="badge bg-amber-500/20 text-amber-300">${tierLabel(t)}</span>`).join('')}
	</div>
	<span class="text-xs text-amber-400">no credentials</span>
	</div>`).join('');

		const recentRows = (d.recentRequests \|\| []).map(r => `
		<tr class="table-row border-b border-slate-700/50">
		<td class="py-2 px-3 text-xs text-slate-500">${fmt.ago(r.timestamp)}</td>
@@ -279,7 +291,7 @@
		<!-- Providers -->
		${card(`
		<h3 class="text-sm font-semibold text-slate-300 mb-3">Configured Providers</h3>
	<div class="flex flex-col gap-2">${providerCards}</div>
	<div class="flex flex-col gap-2">${providerCards}${providerWarnings}</div>
		`)}

		<!-- 24h Stats -->