Token reduction + routing reliability, provider/cost fixes, native ABI guard #73

Merged
veerareddyvishal144 merged 10 commits into main from feat/token-reduction-rtk
Jun 11, 2026
Commits
fa58992
feat(token-reduction): RTK filters, request bypass, MCP tool dedup, c...
Jun 9, 2026
9cf8f64
fix(routing): de-escalate risk false-positives and add session→provid...
Jun 9, 2026
7dabc47
fix(orchestrator): pass client tools through, wire bypass/dedup/affinity
Jun 9, 2026
735eb46
fix(providers,telemetry): Moonshot k2 params, cost capture, native AB...
Jun 9, 2026
dc948c9
feat(dashboard): tier-aware Configured Providers panel
Jun 9, 2026
320e5ee
fix(model-registry): deterministic cost resolution; drop fuzzy substr...
Jun 9, 2026
7e7b643
docs: sync benchmark numbers from BENCHMARK_REPORT, document new feat...
Jun 9, 2026
380d905
docs(token-optimization): add Phase 7 — tool-result compression (RTK ...
Jun 9, 2026
3c41c4f
chore(install): fix port to 8081, surface native-module status
Jun 9, 2026
97c5e0b
docs(homepage): bump displayed version to 9.4.6 (latest npm)
Jun 10, 2026
11 changes: 11 additions & 0 deletions .env.example
@@ -445,6 +445,17 @@ TOON_MIN_BYTES=4096
TOON_FAIL_OPEN=true
TOON_LOG_STATS=true

# Model price overrides: pin per-1M-token USD prices for models the pricing
# registry doesn't know (otherwise their cost is recorded as null/unknown).
# JSON object keyed by model name. Example:
# MODEL_PRICE_OVERRIDES={"my-model":{"input":0.5,"output":1.5}}

# Caveman terse-output injection (opt-in): append a brevity instruction to the
# system prompt to reduce OUTPUT tokens. Off by default — changes model style.
# Levels: lite | full | ultra
CAVEMAN_ENABLED=false
CAVEMAN_LEVEL=lite

# ==============================================================================
# Tiered Model Routing (REQUIRED)
# ==============================================================================
60 changes: 46 additions & 14 deletions README.md
@@ -545,6 +545,28 @@ TOOL_INJECTION_ENABLED=false
CODE_MODE_ENABLED=true
```

Always-on (no config): **smart tool selection** (server mode), **RTK tool-result
compression** (test/git/grep/lint/build/JSON output), **MCP tool dedup** (drops
built-in WebSearch/WebFetch when an Exa/Tavily MCP tool is present), and
**request bypass** (Claude CLI Warmup / title-extraction calls are answered
locally, never hitting a provider).
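
The MCP tool dedup rule described above can be sketched as a simple filter over the request's tool list. This is a minimal illustration, not Lynkr's actual code: the function name `dedupWebTools` and the regular expression used to recognize Exa/Tavily MCP tools are assumptions made for the example.

```javascript
// Hypothetical sketch of MCP tool dedup.
// Assumption: MCP tools are named like "mcp__<server>__<tool>" and the server
// name contains "exa" or "tavily". Lynkr's real matching rules may differ.
const BUILT_IN_WEB_TOOLS = new Set(["WebSearch", "WebFetch"]);
const MCP_SEARCH_PATTERN = /^mcp__.*(exa|tavily)/i;

function dedupWebTools(tools) {
  // Only drop the built-ins when an MCP search tool can take over.
  const hasMcpSearch = tools.some((tool) => MCP_SEARCH_PATTERN.test(tool.name));
  if (!hasMcpSearch) return tools;
  return tools.filter((tool) => !BUILT_IN_WEB_TOOLS.has(tool.name));
}
```

When no Exa/Tavily tool is present the list is returned unchanged, so the built-in web tools stay available.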

Optional **terse-output mode** to cut *output* tokens:
```bash
CAVEMAN_ENABLED=true # off by default — nudges the model to be concise
CAVEMAN_LEVEL=lite # lite | full | ultra
```

### Cost tracking & model pricing
Per-request cost is computed from a model-pricing registry (LiteLLM → models.dev,
cached 24h) and recorded in telemetry. Models the registry doesn't know record
`cost_usd=null` (logged once) rather than a fabricated price. Pin prices for
unknown models:
```bash
# Per-1M-token USD prices, JSON keyed by model name
MODEL_PRICE_OVERRIDES={"my-model":{"input":0.5,"output":1.5}}
```
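
As a rough illustration of how a per-1M-token override could turn into a per-request cost, the sketch below parses the JSON value and computes `cost_usd`, returning `null` for unknown models. The helper names `parsePriceOverrides` and `costUsd` are hypothetical and the registry lookup is omitted; only the override format comes from the README text above.

```javascript
// Hypothetical sketch: per-request cost from MODEL_PRICE_OVERRIDES.
// Assumption: prices are USD per 1M tokens, keyed by model name.
function parsePriceOverrides(raw) {
  if (!raw) return {};
  try {
    const parsed = JSON.parse(raw);
    const isPlainObject = parsed && typeof parsed === "object" && !Array.isArray(parsed);
    return isPlainObject ? parsed : {};
  } catch {
    return {}; // malformed JSON: behave as if no overrides were set
  }
}

function costUsd(model, inputTokens, outputTokens, overrides) {
  // Unknown model: record null rather than a made-up price.
  if (!Object.prototype.hasOwnProperty.call(overrides, model)) return null;
  const price = overrides[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

With the example override, one million input tokens plus one million output tokens for `my-model` would cost 0.5 + 1.5 = 2 USD.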

### Memory System (Titans-inspired)
```bash
MEMORY_ENABLED=true
@@ -652,35 +674,45 @@ npm start

## Benchmark Results

Measured on real agentic coding workloads (Claude Code / Cursor sessions) with Ollama, Moonshot, and Azure OpenAI backends. Run with `node benchmark-tier-routing.js`.
Head-to-head against **LiteLLM** on the **same backends** (Ollama `minimax-m2.5`, Moonshot, Azure OpenAI), 9 scenarios across 4 feature categories. Apples-to-apples comparison is Lynkr vs LiteLLM **billed tokens on the same scenario**. Run with `node benchmark-tier-routing.js`.

### Token compression
> _Run: June 5, 2026 · Lynkr v9.3.2 · LiteLLM v1.87.1 · macOS, Apple Silicon._

| Scenario | Tokens without Lynkr | Tokens with Lynkr | Reduction |
### Token reduction (vs LiteLLM, same model & prompt)

| Mechanism | Lynkr | LiteLLM | Result |
|---|---|---|---|
| 14-tool request (read task) | 1,042 | **547** | **47%** |
| 14-tool request (write task) | 1,043 | **412** | **60%** |
| Large JSON grep result (60 items) | 3,458 | **427** | **87.6%** |
| Smart tool selection (14 tools) | **959** tokens · $0.0044 | 2,085 tokens · $0.0091 | **53% fewer tokens, 52% cheaper** |
| TOON compression (60-item grep JSON) | **427** tokens · $0.009 | 3,458 tokens · $0.018 | **87.6% fewer tokens, 50% cheaper** |

Lynkr strips irrelevant tool schemas before forwarding (smart tool selection) and binary-compresses large JSON tool results (TOON) — both happen in-process with no added latency.
Lynkr strips irrelevant tool schemas (smart tool selection) and binary-compresses large JSON tool results (TOON) — both in-process, no added latency.

### Semantic cache

| | Tokens billed | Response time |
|---|---|---|
| First call (cold) | 2,857 | 1,891ms |
| **Second call — paraphrased, cache hit** | **0** | **171ms** |
| **Second call — paraphrased, cache hit** | **0** (served from cache) | **171ms (~11× faster)** |

Near-identical prompts return cached responses in 171ms. Zero tokens billed on a cache hit.
Near-identical prompts return cached responses in 171ms. Zero model tokens billed on a cache hit.

### Tier routing

| Request | Routed to |
|---|---|
| "What does git stash do?" | SIMPLE → local model (free) |
| JWT vs cookies security analysis | COMPLEX → cloud model (correct) |
| Request | Lynkr routes to | LiteLLM routes to |
|---|---|---|
| "What does git stash do?" | `minimax-m2.5` (local, free) | Ollama (local) |
| JWT vs cookies security analysis | `moonshot` (cloud — correct) | **Ollama (local — wrong call)** |

Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and escalates automatically. LiteLLM's `cost-based-routing` sends everything to the cheapest model regardless of complexity.

### Cost projection (100,000 requests/month, same backend)

| | Monthly cost | vs LiteLLM |
|---|---|---|
| LiteLLM | ~$818 | baseline |
| **Lynkr** | **~$409** | **~50% cheaper** |

Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and routes automatically. No caller changes needed.
_Based on a tool-heavy agentic session (TOON scenario). On equal footing — same provider, same model — Lynkr is cheaper due to token optimization._

→ [Full benchmark report with methodology](BENCHMARK_REPORT.md)

4 changes: 2 additions & 2 deletions docs/index.html
@@ -34,7 +34,7 @@
"description": "Self-hosted LLM gateway for Claude Code, Cursor, and Codex. Compresses tokens before they hit the model.",
"url": "https://github.com/Fast-Editor/Lynkr",
"downloadUrl": "https://www.npmjs.com/package/lynkr",
"softwareVersion": "9.3.2",
"softwareVersion": "9.4.6",
"author": { "@type": "Person", "name": "Vishal Veera Reddy", "url": "https://github.com/vishalveerareddy123" },
"offers": { "@type": "Offer", "price": "0", "priceCurrency": "USD" },
"keywords": "LLM gateway, Claude Code, Cursor, Ollama, AWS Bedrock, AI coding, self-hosted"
@@ -72,7 +72,7 @@
<div class="hero-grid">

<div class="hero-left">
<div class="hero-version">v9.3.2 — benchmarked in production</div>
<div class="hero-version">v9.4.6 — benchmarked in production</div>

<h1 class="hero-heading reveal">
The LLM gateway<br>
4 changes: 2 additions & 2 deletions docs/index.md
@@ -50,7 +50,7 @@
"description": "Self-hosted LLM gateway server that enables Claude Code, Cursor, and AI coding tools to work with any LLM provider with 60-80% cost reduction.",
"url": "https://github.com/Fast-Editor/Lynkr",
"downloadUrl": "https://www.npmjs.com/package/lynkr",
"softwareVersion": "9.3.2",
"softwareVersion": "9.4.6",
"author": {
"@type": "Person",
"name": "Vishal Veera Reddy",
@@ -107,7 +107,7 @@
<section class="hero">
<div class="hero-badge">
<span class="hero-badge-dot"></span>
<span>v9.3.2 — Production Ready</span>
<span>v9.4.6 — Production Ready</span>
</div>

<h1 class="hero-title">
57 changes: 55 additions & 2 deletions documentation/token-optimization.md
@@ -12,6 +12,7 @@ Lynkr reduces tokens sent to the model through multiple independent mechanisms.
|---|---|---|
| **Smart tool selection** | **47–60%** | 14-tool request (read or write task) |
| **TOON JSON compression** | **87.6%** | Large grep/file-read tool result (60-item array) |
| **Tool-result compression (RTK)** | up to **87.6%** | grep/test/git/lint/build/log/JSON tool output |
| **Semantic cache** | **100% on hit, 171ms** | Paraphrased repeat query |
| MCP Code Mode | **96%** | 100+ MCP tool schemas → 4 meta-tools |
| History compression | up to 80% | Long multi-turn sessions |
@@ -45,7 +46,7 @@ At 100,000 requests/month on a tool-heavy agentic workload, this translates to *

---

## 7 Optimization Phases
## Optimization Phases

### Phase 0: MCP Code Mode (96% reduction for MCP tools)

@@ -283,6 +284,58 @@ HISTORY_SUMMARIZE_OLDER=true # Summarize older turns (default: true)

---

### Phase 7: Tool-Result Compression (up to 87.6% on tool output)

**Problem:** Tool results dominate agentic token usage. A single `grep`, test run, `git diff`, or JSON API response can be thousands of tokens — most of it boilerplate the model doesn't need to reason over.

Lynkr compresses `tool_result` blocks **in-process before forwarding** (no added latency), via two complementary mechanisms.

#### 7a. RTK pattern compression

Detects the *shape* of a tool result and rewrites it to a compact, information-preserving summary. Each detector only fires when it recognizes the format; unrecognized text passes through unchanged.

| Detector | What it compresses | Example outcome |
|----------|--------------------|-----------------|
| `test_output` | jest/vitest/pytest/cargo/go test logs | Keep the summary line + failures, drop passing-test noise |
| `git_diff` | `git diff` | Per-file `+adds/-dels` with capped change lines |
| `git_status` | `git status` | Branch + staged/modified/untracked lists |
| `git_log` | `git log` | One line per commit (`<sha7> <subject> (author, date)`) |
| `lint_output` | eslint/tsc/ruff/clippy/biome | Counts grouped by rule, not every occurrence |
| `build_output` | npm/cargo/webpack | Errors + capped warnings + success line |
| `container_output` | docker/kubectl tables | Header + first N rows + "+M more" |
| `json_response` | large JSON objects | Structural skeleton (search/fetch results preserved) |
| `grep_output` | `grep`/`rg` (`file:line:content`) | Grouped by file, capped at 10 matches/file |
| `directory_listing` | `ls`/`find`/`tree` | Grouped by directory with counts |
| `large_file` | long source files | Imports + signatures skeleton |
| `dedup_log` | repetitive logs | Collapses consecutive duplicate lines |
| `smart_truncate` | very long unmatched output | Keeps head + tail, drops the middle |
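
To make one row of the table concrete, here is a minimal sketch of what a `dedup_log`-style pass could look like: consecutive identical lines collapse into a single line with a repeat count. The function name and the `(xN)` suffix are assumptions for illustration; the actual detector's output format is not shown in this document.

```javascript
// Hypothetical sketch of a dedup_log-style detector.
// Assumption: repeats are reported as "<line> (xN)"; the real format may differ.
function collapseDuplicateLines(text) {
  const out = [];
  let prev = null;
  let count = 0;

  // Emit the pending line, annotated when it repeated.
  const flush = () => {
    if (prev === null) return;
    out.push(count > 1 ? `${prev} (x${count})` : prev);
  };

  for (const line of text.split("\n")) {
    if (line === prev) {
      count += 1;
      continue;
    }
    flush();
    prev = line;
    count = 1;
  }
  flush();
  return out.join("\n");
}
```

Only consecutive duplicates are merged, so a line that reappears later in the log is kept as a separate entry and ordering is preserved.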

**Tier-aware thresholds** — compression only kicks in above a size that scales with the routing tier, so cheap models get aggressive compression and reasoning models get the full picture:

| Tier | Compress if result exceeds |
|------|----------------------------|
| SIMPLE | 300 chars |
| MEDIUM | 800 chars |
| COMPLEX | 2,000 chars |
| REASONING | never |
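
The threshold table above maps directly onto a small lookup. The sketch below is an assumption-labeled illustration: the constant and function names are hypothetical, and treating an unknown tier as "do not compress" is a choice made for the example rather than documented behavior.

```javascript
// Hypothetical sketch of the tier-aware size gate.
// Values are taken from the table above; names are illustrative only.
const COMPRESS_THRESHOLD_CHARS = {
  SIMPLE: 300,
  MEDIUM: 800,
  COMPLEX: 2000,
  REASONING: Infinity, // never compress
};

function shouldCompress(tier, resultText) {
  const limit = COMPRESS_THRESHOLD_CHARS[tier];
  // Assumption: an unrecognized tier leaves the result untouched.
  if (limit === undefined) return false;
  return resultText.length > limit;
}
```

Using `Infinity` for the REASONING tier keeps the comparison uniform, since no string length can exceed it.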

**Lossless recovery (tee):** the full original is stashed for 5 minutes and a pointer (`[full: tee_...]`) is appended to the compressed result. The model — or you — can fetch the original via `GET /tee/:id` if the detail is actually needed.
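
One way to picture the tee mechanism is an in-memory map with a five-minute expiry. The sketch below assumes that design. `createTeeStore`, `withTeePointer`, and the random suffix after `tee_` are hypothetical, and the real store behind `GET /tee/:id` may work differently.

```javascript
// Hypothetical sketch of a tee store with a 5-minute TTL.
// Assumption: entries live in memory and ids are "tee_" plus a random suffix.
const TEE_TTL_MS = 5 * 60 * 1000;

function createTeeStore(now = () => Date.now()) {
  const entries = new Map();
  return {
    // Save the full original and return the id used in the pointer.
    stash(original) {
      const id = `tee_${Math.random().toString(36).slice(2, 10)}`; // not cryptographically secure
      entries.set(id, { original, expiresAt: now() + TEE_TTL_MS });
      return id;
    },
    // Return the original, or null once the entry is missing or expired.
    fetch(id) {
      const entry = entries.get(id);
      if (!entry) return null;
      if (now() >= entry.expiresAt) {
        entries.delete(id);
        return null;
      }
      return entry.original;
    },
  };
}

// Append the recovery pointer to a compressed tool result.
function withTeePointer(compressed, id) {
  return `${compressed}\n[full: ${id}]`;
}
```

Passing the clock in as a parameter makes expiry easy to exercise in tests without waiting five minutes.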

Always on (no configuration). Metrics: `GET /metrics/tool-compression`.

#### 7b. TOON compression (binary JSON encoding)

For large JSON tool results (arrays of objects, API payloads), TOON re-encodes the structure into a far denser representation than pretty-printed JSON — **87.6% reduction** on a 60-item grep array in benchmarks. Plain text and small payloads are left untouched.

```bash
TOON_ENABLED=true # opt-in (default: false)
TOON_MIN_BYTES=4096 # only compress payloads larger than this
TOON_FAIL_OPEN=true # on any encode error, forward the original (default: true)
TOON_LOG_STATS=true # log per-call compression stats
```

---

### Phase 8: Headroom Context Compression (Optional, 47-92% reduction)

**Problem:** Even with all other optimizations, large requests can still exceed context limits.
@@ -308,7 +361,7 @@ HEADROOM_ENABLED=true

## Combined Savings

When all 8 phases work together:
When all phases work together:

**Example Request Flow:**

26 changes: 21 additions & 5 deletions install.sh
@@ -108,8 +108,24 @@ clone_or_update() {
install_dependencies() {
print_info "Installing dependencies..."
cd "$INSTALL_DIR"
npm install --production
# --omit=dev keeps optionalDependencies (better-sqlite3, hnswlib-node,
# tree-sitter) which back telemetry, the memory store and routing ML.
# The postinstall hook (scripts/check-native.js) verifies the native ABI
# and rebuilds if Node was upgraded — best-effort, never fails the install.
npm install --omit=dev
print_success "Dependencies installed"

# Native optional modules need a C/C++ toolchain only if no prebuilt binary
# is available for this platform. They degrade gracefully if absent.
if ! node -e "const D=require('better-sqlite3'); new D(':memory:').close()" >/dev/null 2>&1; then
print_warning "Native module 'better-sqlite3' is not loadable."
echo " Telemetry, the memory store and sessions need it. To enable:"
echo " - Ensure a build toolchain is present (Xcode CLT on macOS, build-essential + python3 on Linux), then:"
echo " - ${BLUE}cd $INSTALL_DIR && npm run rebuild-native${NC}"
echo " Lynkr still runs without it (those features stay disabled)."
else
print_success "Native modules OK (telemetry, memory, sessions enabled)"
fi
}

# Create default .env file
@@ -131,7 +147,7 @@ create_env_file() {
MODEL_PROVIDER=ollama

# Server Configuration
PORT=8080
PORT=8081

# Ollama Configuration (default for local development)
OLLAMA_MODEL=qwen2.5-coder:7b
@@ -161,7 +177,7 @@ EOF
print_info "📝 Configuration ready! Key settings:"
echo " • Default provider: Ollama (local, offline)"
echo " • Memory system: Enabled (learns from conversations)"
echo " • Port: 8080"
echo " • Port: 8081"
echo ""
print_warning "To use cloud providers (Databricks/OpenAI/Azure):"
echo " Edit: ${BLUE}nano $INSTALL_DIR/.env${NC}"
@@ -220,7 +236,7 @@ print_next_steps() {
echo " ${BLUE}lynkr${NC}"
echo ""
echo " 3. Configure Claude Code CLI:"
echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8080${NC}"
echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
echo " ${BLUE}claude${NC}"
echo ""
echo " ${YELLOW}Option B: Use Cloud Providers (Databricks/OpenAI/Azure)${NC}"
@@ -238,7 +254,7 @@ print_next_steps() {
echo " ${BLUE}lynkr${NC}"
echo ""
echo " 3. Configure Claude Code CLI:"
echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8080${NC}"
echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
echo " ${BLUE}export ANTHROPIC_API_KEY=any-non-empty-value${NC} ${GREEN}← Placeholder${NC}"
echo " ${BLUE}claude${NC}"
echo ""
4 changes: 3 additions & 1 deletion package.json
@@ -8,13 +8,15 @@
"lynkr-setup": "scripts/setup.js"
},
"scripts": {
"postinstall": "node scripts/check-native.js",
"rebuild-native": "node scripts/check-native.js",
"prestart": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom up -d --build headroom 2>/dev/null || echo 'Headroom skipped (disabled or Docker not running)'",
"start": "node index.js 2>&1 | npx pino-pretty --sync",
"stop": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom down || echo 'Headroom skipped (disabled or Docker not running)'",
"dev": "nodemon index.js",
"lint": "eslint src index.js",
"test": "npm run test:unit && npm run test:performance",
"test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js",
"test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js test/token-reduction.test.js test/session-affinity.test.js test/model-registry-cost.test.js",
"test:memory": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js",
"test:new-features": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js",
"test:performance": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/hybrid-routing-performance.test.js && DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/performance-tests.js",
14 changes: 13 additions & 1 deletion public/dashboard.html
@@ -244,17 +244,29 @@
const t = d.today;
const s = d.stats;

const tierLabel = t => t === 'default' ? 'default' : String(t).toLowerCase();
const providerCards = d.providers.length === 0
? `<p class="text-slate-500 text-sm">No providers configured</p>`
: d.providers.map(p => `
<div class="flex items-center justify-between bg-slate-700/50 rounded-lg px-4 py-3">
<div class="flex items-center gap-2">
<span class="status-dot ${providerDot(p.type)}"></span>
<span class="text-sm font-medium text-slate-200">${p.name}</span>
${(p.tiers || []).map(t => `<span class="badge bg-slate-600/60 text-slate-300">${tierLabel(t)}</span>`).join('')}
</div>
<span class="text-xs ${p.type === 'local' ? 'text-green-400' : 'text-blue-400'}">${p.type}</span>
</div>`).join('');

const providerWarnings = (d.providerWarnings || []).map(w => `
<div class="flex items-center justify-between bg-amber-500/10 border border-amber-500/30 rounded-lg px-4 py-3">
<div class="flex items-center gap-2">
<span class="text-amber-400 text-sm">⚠</span>
<span class="text-sm font-medium text-amber-200">${w.name}</span>
${(w.tiers || []).map(t => `<span class="badge bg-amber-500/20 text-amber-300">${tierLabel(t)}</span>`).join('')}
</div>
<span class="text-xs text-amber-400">no credentials</span>
</div>`).join('');

const recentRows = (d.recentRequests || []).map(r => `
<tr class="table-row border-b border-slate-700/50">
<td class="py-2 px-3 text-xs text-slate-500">${fmt.ago(r.timestamp)}</td>
@@ -279,7 +291,7 @@
<!-- Providers -->
${card(`
<h3 class="text-sm font-semibold text-slate-300 mb-3">Configured Providers</h3>
<div class="flex flex-col gap-2">${providerCards}</div>
<div class="flex flex-col gap-2">${providerCards}${providerWarnings}</div>
`)}

<!-- 24h Stats -->