Your LLM calls are too expensive, too slow, and too dumb. This fixes all three.
Drop-in middleware that compresses context, caches semantically, normalizes queries, self-heals bad responses, and routes to the cheapest model that meets your quality bar. Zero external API calls. No GPU. No framework lock-in.
Benchmark (177 queries, local TF-IDF, no API):

- Hit rate: 69%
- Token savings: 67%
- Latency: <1ms per query
- Monthly savings: ~$219/month at 50k queries/day
npm install llm-surgeon

    import { DecisionLayer } from "llm-surgeon";

    const brain = new DecisionLayer();

    // Before calling your LLM:
    const decision = await brain.decide({
      history: conversationMessages,
      currentInput: userInput,
      priority: "normal",
    });

    if (decision.cacheHit) {
      // Serve cached response: $0 cost, <1ms
      return decision.cachedResponse;
    }

    // decision.messages = compressed context (60-80% fewer tokens)
    // decision.recommendedModel = "fast" | "balanced" | "premium"
    const response = await yourLLM(decision.messages);

    // Record for cache + quality tracking
    await brain.recordResponse(userInput, response);

That's it. Your next similar question is free.
Every conversation grows. By exchange 20, you're sending 15k tokens for what should be 1.5k. llm-surgeon reconstructs context with five layers:
| Layer | Function | Typical tokens |
|---|---|---|
| Working Memory | Last N raw exchanges | 500-800 |
| Episodic Summary | Compressed older history | 150-400 |
| Fact Store | Extracted entities (name, budget, tech stack) | 30-80 |
| Intent Snapshot | What the user wants right now | 20-50 |
| Reconstructed Context | Final payload sent to LLM | 1.5k-2.5k |
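
To make the layering in the table concrete, here is a minimal sketch of how such a payload could be assembled. Everything in it is an assumption for illustration: `ChatMessage`, `estimateTokens` (a rough 4-characters-per-token guess), the truncation-based summary, and the `my <key> is <value>` regex are hypothetical stand-ins, not the library's actual implementation.

```typescript
// Illustrative sketch only: every name here is hypothetical, not llm-surgeon's API.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Rough token estimate (about 4 characters per token); an assumption for the sketch.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Working memory: the last `size` messages, kept verbatim.
function workingMemory(messages: ChatMessage[], size: number): ChatMessage[] {
  return size > 0 ? messages.slice(-size) : [];
}

// Episodic summary: older messages squeezed to their first 30 characters each.
function episodicSummary(older: ChatMessage[]): string {
  return older.map((m) => `${m.role}: ${m.content.slice(0, 30)}`).join(" | ");
}

// Fact store: naive regex extraction of "my <key> is <value>" from user turns.
function extractFacts(messages: ChatMessage[]): string[] {
  const facts: string[] = [];
  for (const m of messages) {
    if (m.role !== "user") continue;
    const pattern = /my (\w+) is ([\w$]+)/gi;
    let match = pattern.exec(m.content);
    while (match !== null) {
      const fact = `${String(match[1]).toLowerCase()}=${String(match[2])}`;
      if (facts.indexOf(fact) === -1) facts.push(fact);
      match = pattern.exec(m.content);
    }
  }
  return facts;
}

// Reconstructed context: summary, facts, and intent up front, then the raw recent turns.
function buildContext(messages: ChatMessage[], currentInput: string, size = 4) {
  const recent = workingMemory(messages, size);
  const older = messages.slice(0, messages.length - recent.length);
  const preamble = [
    older.length > 0 ? `Summary: ${episodicSummary(older)}` : "",
    `Facts: ${extractFacts(messages).join(", ")}`,
    `Intent: ${currentInput.slice(0, 120)}`,
  ].filter((line) => line.length > 0).join("\n");

  const payload: ChatMessage[] = [
    { role: "user", content: preamble },
    ...recent,
    { role: "user", content: currentInput },
  ];
  const totalTokens = payload.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return { messages: payload, totalTokens };
}

// Usage: a 20-message conversation collapses to 6 messages.
const demoHistory: ChatMessage[] = [];
for (let i = 1; i <= 10; i++) {
  demoHistory.push({ role: "user", content: `Question ${i}: my name is Ada and my budget is $5000` });
  demoHistory.push({
    role: "assistant",
    content: `Answer ${i}: Here is a detailed explanation that covers pricing tiers, usage limits, and upgrade paths so the history grows quickly.`,
  });
}
const demoFullTokens = demoHistory.reduce((sum, m) => sum + estimateTokens(m.content), 0);
const demoContext = buildContext(demoHistory, "Which plan fits my budget?");
```

The sketch puts the preamble in a single leading message for simplicity; a real reconstructor would more likely fold it into the system prompt.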
The cache detects similar questions using TF-IDF cosine similarity (or pluggable embedding providers). The query normalizer maps variants to canonical forms before lookup — "how to reduce LLM costs" and "comment diminuer les coûts LLM" hit the same cache entry.

- Without normalizer: 0% hit rate on variants
- With normalizer: 70% hit rate on the same variants

The normalizer adds zero extra cost.
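
The lookup described above can be sketched as a normalizer followed by TF-IDF cosine matching. This is a rough illustration under stated assumptions: the stopword list, the `CANONICAL` synonym table, and all function names are hypothetical, the IDF uses a common smoothed form, and the sketch handles English variants only (cross-language mapping is not shown).

```typescript
// Illustrative sketch: the synonym table, stopword list, and function names are
// hypothetical stand-ins, not llm-surgeon's actual normalizer or cache.
const STOPWORDS = ["how", "to", "can", "i", "my", "the", "a", "do", "what", "is"];
const CANONICAL: Array<[string[], string]> = [
  [["cut", "lower", "decrease"], "reduce"],
  [["costs", "spend", "bill"], "cost"],
];

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Normalizer: drop stopwords, then map each variant to its canonical term.
function normalize(text: string): string[] {
  return tokenize(text)
    .filter((t) => STOPWORDS.indexOf(t) === -1)
    .map((t) => {
      for (const [variants, canonical] of CANONICAL) {
        if (variants.indexOf(t) !== -1) return canonical;
      }
      return t;
    });
}

// TF-IDF weights for one document over a shared vocabulary.
function tfidf(doc: string[], vocab: string[], corpus: string[][]): number[] {
  return vocab.map((term) => {
    const tf = doc.filter((t) => t === term).length;
    const df = corpus.filter((d) => d.indexOf(term) !== -1).length;
    const idf = Math.log((1 + corpus.length) / (1 + df)) + 1; // smoothed IDF
    return tf * idf;
  });
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface CachedAnswer {
  prompt: string;
  response: string;
}

// Returns the best cached answer at or above the threshold, or null on a miss.
function lookup(
  entries: CachedAnswer[],
  query: string,
  threshold: number,
  useNormalizer: boolean,
): { response: string; similarity: number } | null {
  const prepare = useNormalizer ? normalize : tokenize;
  const corpus = entries.map((entry) => prepare(entry.prompt));
  const queryTokens = prepare(query);
  const vocab: string[] = [];
  for (const doc of corpus.concat([queryTokens])) {
    for (const t of doc) if (vocab.indexOf(t) === -1) vocab.push(t);
  }
  const queryVec = tfidf(queryTokens, vocab, corpus);
  let best: { response: string; similarity: number } | null = null;
  for (let i = 0; i < entries.length; i++) {
    const entry = entries[i];
    const doc = corpus[i];
    if (!entry || !doc) continue;
    const similarity = cosine(queryVec, tfidf(doc, vocab, corpus));
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { response: entry.response, similarity };
    }
  }
  return best;
}

// Usage: the same variant misses without the normalizer and hits with it.
const demoCache: CachedAnswer[] = [
  { prompt: "How to reduce LLM costs", response: "Compress context and cache answers." },
  { prompt: "What is the capital of France", response: "Paris." },
];
const variantQuery = "how can I lower my LLM spend";
const hitWithNormalizer = lookup(demoCache, variantQuery, 0.88, true);
const hitWithoutNormalizer = lookup(demoCache, variantQuery, 0.88, false);
```

With the normalizer on, both prompts reduce to the same three canonical terms, so the variant clears the 0.88 threshold; with it off, only two raw tokens overlap and the lookup misses.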
Bad responses don't stay cached. The quality estimator flags entries below threshold. When a better response arrives (via user feedback or a fresh LLM call), the cache replaces the old entry automatically:

    // User signals a bad cached response
    await brain.reportFeedback(prompt, betterResponse);
    // Cache heals: quality 0.42 → 0.91, old response preserved in log

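
One way to picture the healing step is a store that flags weak entries and swaps them out when a higher-scoring response shows up. The sketch below is an assumption-heavy illustration: `HealingCacheSketch` and its methods are hypothetical, lookups use exact prompt matching for brevity, and the two thresholds borrow the option names `autoFlagBelowQuality` and `maxHealCount` without claiming to match their real semantics.

```typescript
// Illustrative sketch: class, field, and method names are hypothetical. The option
// names are borrowed from the healing config; their exact semantics are assumed.
interface HealableEntry {
  prompt: string;
  response: string;
  quality: number;
  healCount: number;
  flagged: boolean;
}

interface HealingRecord {
  prompt: string;
  oldResponse: string;
  oldQuality: number;
  newQuality: number;
}

class HealingCacheSketch {
  private entries: HealableEntry[] = [];
  readonly healingLog: HealingRecord[] = [];

  constructor(
    private readonly autoFlagBelowQuality = 0.45,
    private readonly maxHealCount = 3,
  ) {}

  find(prompt: string): HealableEntry | null {
    for (const entry of this.entries) {
      if (entry.prompt === prompt) return entry;
    }
    return null;
  }

  // Store a response; low-quality entries are flagged so they can be replaced later.
  store(prompt: string, response: string, quality: number): void {
    this.entries.push({
      prompt,
      response,
      quality,
      healCount: 0,
      flagged: quality < this.autoFlagBelowQuality,
    });
  }

  // Replace an entry only when the new response scores higher; keep the old one in the log.
  heal(prompt: string, betterResponse: string, newQuality: number): boolean {
    const entry = this.find(prompt);
    if (entry === null) return false;
    if (entry.healCount >= this.maxHealCount) return false;
    if (newQuality <= entry.quality) return false;
    this.healingLog.push({
      prompt,
      oldResponse: entry.response,
      oldQuality: entry.quality,
      newQuality,
    });
    entry.response = betterResponse;
    entry.quality = newQuality;
    entry.healCount += 1;
    entry.flagged = newQuality < this.autoFlagBelowQuality;
    return true;
  }
}

// Usage: a 0.42 entry is flagged on store, then healed to 0.91.
const healingDemo = new HealingCacheSketch();
healingDemo.store("how do refunds work?", "Not sure.", 0.42);
const flaggedBeforeHeal = healingDemo.find("how do refunds work?")?.flagged ?? false;
const healedOk = healingDemo.heal("how do refunds work?", "Refunds are issued within 5 days.", 0.91);
const rejectedWorse = healingDemo.heal("how do refunds work?", "Maybe.", 0.5);
```

Capping the number of heals per entry is one way to stop a prompt from churning indefinitely when feedback is noisy.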
Each request is routed to the cheapest model that meets the quality bar. Requests with priority "critical" go to premium. Hitting the budget ceiling triggers a graceful downgrade. Every decision includes a human-readable reason:
model=balanced | priority=normal | compressed=-72% | nearest_cache=0.834
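
The routing rule can be sketched as "start at the cheapest tier that clears the quality floor, then step down while over budget". In this illustration the per-tier prices reuse the `tokenPriceFast`, `tokenPriceBalanced`, and `tokenPricePremium` values from the configuration example in this README, with the unit assumed to be USD per 1k tokens. The expected-quality numbers, the `route` function, and the choice to apply the budget ceiling to critical requests as well are all assumptions.

```typescript
// Illustrative sketch: expected quality values are invented, and the price unit
// (USD per 1k tokens) is an assumption. Not llm-surgeon's actual router.
type Tier = "fast" | "balanced" | "premium";

interface TierSpec {
  tier: Tier;
  pricePer1k: number;
  expectedQuality: number;
}

// Ordered cheapest first.
const TIERS: TierSpec[] = [
  { tier: "fast", pricePer1k: 0.00025, expectedQuality: 0.6 },
  { tier: "balanced", pricePer1k: 0.003, expectedQuality: 0.8 },
  { tier: "premium", pricePer1k: 0.015, expectedQuality: 0.95 },
];

interface RouteRequest {
  priority: "normal" | "critical";
  estimatedTokens: number;
  qualityFloor: number;
  budgetCeiling: number; // USD per request
}

interface RouteDecision {
  model: Tier;
  estimatedCost: number;
  downgraded: boolean;
  reason: string;
}

function route(req: RouteRequest): RouteDecision {
  const costOf = (spec: TierSpec) => (req.estimatedTokens / 1000) * spec.pricePer1k;

  // Critical requests start at premium; others at the cheapest tier meeting the floor.
  let index = TIERS.length - 1;
  if (req.priority !== "critical") {
    for (let i = 0; i < TIERS.length; i++) {
      const spec = TIERS[i];
      if (spec && spec.expectedQuality >= req.qualityFloor) {
        index = i;
        break;
      }
    }
  }

  // Graceful downgrade: step down one tier at a time while over budget.
  let downgraded = false;
  while (index > 0 && costOf(TIERS[index] as TierSpec) > req.budgetCeiling) {
    index -= 1;
    downgraded = true;
  }

  const chosen = TIERS[index] as TierSpec;
  const estimatedCost = costOf(chosen);
  const reason =
    `model=${chosen.tier} | priority=${req.priority} | est_cost=$${estimatedCost.toFixed(5)}` +
    (downgraded ? " | downgraded=budget" : "");
  return { model: chosen.tier, estimatedCost, downgraded, reason };
}

// Usage: same request, two budgets.
const roomyBudget = route({ priority: "normal", estimatedTokens: 1000, qualityFloor: 0.7, budgetCeiling: 0.01 });
const tightBudget = route({ priority: "normal", estimatedTokens: 1000, qualityFloor: 0.7, budgetCeiling: 0.001 });
const criticalCall = route({ priority: "critical", estimatedTokens: 1000, qualityFloor: 0.7, budgetCeiling: 0.05 });
```

If no tier meets the floor, the sketch falls back to premium rather than failing, which keeps the caller's code path simple.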
Every response is scored locally (no API call) on length, coherence, completeness, and truncation risk. The system decides: cache it, retry it, or upgrade the model.
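
A local scorer along these lines can be built from plain string heuristics. The sketch below is only an approximation of the idea: the weights, the 20-word length target, the repetition-based coherence proxy, and the names `scoreResponse` and `nextStep` are all hypothetical and say nothing about how the library's estimator is actually tuned.

```typescript
// Illustrative sketch: weights, thresholds, and names are hypothetical, not the
// scoring that llm-surgeon's estimator actually uses.
interface SketchSignals {
  lengthScore: number;
  completeness: number;
  coherence: number;
  truncationRisk: number;
  overallScore: number;
}

function toWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

function scoreResponse(prompt: string, response: string): SketchSignals {
  const words = toWords(response);

  // Length: very short answers score low; 20+ words counts as full marks.
  const lengthScore = Math.min(1, words.length / 20);

  // Completeness: share of longer prompt words that the response mentions.
  const keywords = toWords(prompt).filter((w) => w.length > 3);
  const covered = keywords.filter((k) => words.indexOf(k) !== -1).length;
  const completeness = keywords.length === 0 ? 1 : covered / keywords.length;

  // Coherence (crude proxy): heavy word repetition lowers the score.
  const unique = words.filter((w, i) => words.indexOf(w) === i).length;
  const coherence = words.length === 0 ? 0 : unique / words.length;

  // Truncation risk: a response that does not end on closing punctuation looks cut off.
  const truncationRisk = /[.!?)\]"']\s*$/.test(response) ? 0 : 1;

  const overallScore =
    0.3 * lengthScore + 0.3 * completeness + 0.2 * coherence + 0.2 * (1 - truncationRisk);
  return { lengthScore, completeness, coherence, truncationRisk, overallScore };
}

type NextStep = "cache" | "retry" | "upgrade";

function nextStep(signals: SketchSignals, qualityFloor = 0.7): NextStep {
  if (signals.overallScore >= qualityFloor) return "cache";
  // A cut-off answer is worth retrying; a complete but weak one needs a stronger model.
  return signals.truncationRisk > 0.5 ? "retry" : "upgrade";
}

// Usage: three responses to the same prompt.
const demoPrompt = "How do I reduce token usage?";
const goodStep = nextStep(scoreResponse(
  demoPrompt,
  "You can reduce token usage by summarizing older turns, caching repeated answers, and trimming the system prompt to only what each request needs.",
));
const cutOffStep = nextStep(scoreResponse(demoPrompt, "You can reduce token usage by"));
const weakStep = nextStep(scoreResponse(demoPrompt, "It depends."));
```

Separating the signals from the decision keeps the thresholds easy to adjust without touching the scoring itself.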
With the Anthropic SDK:

    import Anthropic from "@anthropic-ai/sdk";
    import { DecisionLayer } from "llm-surgeon";
    import type { Message } from "llm-surgeon";

    const client = new Anthropic();
    const brain = new DecisionLayer();
    const history: Message[] = [];

    async function chat(userInput: string): Promise<string> {
      const decision = await brain.decide({
        history,
        currentInput: userInput,
        systemPrompt: "You are a helpful assistant.",
        priority: "normal",
        budgetCeiling: 0.01,
        qualityFloor: 0.7,
      });

      if (decision.cacheHit && decision.cachedResponse) {
        history.push({ role: "user", content: userInput });
        history.push({ role: "assistant", content: decision.cachedResponse });
        return decision.cachedResponse;
      }

      const model = decision.recommendedModel === "premium"
        ? "claude-sonnet-4-20250514"
        : "claude-haiku-4-5-20251001";

      const response = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: decision.messages as Anthropic.MessageParam[],
      });

      const text = response.content[0].type === "text" ? response.content[0].text : "";
      await brain.recordResponse(userInput, text);
      history.push({ role: "user", content: userInput });
      history.push({ role: "assistant", content: text });
      return text;
    }

With the OpenAI SDK:

    import OpenAI from "openai";
    import { DecisionLayer } from "llm-surgeon";
    import type { Message } from "llm-surgeon";

    const client = new OpenAI();
    const brain = new DecisionLayer();
    const history: Message[] = [];

    async function chat(userInput: string): Promise<string> {
      const decision = await brain.decide({
        history,
        currentInput: userInput,
        priority: "normal",
        budgetCeiling: 0.005,
      });

      if (decision.cacheHit && decision.cachedResponse) {
        // Keep the history in sync on cache hits too
        history.push({ role: "user", content: userInput });
        history.push({ role: "assistant", content: decision.cachedResponse });
        return decision.cachedResponse;
      }

      // Map the recommended tier to a concrete model
      const model = { fast: "gpt-4o-mini", balanced: "gpt-4o", premium: "o1" }[decision.recommendedModel];

      const response = await client.chat.completions.create({
        model,
        messages: decision.messages as OpenAI.ChatCompletionMessageParam[],
      });

      const text = response.choices[0].message.content ?? "";
      await brain.recordResponse(userInput, text);
      history.push({ role: "user", content: userInput });
      history.push({ role: "assistant", content: text });
      return text;
    }

Local TF-IDF works with zero setup. For higher accuracy, plug in an API provider:

    import { DecisionLayer, OpenAIEmbeddingProvider } from "llm-surgeon";

    const brain = new DecisionLayer({
      cacheSimilarityThreshold: 0.92,
      embeddingProvider: new OpenAIEmbeddingProvider({
        apiKey: process.env.OPENAI_API_KEY!,
        model: "text-embedding-3-small",
      }),
    });

Or with Voyage:

    import { DecisionLayer, VoyageEmbeddingProvider } from "llm-surgeon";

    const brain = new DecisionLayer({
      cacheSimilarityThreshold: 0.92,
      embeddingProvider: new VoyageEmbeddingProvider({
        apiKey: process.env.VOYAGE_API_KEY!,
        model: "voyage-3-lite", // $0.00002/1k tokens
      }),
    });

The provider interface is open — implement EmbeddingProvider for any backend.
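
As a rough idea of what a custom backend might look like, here is a hashed bag-of-words provider that needs no network at all. The interface shape is an assumption: `EmbeddingProviderSketch` with a single async `embed` method is a guess at the contract, so check the library's `types.ts` for the real `EmbeddingProvider` definition before adapting this.

```typescript
// Illustrative sketch: the interface shape below is an assumption, not the
// library's confirmed EmbeddingProvider contract.
interface EmbeddingProviderSketch {
  embed(text: string): Promise<number[]>;
}

// Simple polynomial string hash mapped into [0, dims).
function bucketOf(token: string, dims: number): number {
  let h = 0;
  for (let i = 0; i < token.length; i++) {
    h = (h * 31 + token.charCodeAt(i)) >>> 0;
  }
  return h % dims;
}

// Hashed bag-of-words: count tokens per bucket, then L2-normalize.
function hashEmbed(text: string, dims: number): number[] {
  const vec: number[] = [];
  for (let i = 0; i < dims; i++) vec.push(0);
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    const bucket = bucketOf(token, dims);
    vec[bucket] = (vec[bucket] ?? 0) + 1;
  }
  const norm = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vec : vec.map((v) => v / norm);
}

class HashingEmbeddingProvider implements EmbeddingProviderSketch {
  constructor(private readonly dims = 64) {}

  async embed(text: string): Promise<number[]> {
    // A real backend would call a local model or a remote service here.
    return hashEmbed(text, this.dims);
  }
}
```

Because the vectors are unit length, a dot product between two of them is already their cosine similarity.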
DecisionLayer is the main entry point. It combines compression, caching, normalization, healing, and routing.

    const brain = new DecisionLayer({
      cacheSimilarityThreshold: 0.88,
      defaultQualityFloor: 0.65,
      tokenPriceFast: 0.00025,
      tokenPriceBalanced: 0.003,
      tokenPricePremium: 0.015,
      embeddingProvider: yourProvider,
    });

    const out = await brain.decide({ history, currentInput, systemPrompt, priority, budgetCeiling, qualityFloor });
    const record = await brain.recordResponse(prompt, response);
    const feedback = await brain.reportFeedback(prompt, betterResponse);
    const stats = brain.getCacheStats();
    const flagged = brain.getFlaggedEntries();
    const log = brain.getHealingLog();

SemanticCache is a standalone cache with the normalizer and self-healing built in.

    const cache = new SemanticCache({
      similarityThreshold: 0.88,
      maxEntries: 1000,
      ttlMs: 4 * 60 * 60 * 1000, // 4 hours
      enableNormalizer: true,
      normalizerThreshold: 0.82,
      healing: { qualityThreshold: 0.55, maxHealCount: 3, autoFlagBelowQuality: 0.45 },
    });

    const result = await cache.lookup(prompt);
    await cache.store_entry(prompt, response, qualityScore);
    cache.flagForHealing(prompt, reason);
    await cache.healEntry(prompt, betterResponse, newQuality);

SurgeonEngine handles context compression only: no cache, no routing.

    const engine = new SurgeonEngine({
      maxOutputTokens: 2048,
      workingMemorySize: 6,
      minFactConfidence: 0.6,
    });

    const ctx = engine.buildContext({ history, currentInput, systemPrompt });
    // ctx.messages, ctx.totalTokens, ctx.compressionRatio

Local quality scoring, usable on its own:

    import { estimateQuality, shouldRetry, shouldCache, shouldUpgradeModel } from "llm-surgeon";

    const signals = estimateQuality(prompt, response);
    // signals.overallScore, signals.flags, signals.truncationRisk


    src/
    ├── index.ts                      ← public API + SurgeonEngine
    ├── types.ts                      ← all interfaces
    ├── memory/
    │   ├── working-memory.ts         ← last N raw exchanges
    │   └── episodic-summary.ts       ← adaptive compression of overflow
    ├── extractor/
    │   └── fact-extractor.ts         ← entity extraction (regex, zero API)
    ├── context/
    │   ├── intent-snapshot.ts        ← current intent detection
    │   └── reconstructor.ts          ← 5-layer context assembly
    ├── cache/
    │   ├── semantic-cache.ts         ← TF-IDF cache + normalizer + self-healing
    │   └── redis-adapter.ts          ← optional Redis persistence
    ├── normalizer/
    │   └── query-normalizer.ts       ← canonical form detection + variant mapping
    ├── quality/
    │   └── estimator.ts              ← local quality scoring (no API)
    ├── decision/
    │   └── layer.ts                  ← routing brain + feedback loop
    └── utils/
        ├── tokenizer.ts              ← tiktoken-based counting
        ├── embedder.ts               ← local TF-IDF cosine
        ├── embedding-provider.ts     ← abstract provider interface
        ├── local-provider.ts         ← TF-IDF wrapped as provider
        ├── openai-provider.ts        ← text-embedding-3-small/large
        └── voyage-provider.ts        ← voyage-3 / voyage-3-lite
    tests/
    └── surgery.test.ts               ← 66 tests, zero external deps
    bench/
    └── benchmark.ts                  ← realistic traffic simulation

Run locally:
npm run bench
Results (177 queries, 8 topic clusters, 10 unique queries, 3 traffic waves):
| Metric | Value |
|---|---|
| Hit rate | 69% |
| Token savings | 67% |
| Avg latency | <1ms |
| Monthly savings (50k/day) | ~$219 |
| External API calls | 0 |
| Healing events | auto-corrected bad responses |
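
For a sense of scale, the monthly figure in the table can be turned into a per-query number. This is plain arithmetic on the reported values, assuming a 30-day month; the variable names are just for illustration.

```typescript
// Worked arithmetic from the benchmark table; the 30-day month is an assumption.
const queriesPerDay = 50_000;
const daysPerMonth = 30;
const monthlySavingsUsd = 219;

const queriesPerMonth = queriesPerDay * daysPerMonth; // 1,500,000 queries
const savingsPerQueryUsd = monthlySavingsUsd / queriesPerMonth; // about $0.000146
const savingsPerThousandUsd = savingsPerQueryUsd * 1000; // about $0.146
```

In other words, the reported savings work out to roughly 15 cents per thousand queries at this traffic level.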
npm test # 66 tests across 12 suites, zero external dependencies
- Redis persistence (hydrate cache on restart)
- Observability dashboard (cost/hit rate/quality in real-time)
- Interactive playground (paste a prompt, see the system in action)
MIT