llm-surgeon

Your LLM calls are too expensive, too slow, and too dumb. This fixes all three.

Drop-in middleware that compresses context, caches semantically, normalizes queries, self-heals bad responses, and routes to the cheapest model that meets your quality bar. Zero external API calls by default. No GPU. No framework lock-in.

Benchmark (177 queries, local TF-IDF, no API):
 Hit rate: 69%
 Token savings: 67%
 Latency: <1ms per query
 Monthly savings: ~$219/month at 50k queries/day

Install

npm install llm-surgeon

30-second quickstart

import { DecisionLayer } from "llm-surgeon";
const brain = new DecisionLayer();
// Before calling your LLM:
const decision = await brain.decide({
  history: conversationMessages,
  currentInput: userInput,
  priority: "normal",
});
if (decision.cacheHit) {
  // Serve the cached response: $0 cost, <1ms
  return decision.cachedResponse;
}
// decision.messages = compressed context (60-80% fewer tokens)
// decision.recommendedModel = "fast" | "balanced" | "premium"
const response = await yourLLM(decision.messages);
// Record for cache + quality tracking
await brain.recordResponse(userInput, response);

That's it. Your next similar question is free.

What it does

5-layer context compression

Every conversation grows. By exchange 20, you're sending 15k tokens for what should be 1.5k. llm-surgeon reconstructs context with five layers:

Layer	Function	Typical tokens
Working Memory	Last N raw exchanges	500-800
Episodic Summary	Compressed older history	150-400
Fact Store	Extracted entities (name, budget, tech stack)	30-80
Intent Snapshot	What the user wants right now	20-50
Reconstructed Context	Final payload sent to LLM	1.5k-2.5k
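
As a rough illustration of how the layers in the table might be combined, the sketch below keeps the last N messages verbatim and folds a summary, a fact list, and the current intent into one system message. Every name here (`Msg`, `Layers`, `assembleContext`, `approxTokens`) is hypothetical, and the token count is a crude character-based estimate; this is not the library's reconstructor.

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

interface Layers {
  episodicSummary: string;        // compressed older history
  facts: Record<string, string>;  // extracted entities, e.g. budget or tech stack
  intent: string;                 // what the user wants right now
}

// Crude estimate of ~4 characters per token (an assumption, not a real tokenizer).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function assembleContext(history: Msg[], layers: Layers, workingMemorySize: number) {
  // Working memory: only the last N raw messages survive verbatim.
  const working = workingMemorySize > 0 ? history.slice(-workingMemorySize) : [];
  const factLine = Object.keys(layers.facts)
    .map((key) => `${key}: ${layers.facts[key]}`)
    .join("; ");
  // Summary, facts, and intent are folded into one compact system message.
  const preamble: Msg = {
    role: "system",
    content: [
      `Summary of earlier conversation: ${layers.episodicSummary}`,
      `Known facts: ${factLine}`,
      `Current intent: ${layers.intent}`,
    ].join("\n"),
  };
  const messages = [preamble, ...working];
  const totalTokens = messages.reduce((sum, m) => sum + approxTokens(m.content), 0);
  return { messages, totalTokens };
}
```

The payload size then depends on the working-memory window and the size of the preamble rather than on the full length of the conversation.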

Semantic cache with query normalization

The cache detects similar questions using TF-IDF cosine similarity (or pluggable embedding providers). The query normalizer maps variants to canonical forms before lookup — "how to reduce LLM costs" and "comment diminuer les coûts LLM" hit the same cache entry.

Without normalizer: 0% hit rate on variants
With normalizer: 70% hit rate on the same variants
Zero extra cost.
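
To make the similarity step concrete, here is a minimal sketch of TF-IDF cosine similarity between two prompts. It illustrates the general technique only, not the library's embedder or normalizer: the function names are hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant chosen for the example.

```typescript
// Lowercase word tokens; punctuation is dropped.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Smoothed inverse document frequency over a corpus of prompts.
function buildIdf(corpus: string[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const doc of corpus) {
    for (const term of new Set(tokenize(doc))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }
  const idf = new Map<string, number>();
  df.forEach((count, term) => {
    idf.set(term, Math.log((1 + corpus.length) / (1 + count)) + 1);
  });
  return idf;
}

// Term frequency times IDF; terms missing from the IDF table get weight 1.
function tfidfVector(text: string, idf: Map<string, number>): Map<string, number> {
  const tf = new Map<string, number>();
  for (const term of tokenize(text)) {
    tf.set(term, (tf.get(term) ?? 0) + 1);
  }
  const vec = new Map<string, number>();
  tf.forEach((count, term) => vec.set(term, count * (idf.get(term) ?? 1)));
  return vec;
}

// Cosine similarity of two sparse vectors; 0 when either vector is empty.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    normA += w * w;
    dot += w * (b.get(term) ?? 0);
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}
```

A cache lookup would compare the incoming prompt's vector against stored vectors and treat the best match as a hit when its score clears the similarity threshold.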

Self-healing cache

Bad responses don't stay cached. The quality estimator flags entries below threshold. When a better response arrives (via user feedback or a fresh LLM call), the cache replaces the old entry automatically:

// User signals a bad cached response
await brain.reportFeedback(prompt, betterResponse);
// Cache heals: quality 0.42 → 0.91, old response preserved in log
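
The replace-and-log behaviour described above could look roughly like the following. This is a sketch under assumptions, not the library's implementation: `HealingCache` and its methods are hypothetical, it uses exact-match keys instead of semantic lookup, and the defaults simply reuse the `qualityThreshold` and `maxHealCount` values shown in the SemanticCache options.

```typescript
interface CacheEntry {
  response: string;
  quality: number;    // 0..1 score from some estimator
  healCount: number;  // how many times this entry has been replaced
  flagged: boolean;   // true while quality is below the threshold
}

interface HealingEvent {
  prompt: string;
  oldResponse: string;
  oldQuality: number;
  newQuality: number;
}

class HealingCache {
  private entries = new Map<string, CacheEntry>();
  readonly log: HealingEvent[] = [];
  private qualityThreshold: number;
  private maxHealCount: number;

  constructor(qualityThreshold = 0.55, maxHealCount = 3) {
    this.qualityThreshold = qualityThreshold;
    this.maxHealCount = maxHealCount;
  }

  store(prompt: string, response: string, quality: number): void {
    const flagged = quality < this.qualityThreshold;
    this.entries.set(prompt, { response, quality, healCount: 0, flagged });
  }

  get(prompt: string): CacheEntry | undefined {
    return this.entries.get(prompt);
  }

  // Replace an entry only if the candidate scores higher and heals remain.
  heal(prompt: string, betterResponse: string, newQuality: number): boolean {
    const entry = this.entries.get(prompt);
    if (!entry) return false;
    if (entry.healCount >= this.maxHealCount || newQuality <= entry.quality) return false;
    // Preserve the old response in the log before overwriting it.
    this.log.push({ prompt, oldResponse: entry.response, oldQuality: entry.quality, newQuality });
    this.entries.set(prompt, {
      response: betterResponse,
      quality: newQuality,
      healCount: entry.healCount + 1,
      flagged: newQuality < this.qualityThreshold,
    });
    return true;
  }
}
```

Capping the heal count is one way to keep a contested entry from being rewritten indefinitely.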

Intelligent model routing

Each request is routed to the cheapest model that meets the quality bar. Requests with priority "critical" go to premium. Hitting the budget ceiling triggers a graceful downgrade. Every decision includes a human-readable reason:

model=balanced | priority=normal | compressed=-72% | nearest_cache=0.834
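
One way to picture the routing rule is the sketch below. It is illustrative only: the `route` function and the `expectedQuality` figures are hypothetical, and the prices reuse the `tokenPrice*` defaults from the API reference, interpreted here as dollars per 1k tokens (an assumption).

```typescript
type Tier = "fast" | "balanced" | "premium";

interface RouteInput {
  priority: "normal" | "critical";
  estimatedTokens: number;
  qualityFloor: number;    // 0..1
  budgetCeiling?: number;  // max spend for this request, in dollars
}

// Ordered cheapest first. Quality figures are made up for the example.
const TIERS: { tier: Tier; pricePer1k: number; expectedQuality: number }[] = [
  { tier: "fast", pricePer1k: 0.00025, expectedQuality: 0.6 },
  { tier: "balanced", pricePer1k: 0.003, expectedQuality: 0.8 },
  { tier: "premium", pricePer1k: 0.015, expectedQuality: 0.95 },
];

function route(input: RouteInput): { tier: Tier; reason: string } {
  const cost = (pricePer1k: number): number => (input.estimatedTokens / 1000) * pricePer1k;
  // Critical requests start at premium; others take the cheapest tier meeting the floor.
  let index =
    input.priority === "critical"
      ? TIERS.length - 1
      : TIERS.findIndex((t) => t.expectedQuality >= input.qualityFloor);
  if (index === -1) index = TIERS.length - 1; // no tier meets the floor: use the best one
  let downgraded = false;
  // Graceful downgrade: step down while the estimated cost exceeds the ceiling.
  while (
    index > 0 &&
    input.budgetCeiling !== undefined &&
    cost(TIERS[index].pricePer1k) > input.budgetCeiling
  ) {
    index -= 1;
    downgraded = true;
  }
  const tier = TIERS[index].tier;
  const reason = `model=${tier} | priority=${input.priority}` + (downgraded ? " | downgraded=budget" : "");
  return { tier, reason };
}
```

In this sketch the budget ceiling takes precedence over both priority and the quality floor, which is a design choice of the example rather than documented behaviour.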

Quality estimation

Every response is scored locally (no API call) on length, coherence, completeness, and truncation risk. The system decides: cache it, retry it, or upgrade the model.
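
For a sense of what local scoring can mean, the sketch below combines three cheap heuristics: length, keyword coverage as a stand-in for completeness, and a truncation check. It is not the library's estimator: `scoreResponse`, the weights, and the cutoffs are all assumptions, and coherence is left out entirely.

```typescript
interface QualitySignals {
  lengthScore: number;     // 0..1, saturates at 200 characters
  completeness: number;    // share of prompt keywords echoed in the response
  truncationRisk: number;  // 1 if the response looks cut off, else 0
  overallScore: number;    // weighted blend of the above
}

function scoreResponse(prompt: string, response: string): QualitySignals {
  const text = response.trim();
  // Length: very short answers score low; 200 characters is an arbitrary cutoff.
  const lengthScore = Math.min(text.length / 200, 1);
  // Completeness: distinct prompt words of 4+ characters that appear in the response.
  const words: string[] = prompt.toLowerCase().match(/[a-z0-9]{4,}/g) ?? [];
  const keywords = Array.from(new Set(words));
  const lower = text.toLowerCase();
  const matched = keywords.filter((k) => lower.indexOf(k) !== -1).length;
  const completeness = keywords.length === 0 ? 1 : matched / keywords.length;
  // Truncation risk: the text does not end on terminal punctuation or a closing bracket/quote.
  const truncationRisk = /[.!?)"'`\]]$/.test(text) ? 0 : 1;
  const overallScore = 0.3 * lengthScore + 0.4 * completeness + 0.3 * (1 - truncationRisk);
  return { lengthScore, completeness, truncationRisk, overallScore };
}
```

A caller could then compare `overallScore` against a quality floor to decide whether to cache the response, retry, or move to a stronger model.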

With Anthropic SDK

import Anthropic from "@anthropic-ai/sdk";
import { DecisionLayer } from "llm-surgeon";
import type { Message } from "llm-surgeon";
const client = new Anthropic();
const brain = new DecisionLayer();
const history: Message[] = [];
async function chat(userInput: string): Promise<string> {
  const decision = await brain.decide({
    history,
    currentInput: userInput,
    systemPrompt: "You are a helpful assistant.",
    priority: "normal",
    budgetCeiling: 0.01,
    qualityFloor: 0.7,
  });
  if (decision.cacheHit && decision.cachedResponse) {
    history.push({ role: "user", content: userInput });
    history.push({ role: "assistant", content: decision.cachedResponse });
    return decision.cachedResponse;
  }
  const model = decision.recommendedModel === "premium"
    ? "claude-sonnet-4-20250514"
    : "claude-haiku-4-5-20251001";
  const response = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: decision.messages as Anthropic.MessageParam[],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "";
  await brain.recordResponse(userInput, text);
  history.push({ role: "user", content: userInput });
  history.push({ role: "assistant", content: text });
  return text;
}

With OpenAI SDK

import OpenAI from "openai";
import { DecisionLayer } from "llm-surgeon";
import type { Message } from "llm-surgeon";
const client = new OpenAI();
const brain = new DecisionLayer();
const history: Message[] = [];
async function chat(userInput: string): Promise<string> {
  const decision = await brain.decide({
    history,
    currentInput: userInput,
    priority: "normal",
    budgetCeiling: 0.005,
  });
  if (decision.cacheHit && decision.cachedResponse) {
    // Keep history in sync on cache hits, as in the Anthropic example
    history.push({ role: "user", content: userInput });
    history.push({ role: "assistant", content: decision.cachedResponse });
    return decision.cachedResponse;
  }
  const model = { fast: "gpt-4o-mini", balanced: "gpt-4o", premium: "o1" }[decision.recommendedModel];
  const response = await client.chat.completions.create({
    model,
    messages: decision.messages as OpenAI.ChatCompletionMessageParam[],
  });
  const text = response.choices[0].message.content ?? "";
  await brain.recordResponse(userInput, text);
  history.push({ role: "user", content: userInput });
  history.push({ role: "assistant", content: text });
  return text;
}

Embedding providers (optional)

Local TF-IDF works with zero setup. For higher accuracy, plug in an API provider:

import { DecisionLayer, OpenAIEmbeddingProvider } from "llm-surgeon";
const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.92,
  embeddingProvider: new OpenAIEmbeddingProvider({
    apiKey: process.env.OPENAI_API_KEY!,
    model: "text-embedding-3-small",
  }),
});

import { DecisionLayer, VoyageEmbeddingProvider } from "llm-surgeon";
const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.92,
  embeddingProvider: new VoyageEmbeddingProvider({
    apiKey: process.env.VOYAGE_API_KEY!,
    model: "voyage-3-lite", // $0.00002/1k tokens
  }),
});

The provider interface is open — implement EmbeddingProvider for any backend.
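
A custom provider might look like the sketch below. The real `EmbeddingProvider` interface is not shown in this README, so the shape used here (`EmbeddingProviderLike` with a single async `embed` method) is a guess, and `HashingProvider` is a hypothetical, dependency-free example based on feature hashing with an FNV-1a hash.

```typescript
// Hypothetical shape; check the library's types for the actual interface.
interface EmbeddingProviderLike {
  embed(text: string): Promise<number[]>;
}

// Deterministic feature hashing: each token increments one of `dims` buckets.
function hashEmbed(text: string, dims = 64): number[] {
  const vec = new Array<number>(dims).fill(0);
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    let h = 2166136261; // FNV-1a 32-bit offset basis
    for (let i = 0; i < token.length; i++) {
      h ^= token.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0; // FNV prime, kept as unsigned 32-bit
    }
    vec[h % dims] += 1;
  }
  // L2-normalize so that a dot product equals cosine similarity.
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

class HashingProvider implements EmbeddingProviderLike {
  private readonly dims: number;

  constructor(dims = 64) {
    this.dims = dims;
  }

  async embed(text: string): Promise<number[]> {
    return hashEmbed(text, this.dims);
  }
}
```

Hashed bag-of-words vectors carry no semantic knowledge; the point is only to show where a real backend call would go.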

API reference

DecisionLayer

The main entry point. Combines compression, cache, normalization, healing, and routing.

const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.88,
  defaultQualityFloor: 0.65,
  tokenPriceFast: 0.00025,
  tokenPriceBalanced: 0.003,
  tokenPricePremium: 0.015,
  embeddingProvider: yourProvider,
});
const out = await brain.decide({ history, currentInput, systemPrompt, priority, budgetCeiling, qualityFloor });
const record = await brain.recordResponse(prompt, response);
const feedback = await brain.reportFeedback(prompt, betterResponse);
const stats = brain.getCacheStats();
const flagged = brain.getFlaggedEntries();
const log = brain.getHealingLog();

SemanticCache

Standalone cache with normalizer and self-healing built in.

const cache = new SemanticCache({
  similarityThreshold: 0.88,
  maxEntries: 1000,
  ttlMs: 4 * 60 * 60 * 1000,
  enableNormalizer: true,
  normalizerThreshold: 0.82,
  healing: { qualityThreshold: 0.55, maxHealCount: 3, autoFlagBelowQuality: 0.45 },
});
const result = await cache.lookup(prompt);
await cache.store_entry(prompt, response, qualityScore);
cache.flagForHealing(prompt, reason);
await cache.healEntry(prompt, betterResponse, newQuality);

SurgeonEngine

Context compression only — no cache, no routing.

const engine = new SurgeonEngine({ maxOutputTokens: 2048, workingMemorySize: 6, minFactConfidence: 0.6 });
const ctx = engine.buildContext({ history, currentInput, systemPrompt });
// ctx.messages, ctx.totalTokens, ctx.compressionRatio

Quality estimator

import { estimateQuality, shouldRetry, shouldCache, shouldUpgradeModel } from "llm-surgeon";
const signals = estimateQuality(prompt, response);
// signals.overallScore, signals.flags, signals.truncationRisk

Architecture

src/
├── index.ts ← public API + SurgeonEngine
├── types.ts ← all interfaces
├── memory/
│ ├── working-memory.ts ← last N raw exchanges
│ └── episodic-summary.ts ← adaptive compression of overflow
├── extractor/
│ └── fact-extractor.ts ← entity extraction (regex, zero API)
├── context/
│ ├── intent-snapshot.ts ← current intent detection
│ └── reconstructor.ts ← 5-layer context assembly
├── cache/
│ ├── semantic-cache.ts ← TF-IDF cache + normalizer + self-healing
│ └── redis-adapter.ts ← optional Redis persistence
├── normalizer/
│ └── query-normalizer.ts ← canonical form detection + variant mapping
├── quality/
│ └── estimator.ts ← local quality scoring (no API)
├── decision/
│ └── layer.ts ← routing brain + feedback loop
└── utils/
 ├── tokenizer.ts ← tiktoken-based counting
 ├── embedder.ts ← local TF-IDF cosine
 ├── embedding-provider.ts ← abstract provider interface
 ├── local-provider.ts ← TF-IDF wrapped as provider
 ├── openai-provider.ts ← text-embedding-3-small/large
 └── voyage-provider.ts ← voyage-3 / voyage-3-lite
tests/
└── surgery.test.ts ← 66 tests, zero external deps
bench/
└── benchmark.ts ← realistic traffic simulation

Benchmark

Run locally:

npm run bench

Results (177 queries, 8 topic clusters, 10 unique queries, 3 traffic waves):

Metric	Value
Hit rate	69%
Token savings	67%
Avg latency	<1ms
Monthly savings (50k/day)	~$219
External API calls	0
Healing events	auto-corrected bad responses

Tests

npm test
# 66 tests across 12 suites, zero external dependencies

Roadmap

Redis persistence (hydrate cache on restart)
Observability dashboard (cost/hit rate/quality in real-time)
Interactive playground (paste a prompt, see the system in action)

License

MIT
