Thinklanceai/llm-surgeon

llm-surgeon

Your LLM calls are too expensive, too slow, and too dumb. This fixes all three.

Drop-in middleware that compresses context, caches semantically, normalizes queries, self-heals bad responses, and routes to the cheapest model that meets your quality bar. Zero external API calls. No GPU. No framework lock-in.

Benchmark (177 queries, local TF-IDF, no API):
 Hit rate: 69%
 Token savings: 67%
 Latency: <1ms per query
 Monthly saving: ~$219/month at 50k queries/day

Install

npm install llm-surgeon

30-second quickstart

import { DecisionLayer } from "llm-surgeon";

const brain = new DecisionLayer();

// Before calling your LLM:
const decision = await brain.decide({
  history: conversationMessages,
  currentInput: userInput,
  priority: "normal",
});

if (decision.cacheHit) {
  // Serve cached response: $0 cost, <1ms
  return decision.cachedResponse;
}

// decision.messages = compressed context (60-80% fewer tokens)
// decision.recommendedModel = "fast" | "balanced" | "premium"
const response = await yourLLM(decision.messages);

// Record for cache + quality tracking
await brain.recordResponse(userInput, response);

That's it. Your next similar question is free.


What it does

5-layer context compression

Every conversation grows. By exchange 20, you're sending 15k tokens for what should be 1.5k. llm-surgeon reconstructs context with five layers:

| Layer | Function | Typical tokens |
|---|---|---|
| Working Memory | Last N raw exchanges | 500-800 |
| Episodic Summary | Compressed older history | 150-400 |
| Fact Store | Extracted entities (name, budget, tech stack) | 30-80 |
| Intent Snapshot | What the user wants right now | 20-50 |
| Reconstructed Context | Final payload sent to LLM | 1.5k-2.5k |
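
To make the layering concrete, here is a minimal sketch of how such a payload could be assembled. It is not llm-surgeon's implementation: the type names, `buildLayeredContext`, and the naive truncation used in place of a real summarizer are all assumptions made for illustration.

```typescript
// Hypothetical types and helpers for illustration; not llm-surgeon's internals.
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface LayeredInput {
  history: ChatMessage[];
  currentInput: string;
  facts: Record<string, string>; // e.g. { budget: "5k" }
  workingMemorySize: number; // number of recent messages kept verbatim
}

// Crude stand-in for a summarizer: keep only the start of each older message.
function summarizeOlder(messages: ChatMessage[], maxChars: number = 80): string {
  return messages.map((m) => `${m.role}: ${m.content.slice(0, maxChars)}`).join(" | ");
}

function buildLayeredContext(input: LayeredInput): ChatMessage[] {
  const { history, currentInput, facts, workingMemorySize } = input;
  const splitAt = Math.max(0, history.length - workingMemorySize);
  const older = history.slice(0, splitAt); // feeds the episodic summary
  const recent = history.slice(splitAt); // working memory, kept raw

  const preamble: string[] = [];
  if (older.length > 0) {
    preamble.push(`Summary of earlier conversation: ${summarizeOlder(older)}`);
  }
  const factKeys = Object.keys(facts);
  if (factKeys.length > 0) {
    preamble.push(`Known facts: ${factKeys.map((k) => `${k}=${facts[k]}`).join(", ")}`);
  }
  preamble.push(`Current intent: ${currentInput.slice(0, 120)}`);

  // Final payload: one compact system message, recent turns, then the new input.
  return [
    { role: "system", content: preamble.join("\n") },
    ...recent,
    { role: "user", content: currentInput },
  ];
}
```

With a 10-message history and `workingMemorySize: 4`, this returns 6 messages: the system preamble, the last 4 raw messages, and the current input.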

Semantic cache with query normalization

The cache detects similar questions using TF-IDF cosine similarity (or pluggable embedding providers). The query normalizer maps variants to canonical forms before lookup — "how to reduce LLM costs" and "comment diminuer les coûts LLM" hit the same cache entry.

Without normalizer: 0% hit rate on variants
With normalizer: 70% hit rate on same variants
Zero extra cost.
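
The similarity lookup can be pictured with a small TF-IDF cosine sketch. This is an illustration under assumptions, not the library's code: the function names are hypothetical, the tokenizer is ASCII-only, and the cross-language normalization described above is not covered here.

```typescript
// Illustrative sketch only; names are hypothetical, not llm-surgeon's API.
interface CachedAnswer {
  prompt: string;
  response: string;
}

// ASCII-only tokenizer: lowercases and splits on anything that is not a-z or 0-9.
function tokenizeWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// TF-IDF weights for `text`, with smoothed IDF computed over `corpus`.
function tfidfWeights(text: string, corpus: string[][]): Map<string, number> {
  const tokens = tokenizeWords(text);
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);

  const weights = new Map<string, number>();
  counts.forEach((count, term) => {
    const docFreq = corpus.filter((doc) => doc.indexOf(term) !== -1).length;
    const idf = Math.log((1 + corpus.length) / (1 + docFreq)) + 1; // always > 0
    weights.set(term, (count / tokens.length) * idf);
  });
  return weights;
}

function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the most similar cached entry at or above `threshold`, or null on a miss.
function findSimilar(query: string, entries: CachedAnswer[], threshold: number): CachedAnswer | null {
  const corpus = entries.map((e) => tokenizeWords(e.prompt));
  const queryVec = tfidfWeights(query, corpus);
  let best: CachedAnswer | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(queryVec, tfidfWeights(entry.prompt, corpus));
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}
```

A query that differs only in casing or punctuation from a cached prompt scores close to 1 and hits; a query sharing no terms with any cached prompt scores 0 and misses.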

Self-healing cache

Bad responses don't stay cached. The quality estimator flags entries below threshold. When a better response arrives (via user feedback or a fresh LLM call), the cache replaces the old entry automatically:

// User signals a bad cached response
await brain.reportFeedback(prompt, betterResponse);
// Cache heals: quality 0.42 → 0.91, old response preserved in log
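
A rough sketch of the replace-if-better logic follows. The class, field names, and default thresholds are assumptions for illustration (the defaults mirror the `autoFlagBelowQuality` and `maxHealCount` values shown in the API reference); the real cache may behave differently.

```typescript
// Hypothetical sketch of a self-healing cache entry store; not llm-surgeon's code.
interface HealableEntry {
  response: string;
  quality: number;
  healCount: number;
  flagged: boolean;
}

interface HealingLogItem {
  prompt: string;
  oldResponse: string;
  oldQuality: number;
  newQuality: number;
}

class HealingCacheSketch {
  private readonly entries = new Map<string, HealableEntry>();
  private readonly flagBelow: number;
  private readonly maxHealCount: number;
  readonly healingLog: HealingLogItem[] = [];

  constructor(flagBelow: number = 0.45, maxHealCount: number = 3) {
    this.flagBelow = flagBelow;
    this.maxHealCount = maxHealCount;
  }

  store(prompt: string, response: string, quality: number): void {
    // Entries scoring below the flag threshold are marked for healing.
    this.entries.set(prompt, { response, quality, healCount: 0, flagged: quality < this.flagBelow });
  }

  get(prompt: string): HealableEntry | undefined {
    return this.entries.get(prompt);
  }

  // Replace the entry only if the new response scores strictly higher and the
  // heal budget is not exhausted; the old response is preserved in the log.
  heal(prompt: string, betterResponse: string, newQuality: number): boolean {
    const current = this.entries.get(prompt);
    if (!current || newQuality <= current.quality || current.healCount >= this.maxHealCount) {
      return false;
    }
    this.healingLog.push({
      prompt,
      oldResponse: current.response,
      oldQuality: current.quality,
      newQuality,
    });
    this.entries.set(prompt, {
      response: betterResponse,
      quality: newQuality,
      healCount: current.healCount + 1,
      flagged: newQuality < this.flagBelow,
    });
    return true;
  }
}
```

Storing a response at quality 0.42 flags it; a later `heal` call at 0.91 swaps the response in and records the old one in `healingLog`.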

Intelligent model routing

Each request gets routed to the cheapest model that meets the quality bar. Priority "critical" goes to premium. Budget ceiling hit triggers graceful downgrade. The decision includes a human-readable reason:

model=balanced | priority=normal | compressed=-72% | nearest_cache=0.834
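
The routing rule can be sketched as a small pure function. Everything here is an assumption for illustration: the expected-quality numbers per tier are invented, and the prices reuse the `tokenPrice*` values from the API reference on the assumption that they are per 1k tokens.

```typescript
// Hypothetical routing sketch; tier qualities and price units are assumptions.
type ModelTier = "fast" | "balanced" | "premium";

interface RouteRequest {
  priority: "normal" | "critical";
  estimatedTokens: number;
  qualityFloor: number;
  budgetCeiling?: number; // max spend for this request, same currency as prices
}

interface RouteDecision {
  model: ModelTier;
  estimatedCost: number;
  reason: string;
}

// Ordered from cheapest to most expensive.
const ROUTING_TIERS: { name: ModelTier; pricePer1k: number; expectedQuality: number }[] = [
  { name: "fast", pricePer1k: 0.00025, expectedQuality: 0.6 },
  { name: "balanced", pricePer1k: 0.003, expectedQuality: 0.8 },
  { name: "premium", pricePer1k: 0.015, expectedQuality: 0.95 },
];

function routeRequest(req: RouteRequest): RouteDecision {
  // Critical requests start at premium; so do requests whose floor no tier meets.
  let index = ROUTING_TIERS.length - 1;
  if (req.priority !== "critical") {
    for (let i = 0; i < ROUTING_TIERS.length; i++) {
      if (ROUTING_TIERS[i].expectedQuality >= req.qualityFloor) {
        index = i; // cheapest tier that meets the quality bar
        break;
      }
    }
  }

  const costAt = (i: number): number => (req.estimatedTokens / 1000) * ROUTING_TIERS[i].pricePer1k;

  // Graceful downgrade: step down one tier at a time while over budget.
  let downgraded = false;
  while (req.budgetCeiling !== undefined && index > 0 && costAt(index) > req.budgetCeiling) {
    index -= 1;
    downgraded = true;
  }

  const tier = ROUTING_TIERS[index];
  const parts = [`model=${tier.name}`, `priority=${req.priority}`];
  if (downgraded) parts.push("downgraded=budget");
  return { model: tier.name, estimatedCost: costAt(index), reason: parts.join(" | ") };
}
```

In this sketch the budget ceiling wins over priority: a critical request that would exceed the ceiling at premium is stepped down to the most capable tier that fits.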

Quality estimation

Every response is scored locally (no API call) on length, coherence, completeness, and truncation risk. The system decides: cache it, retry it, or upgrade the model.
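
As a rough idea of what local scoring can look like, the sketch below combines three cheap heuristics. It is not the library's estimator: the function name, weights, and thresholds are assumptions, and the coherence signal mentioned above is left out for brevity.

```typescript
// Hypothetical heuristic scorer; weights and thresholds are illustrative assumptions.
interface QualitySketchSignals {
  lengthScore: number;
  completenessScore: number;
  truncationRisk: number;
  overallScore: number;
  flags: string[];
}

function estimateQualitySketch(prompt: string, response: string): QualitySketchSignals {
  const text = response.trim();
  const flags: string[] = [];

  // Length: saturates at 40 words; very short answers are flagged.
  const wordCount = text.length === 0 ? 0 : text.split(/\s+/).length;
  const lengthScore = Math.min(1, wordCount / 40);
  if (wordCount < 5) flags.push("too_short");

  // Truncation: a response that does not end on closing punctuation is suspect.
  const endsCleanly = /[.!?)\]}"']$/.test(text);
  const truncationRisk = text.length === 0 ? 1 : endsCleanly ? 0.1 : 0.8;
  if (truncationRisk > 0.5) flags.push("possibly_truncated");

  // Completeness: share of longer prompt words that reappear in the response.
  const keywords = prompt.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3);
  const lowerText = text.toLowerCase();
  const covered = keywords.filter((k) => lowerText.indexOf(k) !== -1).length;
  const completenessScore = keywords.length === 0 ? 1 : covered / keywords.length;

  const overallScore = 0.3 * lengthScore + 0.4 * completenessScore + 0.3 * (1 - truncationRisk);
  return { lengthScore, completenessScore, truncationRisk, overallScore, flags };
}
```

A caller could then cache responses above one threshold, retry those flagged `possibly_truncated`, and upgrade the model when the score stays low.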


With Anthropic SDK

import Anthropic from "@anthropic-ai/sdk";
import { DecisionLayer } from "llm-surgeon";
import type { Message } from "llm-surgeon";
const client = new Anthropic();
const brain = new DecisionLayer();
const history: Message[] = [];
async function chat(userInput: string): Promise<string> {
  const decision = await brain.decide({
    history,
    currentInput: userInput,
    systemPrompt: "You are a helpful assistant.",
    priority: "normal",
    budgetCeiling: 0.01,
    qualityFloor: 0.7,
  });

  // Cache hit: record the exchange in history and skip the API call
  if (decision.cacheHit && decision.cachedResponse) {
    history.push({ role: "user", content: userInput });
    history.push({ role: "assistant", content: decision.cachedResponse });
    return decision.cachedResponse;
  }

  const model = decision.recommendedModel === "premium"
    ? "claude-sonnet-4-20250514"
    : "claude-haiku-4-5-20251001";

  const response = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: decision.messages as Anthropic.MessageParam[],
  });

  const text = response.content[0].type === "text" ? response.content[0].text : "";
  await brain.recordResponse(userInput, text);
  history.push({ role: "user", content: userInput });
  history.push({ role: "assistant", content: text });
  return text;
}

With OpenAI SDK

import OpenAI from "openai";
import { DecisionLayer } from "llm-surgeon";
import type { Message } from "llm-surgeon";
const client = new OpenAI();
const brain = new DecisionLayer();
const history: Message[] = [];
async function chat(userInput: string): Promise<string> {
  const decision = await brain.decide({
    history,
    currentInput: userInput,
    priority: "normal",
    budgetCeiling: 0.005,
  });

  if (decision.cacheHit && decision.cachedResponse) {
    // Keep history in sync on cache hits too, so later turns see this exchange
    history.push({ role: "user", content: userInput });
    history.push({ role: "assistant", content: decision.cachedResponse });
    return decision.cachedResponse;
  }

  // Map the recommended tier to a concrete OpenAI model
  const model = { fast: "gpt-4o-mini", balanced: "gpt-4o", premium: "o1" }[decision.recommendedModel];

  const response = await client.chat.completions.create({
    model,
    messages: decision.messages as OpenAI.ChatCompletionMessageParam[],
  });

  const text = response.choices[0].message.content ?? "";
  await brain.recordResponse(userInput, text);
  history.push({ role: "user", content: userInput });
  history.push({ role: "assistant", content: text });
  return text;
}

Embedding providers (optional)

Local TF-IDF works with zero setup. For higher accuracy, plug in an API provider:

OpenAI:

import { DecisionLayer, OpenAIEmbeddingProvider } from "llm-surgeon";

const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.92,
  embeddingProvider: new OpenAIEmbeddingProvider({
    apiKey: process.env.OPENAI_API_KEY!,
    model: "text-embedding-3-small",
  }),
});

Voyage:

import { DecisionLayer, VoyageEmbeddingProvider } from "llm-surgeon";

const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.92,
  embeddingProvider: new VoyageEmbeddingProvider({
    apiKey: process.env.VOYAGE_API_KEY!,
    model: "voyage-3-lite", // $0.00002/1k tokens
  }),
});

The provider interface is open — implement EmbeddingProvider for any backend.
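
The README does not show the shape of `EmbeddingProvider`, so the sketch below assumes a single async `embed(text)` method returning a numeric vector; check `types.ts` for the real contract before adapting it. `EmbeddingProviderSketch`, `hashTrigrams`, and `TrigramProvider` are hypothetical names, and the hashed character-trigram embedding is just a dependency-free stand-in for a real backend.

```typescript
// ASSUMED interface shape; the actual EmbeddingProvider in llm-surgeon may differ.
interface EmbeddingProviderSketch {
  embed(text: string): Promise<number[]>;
}

// Hash each character trigram into one of `dims` buckets, then L2-normalize.
function hashTrigrams(text: string, dims: number): number[] {
  const vec: number[] = [];
  for (let i = 0; i < dims; i++) vec.push(0);

  const padded = ` ${text.toLowerCase().trim()} `;
  for (let i = 0; i + 3 <= padded.length; i++) {
    let hash = 0;
    for (let j = i; j < i + 3; j++) {
      hash = (hash * 31 + padded.charCodeAt(j)) >>> 0; // keep within unsigned 32-bit
    }
    vec[hash % dims] += 1;
  }

  let norm = 0;
  for (const v of vec) norm += v * v;
  norm = Math.sqrt(norm);
  return norm === 0 ? vec : vec.map((v) => v / norm);
}

class TrigramProvider implements EmbeddingProviderSketch {
  private readonly dims: number;

  constructor(dims: number = 256) {
    this.dims = dims;
  }

  embed(text: string): Promise<number[]> {
    // A real backend would call a model or HTTP API here.
    return Promise.resolve(hashTrigrams(text, this.dims));
  }
}
```

Because the vectors are unit length, the dot product of two embeddings is their cosine similarity, which is the quantity a similarity threshold would be compared against.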


API reference

DecisionLayer

The main entry point. Combines compression, cache, normalization, healing, and routing.

const brain = new DecisionLayer({
  cacheSimilarityThreshold: 0.88,
  defaultQualityFloor: 0.65,
  tokenPriceFast: 0.00025,
  tokenPriceBalanced: 0.003,
  tokenPricePremium: 0.015,
  embeddingProvider: yourProvider,
});

const out = await brain.decide({ history, currentInput, systemPrompt, priority, budgetCeiling, qualityFloor });
const record = await brain.recordResponse(prompt, response);
const feedback = await brain.reportFeedback(prompt, betterResponse);
const stats = brain.getCacheStats();
const flagged = brain.getFlaggedEntries();
const log = brain.getHealingLog();

SemanticCache

Standalone cache with normalizer and self-healing built in.

const cache = new SemanticCache({
  similarityThreshold: 0.88,
  maxEntries: 1000,
  ttlMs: 4 * 60 * 60 * 1000, // 4 hours
  enableNormalizer: true,
  normalizerThreshold: 0.82,
  healing: { qualityThreshold: 0.55, maxHealCount: 3, autoFlagBelowQuality: 0.45 },
});

const result = await cache.lookup(prompt);
await cache.store_entry(prompt, response, qualityScore);
cache.flagForHealing(prompt, reason);
await cache.healEntry(prompt, betterResponse, newQuality);

SurgeonEngine

Context compression only — no cache, no routing.

const engine = new SurgeonEngine({ maxOutputTokens: 2048, workingMemorySize: 6, minFactConfidence: 0.6 });
const ctx = engine.buildContext({ history, currentInput, systemPrompt });
// ctx.messages, ctx.totalTokens, ctx.compressionRatio

Quality estimator

import { estimateQuality, shouldRetry, shouldCache, shouldUpgradeModel } from "llm-surgeon";
const signals = estimateQuality(prompt, response);
// signals.overallScore, signals.flags, signals.truncationRisk

Architecture

src/
├── index.ts                    ← public API + SurgeonEngine
├── types.ts                    ← all interfaces
├── memory/
│   ├── working-memory.ts       ← last N raw exchanges
│   └── episodic-summary.ts     ← adaptive compression of overflow
├── extractor/
│   └── fact-extractor.ts       ← entity extraction (regex, zero API)
├── context/
│   ├── intent-snapshot.ts      ← current intent detection
│   └── reconstructor.ts        ← 5-layer context assembly
├── cache/
│   ├── semantic-cache.ts       ← TF-IDF cache + normalizer + self-healing
│   └── redis-adapter.ts        ← optional Redis persistence
├── normalizer/
│   └── query-normalizer.ts     ← canonical form detection + variant mapping
├── quality/
│   └── estimator.ts            ← local quality scoring (no API)
├── decision/
│   └── layer.ts                ← routing brain + feedback loop
└── utils/
    ├── tokenizer.ts            ← tiktoken-based counting
    ├── embedder.ts             ← local TF-IDF cosine
    ├── embedding-provider.ts   ← abstract provider interface
    ├── local-provider.ts       ← TF-IDF wrapped as provider
    ├── openai-provider.ts      ← text-embedding-3-small/large
    └── voyage-provider.ts      ← voyage-3 / voyage-3-lite
tests/
└── surgery.test.ts             ← 66 tests, zero external deps
bench/
└── benchmark.ts                ← realistic traffic simulation

Benchmark

Run locally:

npm run bench

Results (177 queries, 8 topic clusters, 10 unique queries, 3 traffic waves):

| Metric | Value |
|---|---|
| Hit rate | 69% |
| Token savings | 67% |
| Avg latency | <1ms |
| Monthly savings (50k/day) | ~$219 |
| External API calls | 0 |
| Healing events | auto-corrected bad responses |

Tests

npm test
# 66 tests across 12 suites, zero external dependencies

Roadmap

  • Redis persistence (hydrate cache on restart)
  • Observability dashboard (cost/hit rate/quality in real-time)
  • Interactive playground (paste a prompt, see the system in action)

License

MIT
