Heuristic Prompt Compression: Cut Your Context Window Usage Without Losing Key Information

DEV Community

            ^\s*(?:Click here|Learn more|Subscribe|Sign up).*$',
            r'^\s*(?:Copyright|All rights reserved|Terms of service).*$',
            r'^\s*(?:Navigation|Menu|Footer|Header)\s*$',
        ]
        for p in patterns:
            text = re.sub(p, '', text, flags=re.MULTILINE | re.IGNORECASE)
        return text

    def _score_sentences(self, sentences: list[str]) -> list[float]:
        # TF-IDF inspired scoring without external libraries
        all_words = [s.lower().split() for s in sentences]
        word_freq = Counter(w for words in all_words for w in words)
        total_words = sum(word_freq.values())
        scores = []
        for words in all_words:
            if not words:
                scores.append(0.0)
                continue
            # Frequency score: prefer sentences with less common words
            freq_score = sum(1 / (word_freq[w] / total_words + 0.01) for w in words) / len(words)
            # Length penalty: very short and very long sentences score lower
            length_score = 1.0 - abs(len(words) - 20) / 100
            # Position bonus: first and last sentences in paragraphs matter more
            scores.append(freq_score * max(0.1, length_score))
        return scores

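The scoring heuristic: sentences with less common words score higher, on the assumption that rare words carry more content. Sentences much shorter or much longer than about 20 words (fragments and run-ons) are penalized in proportion to their distance from that length, with the length factor floored at 0.1. First and last sentences of paragraphs are meant to receive a position bonus; the excerpt above mentions it in a comment but does not apply it.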
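To make the arithmetic concrete, here is a minimal standalone sketch of the same scoring idea. `score_sentences` mirrors the rarity and length terms from the excerpt; `apply_position_bonus` and its 1.2 factor are assumptions added for illustration, since the excerpt does not show how the bonus is computed.

```python
from collections import Counter


def score_sentences(sentences: list[str]) -> list[float]:
    """Score sentences by word rarity and closeness to a 20-word length."""
    tokenized = [s.lower().split() for s in sentences]
    word_freq = Counter(w for words in tokenized for w in words)
    total_words = sum(word_freq.values())

    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        # Rarity: mean inverse relative frequency, smoothed by 0.01.
        rarity = sum(1 / (word_freq[w] / total_words + 0.01) for w in words) / len(words)
        # Length: peaks at 20 words, floored at 0.1 so long sentences are not zeroed.
        length = max(0.1, 1.0 - abs(len(words) - 20) / 100)
        scores.append(rarity * length)
    return scores


def apply_position_bonus(scores: list[float], bonus: float = 1.2) -> list[float]:
    """Boost the first and last sentence of one paragraph.

    Hypothetical helper: the 1.2 factor is an assumption, not a library value.
    """
    boosted = list(scores)
    if boosted:
        boosted[0] *= bonus
        if len(boosted) > 1:
            boosted[-1] *= bonus
    return boosted
```

Calling `score_sentences` on the sentences of one paragraph and then `apply_position_bonus` on the result gives a ranking you can cut at whatever token budget you need.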

When to Use It

Use it for RAG (retrieval-augmented generation) pipelines where retrieved chunks may contain more text than necessary. Compress each chunk before assembling the context window. The compression ratio improves context utilization without requiring smaller chunk sizes.

Use it for user-uploaded document processing. Users upload PDFs, paste articles, and submit long forms. Compressing each document before injecting it into the prompt can save significant cost on a high-volume system.

Use it as a pre-filter before semantic compression. For very long documents, apply heuristic compression first (cheap, fast, 60-70% reduction), then apply semantic compression on the smaller result (expensive, accurate). Two-stage compression is cheaper than running semantic compression on the full document, because the expensive stage only sees the already-reduced text.
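A two-stage pipeline might be wired up roughly like this. Everything here is a sketch: `two_stage_compress` is a hypothetical helper, the compressors are passed in as plain callables, and skipping the semantic stage once the text fits the budget is a design assumption rather than something the library prescribes.

```python
from typing import Callable

Compressor = Callable[[str], str]


def two_stage_compress(
    text: str,
    heuristic: Compressor,
    semantic: Compressor,
    estimate_tokens: Callable[[str], int],
    target_tokens: int,
) -> str:
    """Run the cheap pass first; call the expensive pass only if still over budget."""
    if estimate_tokens(text) <= target_tokens:
        return text  # already fits, nothing to do
    reduced = heuristic(text)
    if estimate_tokens(reduced) <= target_tokens:
        return reduced  # the heuristic pass was enough
    return semantic(reduced)  # the semantic pass only sees the smaller input
```

In practice `heuristic` could be the `compress.compress` method shown later in this post, and `semantic` whatever model-backed summarizer you already use.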

Skip it for short, curated prompts. If your system prompt is already tightly written, applying compression may remove important context. Compression is for untrusted, variable-length content (user input, retrieved documents), not your crafted instructions.

Install

pip install git+https://github.com/MukundaKatta/llm-prompt-compress
# Or from PyPI
pip install llm-prompt-compress

import anthropic
import structlog
from llm_prompt_compress import PromptCompress

# Assumed setup: the keyword-style logger.debug call below matches structlog,
# and the client reads ANTHROPIC_API_KEY from the environment.
logger = structlog.get_logger()
anthropic_client = anthropic.Anthropic()

compress = PromptCompress(
    target_tokens=6000,
    strategy="balanced",
    preserve_code=True,  # Never compress code blocks
    preserve_json=True,  # Never compress JSON structures
)


def build_rag_context(chunks: list[str]) -> str:
    compressed_chunks = []
    for chunk in chunks:
        original_tokens = compress.estimate_tokens(chunk)
        if original_tokens > 500:  # Only compress large chunks
            compressed = compress.compress(chunk)
            ratio = compress.ratio(chunk, compressed)
            logger.debug("chunk_compressed", original=original_tokens, ratio=f"{ratio:.0%}")
            compressed_chunks.append(compressed)
        else:
            compressed_chunks.append(chunk)
    # Separate chunks with a visible divider so the model can tell them apart
    return "\n\n---\n\n".join(compressed_chunks)


def answer_with_rag(question: str, docs: list[str]) -> str:
    context = build_rag_context(docs)
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "user",
            "content": f"Documents:\n\n{context}\n\nQuestion: {question}",
        }],
        max_tokens=1024,
    )
    return response.content[0].text

Sibling Libraries

Library	What it solves
`llm-token-split`	Split documents into chunks before compression
`prompt-token-counter`	Count tokens to know when compression is needed
`agent-context-builder`	Section-based prompt assembly with per-section budgets
`agent-message-window`	Trim message list to fit context
`prompt-cache-warmer`	Cache the compressed prompt prefix for cheaper reruns

The context optimization stack: prompt-token-counter to measure, llm-token-split to chunk, llm-prompt-compress to reduce, agent-message-window to trim, prompt-cache-warmer to cache.

What's Next

Query-aware compression: accept the user's question alongside the document and weight sentences that contain terms from the question more highly. This is a lightweight semantic signal without requiring a full embedding model.
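One way this could work, as a rough sketch: take whatever base scores the heuristic produces and scale them by how many question terms each sentence contains. `query_boost`, the `weight` parameter, and the three-character minimum for terms are all assumptions for illustration, not part of the library.

```python
import re


def query_boost(
    sentences: list[str],
    base_scores: list[float],
    question: str,
    weight: float = 0.5,
) -> list[float]:
    """Scale each base score by the share of question terms the sentence contains."""
    # Keep terms of 3+ characters as a crude stand-in for stop-word filtering.
    terms = {w for w in re.findall(r"[a-z0-9]+", question.lower()) if len(w) > 2}
    boosted = []
    for sentence, score in zip(sentences, base_scores):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        overlap = len(terms & words) / len(terms) if terms else 0.0
        boosted.append(score * (1.0 + weight * overlap))
    return boosted
```

With `weight=0.5`, a sentence containing every question term gets a 50% boost and a sentence containing none keeps its base score.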

Extractive summary mode: instead of removing sentences, produce an extractive summary (the K most important sentences in their original order) as the compressed output. This produces a shorter but coherent version rather than a gapped text.
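The selection step is small enough to sketch here. `extractive_summary` is a hypothetical function name; it assumes you already have one score per sentence from the heuristic.

```python
def extractive_summary(sentences: list[str], scores: list[float], k: int) -> list[str]:
    """Return the k highest-scoring sentences, kept in their original order."""
    if k <= 0:
        return []
    # Rank indices by score; the sort is stable, so ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by position so the summary reads in document order.
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]
```

Sorting the selected indices back into document order is what keeps the output readable as a summary rather than a ranked list.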

Benchmark mode: compress.benchmark(text, questions) that compresses the text and then evaluates how many of the questions can be correctly answered from the compressed version. Provides a quality metric for tuning the compression strategy.
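Until that lands, a crude proxy can be built by hand. The sketch below is not the planned `compress.benchmark` API: `benchmark_retention` is a hypothetical helper that takes question/answer pairs and checks only whether each expected answer string survives compression, which is a much weaker test than asking a model to answer from the compressed text.

```python
from typing import Callable


def benchmark_retention(
    text: str,
    compress_fn: Callable[[str], str],
    qa_pairs: list[tuple[str, str]],
) -> float:
    """Fraction of expected answers still present verbatim after compression."""
    if not qa_pairs:
        return 1.0  # nothing to check, so nothing was lost
    compressed = compress_fn(text).lower()
    # Substring match is a cheap proxy for "still answerable".
    hits = sum(1 for _question, answer in qa_pairs if answer.lower() in compressed)
    return hits / len(qa_pairs)
```

A score that drops sharply as you lower the target token count is a sign the compression setting is too aggressive for that kind of document.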

Built as part of the agent-stack family: composable Python primitives for production LLM agents.