The scoring heuristic: sentences with less common words score higher (they are more content-dense). Very short sentences (fragments) and very long sentences (run-ons) are slightly penalized. First and last sentences of paragraphs receive a position bonus.
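The heuristic above can be sketched in a few lines. This is a minimal illustration, not the library's actual implementation: the names `split_sentences` and `score_sentences`, the length thresholds (5 and 40 words), and the 0.9 / 1.1 weights are all assumptions chosen for the example. Word rarity is approximated from frequencies inside the paragraph itself rather than from a reference corpus.

```python
import math
import re
from collections import Counter


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def score_sentences(paragraph: str) -> list[tuple[str, float]]:
    """Return (sentence, score) pairs in original order. Higher means keep."""
    sentences = split_sentences(paragraph)
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    counts = Counter(word for words in tokenized for word in words)
    total = sum(counts.values())

    scored = []
    for i, (sentence, words) in enumerate(zip(sentences, tokenized)):
        if not words:
            scored.append((sentence, 0.0))
            continue
        # Rarity: mean negative log frequency, so rarer words raise the score.
        score = sum(-math.log(counts[w] / total) for w in words) / len(words)
        # Assumed thresholds: slightly penalize fragments and run-ons.
        if len(words) < 5 or len(words) > 40:
            score *= 0.9
        # Assumed bonus: first and last sentences of the paragraph.
        if i == 0 or i == len(sentences) - 1:
            score *= 1.1
        scored.append((sentence, score))
    return scored
```

A compressor built on this would drop the lowest-scoring sentences until the text fits its token target.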
When to Use It
Use it for RAG (retrieval-augmented generation) pipelines where retrieved chunks may contain more text than necessary. Compress each chunk before assembling the context window. Compression improves context utilization without requiring smaller chunk sizes.
Use it for user-uploaded document processing. Users upload PDFs, paste articles, and submit long forms. Compressing the document before injecting it into the prompt cuts input tokens on every request, which adds up on a high-volume system.
Use it as a pre-filter before semantic compression. For very long documents, apply heuristic compression first (cheap, fast, 60-70% reduction), then apply semantic compression on the smaller result (expensive, accurate). Two-stage compression is cheaper than one-stage semantic compression from scratch.
Skip it for short, curated prompts. If your system prompt is already tightly written, applying compression may remove important context. Compression is for untrusted, variable-length content (user input, retrieved documents), not your crafted instructions.
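The two-stage pre-filter idea above can be sketched as a small wrapper that takes any two compressors. This is an illustration under assumptions: `two_stage_compress`, `TwoStageResult`, `keep_long_lines`, and `first_two_lines` are hypothetical names, and the two stand-in functions are toy placeholders, not real heuristic or semantic compressors.

```python
from dataclasses import dataclass
from typing import Callable

Compressor = Callable[[str], str]


@dataclass
class TwoStageResult:
    text: str
    original_chars: int
    after_heuristic_chars: int  # how much text the expensive stage had to read
    final_chars: int


def two_stage_compress(
    text: str, heuristic: Compressor, semantic: Compressor
) -> TwoStageResult:
    """Run the cheap stage first so the expensive stage sees less text."""
    reduced = heuristic(text)
    final = semantic(reduced)
    return TwoStageResult(final, len(text), len(reduced), len(final))


# Toy stand-in for the cheap stage: drop lines with fewer than four words.
def keep_long_lines(text: str) -> str:
    return "\n".join(line for line in text.splitlines() if len(line.split()) >= 4)


# Toy stand-in for the expensive stage: keep only the first two lines.
def first_two_lines(text: str) -> str:
    return "\n".join(text.splitlines()[:2])
```

In a real pipeline the `heuristic` argument would be something like the `compress.compress` method shown in this post, and `semantic` would be a call to a model-based summarizer. The size fields make it easy to log how much work the cheap stage saved the expensive one.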
Install
```shell
pip install git+https://github.com/MukundaKatta/llm-prompt-compress

# Or from PyPI
pip install llm-prompt-compress
```
```python
import anthropic
import structlog

from llm_prompt_compress import PromptCompress

# Assumption: the keyword-style logger.debug call below matches structlog.
logger = structlog.get_logger()
# Reads ANTHROPIC_API_KEY from the environment.
anthropic_client = anthropic.Anthropic()

compress = PromptCompress(
    target_tokens=6000,
    strategy="balanced",
    preserve_code=True,  # Never compress code blocks
    preserve_json=True,  # Never compress JSON structures
)


def build_rag_context(chunks: list[str]) -> str:
    """Compress large chunks, pass small ones through, and join them."""
    compressed_chunks = []
    for chunk in chunks:
        original_tokens = compress.estimate_tokens(chunk)
        if original_tokens > 500:  # Only compress large chunks
            compressed = compress.compress(chunk)
            ratio = compress.ratio(chunk, compressed)
            logger.debug("chunk_compressed", original=original_tokens, ratio=f"{ratio:.0%}")
            compressed_chunks.append(compressed)
        else:
            compressed_chunks.append(chunk)
    return "\n\n---\n\n".join(compressed_chunks)


def answer_with_rag(question: str, docs: list[str]) -> str:
    """Build a compressed context from docs and ask the model the question."""
    context = build_rag_context(docs)
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        messages=[{
            "role": "user",
            "content": f"Documents:\n\n{context}\n\nQuestion: {question}",
        }],
        max_tokens=1024,
    )
    return response.content[0].text
```
Sibling Libraries
| Library | What it solves |
| --- | --- |
| llm-token-split | Split documents into chunks before compression |
| prompt-token-counter | Count tokens to know when compression is needed |
| agent-context-builder | Section-based prompt assembly with per-section budgets |
| agent-message-window | Trim message list to fit context |
| prompt-cache-warmer | Cache the compressed prompt prefix for cheaper reruns |
The context optimization stack: prompt-token-counter to measure, llm-token-split to chunk, llm-prompt-compress to reduce, agent-message-window to trim, prompt-cache-warmer to cache.
What's Next
Query-aware compression: accept the user's question alongside the document and weight sentences that contain terms from the question more highly. This is a lightweight semantic signal without requiring a full embedding model.
Extractive summary mode: instead of removing sentences, produce an extractive summary (the K most important sentences in their original order) as the compressed output. This produces a shorter but coherent version rather than a gapped text.
Benchmark mode: compress.benchmark(text, questions) that compresses the text and then evaluates how many of the questions can be correctly answered from the compressed version. Provides a quality metric for tuning the compression strategy.
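The first two roadmap items (query-aware weighting and extractive summary mode) can be sketched together. This is a rough illustration of the idea, not a planned API: `query_aware_extract`, the tiny `STOPWORDS` set, and scoring by raw term overlap are all assumptions made for the example.

```python
import re

# Assumed minimal stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "does", "do", "how", "what",
             "of", "in", "on", "for", "to"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def query_aware_extract(document: str, question: str, k: int = 3) -> str:
    """Keep the k sentences sharing the most terms with the question,
    emitted in their original order."""
    sentences = split_sentences(document)
    query_terms = set(re.findall(r"[a-z0-9]+", question.lower())) - STOPWORDS

    scored = []
    for idx, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        scored.append((len(words & query_terms), idx))

    # Highest overlap first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    keep = sorted(idx for _, idx in top)
    return " ".join(sentences[i] for i in keep)
```

Sorting the kept indices before joining is what makes the output an extractive summary in original order rather than a ranked list of sentences.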
Built as part of the agent-stack family: composable Python primitives for production LLM agents.