Copied to Clipboard
Sibling Libraries
| Library |
What it solves |
llm-cache-mem |
Cross-time deduplication (LRU+TTL for non-concurrent identical calls) |
agentidemp-py |
Request-level idempotency for non-concurrent duplicate agent runs |
llm-rate-limit-bucket |
Token-bucket rate limiter for outbound LLM calls |
llm-retry |
Exponential backoff retry when calls fail |
token-budget-pool |
Thread-safe concurrent token/USD budget tracking |
The concurrency stack: llm-batch-coalesce for in-flight dedup, llm-cache-mem for cross-time dedup, llm-rate-limit-bucket for rate limiting, llm-retry for failure recovery.
What's Next
Waiter count metrics: coalesce.stats() returning how many times a request was coalesced (waiter count per key), how much the coalescing saved in estimated API cost, and the current in-flight count. Makes the value of the coalescer visible in dashboards.
Configurable key function: LLMCoalesce(key_fn=...) for cases where you want to coalesce on a subset of the kwargs (e.g., coalesce on messages content only, ignoring model differences). Some use cases allow sharing responses across model variants.
Sync version: SyncLLMCoalesce using threading.Event instead of asyncio.Future for sync contexts. The async version requires an event loop; the sync version would work in traditional multi-threaded WSGI apps.
Built as part of the agent-stack family: composable Python primitives for production LLM agents.