llm-budget-window: Per-Minute and Per-Hour Token Caps That Actually Work

DEV Community

Install

[dependencies]
llm-budget-window = "0.1"

crates.io: llm-budget-window
GitHub: MukundaKatta/llm-budget-window

Siblings

Crate / Package	What it does
token-budget-pool	Thread-safe total session cap, not time-windowed; pair it with this crate for belt-and-suspenders limits
token-budget-py	Python port of token-budget-pool
llm-cost-cap	Pre-flight cost gate: estimates cost before the call and rejects if over budget
llm-circuit-breaker	Open/closed/half-open circuit breaker; stops calling a provider that keeps failing
claude-cost	Compute per-call USD cost for Anthropic API calls, including cache read rates

What is next

The most requested feature is a backpressure mode: instead of rejecting the call, it would block (or return a future that resolves at the reset time) until the window clears. Right now, record returns Err immediately when a window is exceeded, and the caller decides whether to sleep, abort, or queue. A built-in async record_or_wait function would cut the boilerplate from the common "wait and retry" pattern.
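Until something like that ships, the wait-and-retry loop can be written by hand. The sketch below is only an illustration under stated assumptions: FixedWindow, Exceeded, Clock, and this blocking record_or_wait are hypothetical stand-ins written for this example, not the crate's actual types or signatures. It assumes the error carries the time remaining until the window resets, and it blocks a thread rather than returning a future.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one budget window (NOT the crate's real type).
/// Time is a Duration since an arbitrary start so a fake clock can drive it.
struct FixedWindow {
    cap: u64,
    len: Duration,
    window_start: Duration,
    used: u64,
}

/// Hypothetical error carrying the time left until the window resets.
#[derive(Debug, PartialEq)]
struct Exceeded {
    retry_after: Duration,
}

impl FixedWindow {
    fn new(cap: u64, len: Duration) -> Self {
        Self { cap, len, window_start: Duration::ZERO, used: 0 }
    }

    /// Fail-fast: returns Err immediately when the cap would be exceeded.
    fn record(&mut self, tokens: u64, now: Duration) -> Result<(), Exceeded> {
        if now >= self.window_start + self.len {
            // The previous window has elapsed; start a fresh one.
            self.window_start = now;
            self.used = 0;
        }
        if self.used + tokens > self.cap {
            let retry_after = self.window_start + self.len - now;
            return Err(Exceeded { retry_after });
        }
        self.used += tokens;
        Ok(())
    }
}

/// Clock abstraction so the wait can be tested without real sleeping.
trait Clock {
    fn now(&self) -> Duration;
    fn sleep(&mut self, d: Duration);
}

/// Real clock: monotonic time plus a blocking thread sleep.
struct SystemClock {
    start: Instant,
}

impl Clock for SystemClock {
    fn now(&self) -> Duration {
        self.start.elapsed()
    }
    fn sleep(&mut self, d: Duration) {
        std::thread::sleep(d);
    }
}

/// Test clock: sleeping simply advances the stored time.
struct FakeClock {
    t: Duration,
}

impl Clock for FakeClock {
    fn now(&self) -> Duration {
        self.t
    }
    fn sleep(&mut self, d: Duration) {
        self.t += d;
    }
}

/// Blocking backpressure: retry `record` after each reset until it succeeds.
/// Returns the total time waited, or None if `tokens` can never fit the cap.
fn record_or_wait<C: Clock>(w: &mut FixedWindow, tokens: u64, clock: &mut C) -> Option<Duration> {
    if tokens > w.cap {
        return None; // waiting would never help
    }
    let mut waited = Duration::ZERO;
    loop {
        match w.record(tokens, clock.now()) {
            Ok(()) => return Some(waited),
            Err(e) => {
                clock.sleep(e.retry_after);
                waited += e.retry_after;
            }
        }
    }
}
```

Passing SystemClock { start: Instant::now() } makes the same function block the calling thread for real. An async variant would presumably await a runtime timer at the point where this sketch calls sleep.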

Per-model windows are also on the list. Today a single budget window covers all models combined. If you run several models at once and want a separate per-minute cap for each (to match the provider's per-model rate limits), you need one BudgetWindow instance per model. A BudgetWindowMap that routes by model ID would make that cleaner.
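As a rough sketch of what routing by model ID could look like, the example below keeps one independent window per model in a HashMap. Everything in it is an assumption made for illustration: StandInWindow is a simplified per-minute window written for this example (not the crate's BudgetWindow), and the BudgetWindowMap shown here, with its add_model and record methods, RecordError enum, and placeholder model IDs, is hypothetical since the crate does not ship such a type yet.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for a per-minute window (NOT the crate's BudgetWindow).
struct StandInWindow {
    cap: u64,
    window_start: Duration,
    used: u64,
}

#[derive(Debug, PartialEq)]
enum RecordError {
    UnknownModel,
    Exceeded,
}

impl StandInWindow {
    const LEN: Duration = Duration::from_secs(60);

    fn record(&mut self, tokens: u64, now: Duration) -> Result<(), RecordError> {
        if now >= self.window_start + Self::LEN {
            // The minute has elapsed; start a fresh window.
            self.window_start = now;
            self.used = 0;
        }
        if self.used + tokens > self.cap {
            return Err(RecordError::Exceeded);
        }
        self.used += tokens;
        Ok(())
    }
}

/// Hypothetical router: each model ID gets its own independent window.
struct BudgetWindowMap {
    windows: HashMap<String, StandInWindow>,
}

impl BudgetWindowMap {
    fn new() -> Self {
        Self { windows: HashMap::new() }
    }

    /// Register a model with its own per-minute token cap.
    fn add_model(&mut self, model_id: &str, cap_per_minute: u64) {
        let window = StandInWindow { cap: cap_per_minute, window_start: Duration::ZERO, used: 0 };
        self.windows.insert(model_id.to_string(), window);
    }

    /// Route the usage to the window that belongs to `model_id`.
    fn record(&mut self, model_id: &str, tokens: u64, now: Duration) -> Result<(), RecordError> {
        match self.windows.get_mut(model_id) {
            Some(window) => window.record(tokens, now),
            None => Err(RecordError::UnknownModel),
        }
    }
}
```

Rejecting unknown model IDs is one possible design choice; falling back to a shared default window would be another. Because each model has its own window, exhausting one model's cap leaves the others unaffected.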

Part of the Hermes Agent Challenge sprint. All crates shipped on crates.io.