from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

# Build requests
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "model": "claude-sonnet-4-6",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": f"Summarize in one sentence: {doc}"}
        ]
    })

# Submit batch
batch_id = kit.submit(requests)
print(f"Submitted batch {batch_id}")
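Because retrieval runs in a separate process, the batch ID has to survive between the two scripts. Here is a minimal sketch of one way to do that, assuming a local JSON file is acceptable for your setup. The file name and both helper functions are hypothetical and are not part of anthropic-batch-kit.

```python
import json
from pathlib import Path

# Hypothetical location for the state file; use whatever suits your job runner.
STATE_FILE = Path("batch_state.json")


def save_batch_id(batch_id: str, path: Path = STATE_FILE) -> None:
    """Write the batch ID to disk so a later process can find it."""
    path.write_text(json.dumps({"batch_id": batch_id}))


def load_batch_id(path: Path = STATE_FILE) -> str:
    """Read back the batch ID written by save_batch_id."""
    return json.loads(path.read_text())["batch_id"]
```

Call save_batch_id(batch_id) right after submitting and load_batch_id() at the top of the retrieval script. A database row or a queue message would work just as well if you already have one.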
Then, separately (in a cron job, a scheduled task, or a later script run):
from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

# This is a separate process, so batch_id is not defined yet.
# Use the ID printed by the submit script (placeholder shown here).
batch_id = "your-batch-id"

# Poll until complete (blocks with sleep intervals)
results = kit.poll_and_retrieve(batch_id, poll_interval_seconds=60)

# results is a dict keyed by custom_id
for doc_id, result in results.items():
    if result["type"] == "succeeded":
        print(f"{doc_id}: {result['content']}")
    else:
        print(f"{doc_id}: FAILED - {result['error']}")
The poll_and_retrieve method handles the polling loop internally. It checks the batch status every N seconds and returns when the batch reaches a terminal state (succeeded, errored, or expired).
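To make the polling behaviour concrete, here is a rough sketch of what such a loop looks like. This is not the library's actual implementation. The get_status callable, the max_polls cap, and the "in_progress" state name used in the usage note are assumptions for illustration; only the three terminal states come from the description above.

```python
import time
from typing import Callable

# Terminal states as described in the text above.
TERMINAL_STATES = {"succeeded", "errored", "expired"}


def poll_until_terminal(
    get_status: Callable[[], str],
    poll_interval_seconds: float = 60,
    max_polls: int = 1500,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state, sleeping in between.

    get_status is a hypothetical callable that fetches the current batch status.
    max_polls stops the loop from running forever if the status never settles.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_interval_seconds)
    raise TimeoutError(f"batch not finished after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop testable: a test can supply a fake that records the intervals instead of actually waiting. With a status source that yields "in_progress" twice and then "succeeded", the function sleeps twice and returns "succeeded".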
Cost Calculation
Here is how to think about the cost difference for a typical eval run.
Assume 500 prompts, each with 1,200 input tokens and 400 output tokens. That is 600,000 input tokens and 200,000 output tokens per run.
At standard Sonnet pricing ($3.00 per million input tokens, $15.00 per million output tokens): $1.80 input + $3.00 output = $4.80 total.
At batch pricing (50% off): $2.40 total. That is a $2.40 saving per run.
Run that eval twice a day and you save $1,752 per year on eval costs alone. The saving scales linearly with token volume, so it adds up quickly for teams running large eval suites.
def estimate_batch_cost(
    num_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-6"
) -> dict:
    """Estimate standard vs. batch cost in USD for a set of requests."""
    # Standard pricing per million tokens
    pricing = {
        "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    }
    # Models not in the table fall back to the Sonnet prices
    p = pricing.get(model, {"input": 3.00, "output": 15.00})

    total_input = num_requests * avg_input_tokens
    total_output = num_requests * avg_output_tokens

    standard_cost = (total_input / 1_000_000 * p["input"] +
                     total_output / 1_000_000 * p["output"])
    # Batch pricing is 50% of standard
    batch_cost = standard_cost * 0.50

    return {
        "standard_usd": round(standard_cost, 4),
        "batch_usd": round(batch_cost, 4),
        "savings_usd": round(standard_cost - batch_cost, 4),
    }
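As a sanity check, the numbers from the 500-prompt example can be reproduced step by step, including the yearly figure. This sketch uses the same per-million-token prices as above; the variable names are just for illustration.

```python
# Prices in USD per million tokens, same as in the example above.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
BATCH_DISCOUNT = 0.50

prompts = 500
input_tokens = prompts * 1_200   # 600,000 tokens
output_tokens = prompts * 400    # 200,000 tokens

input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
standard_cost = input_cost + output_cost
batch_cost = standard_cost * (1 - BATCH_DISCOUNT)
saving_per_run = standard_cost - batch_cost

# Two runs a day, every day of the year.
annual_saving = saving_per_run * 2 * 365

print(f"standard: ${standard_cost:.2f}")      # standard: $4.80
print(f"batch:    ${batch_cost:.2f}")         # batch:    $2.40
print(f"per year: ${annual_saving:,.2f}")     # per year: $1,752.00
```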
Comparing With llm-batch-coalesce
The two are easy to confuse because both have "batch" in the name, but they solve different problems.
llm-batch-coalesce is about single-flighting concurrent requests to the same prompt. If ten different parts of your code call the model with the same prompt at the same time, llm-batch-coalesce detects the duplicate and makes one API call, sharing the result with all ten callers. This is a synchronous optimization, not the Batch API.
The Anthropic Message Batches API is about submitting many independent requests in one HTTP call and getting results back asynchronously. Different mechanism, different use case.
Use llm-batch-coalesce when you have concurrent code making duplicate synchronous calls. Use the Batch API when you have independent work that does not need to complete within seconds.
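If single-flighting is unfamiliar, the idea fits in a few lines of asyncio. This is a generic sketch of the technique, not llm-batch-coalesce's actual code or API. The SingleFlight class and the fake model call are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict


class SingleFlight:
    """Share one in-flight call among all concurrent callers using the same key."""

    def __init__(self) -> None:
        self._inflight: Dict[str, asyncio.Task] = {}

    async def do(self, key: str, fn: Callable[[], Coroutine[Any, Any, str]]) -> str:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key: start the real call and register it.
            task = asyncio.create_task(fn())
            self._inflight[key] = task
            # Forget the key once the call finishes so later calls start fresh.
            task.add_done_callback(lambda _done: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared call.
        return await asyncio.shield(task)


async def demo() -> None:
    calls = []

    async def fake_model_call() -> str:
        calls.append(1)            # count underlying "API" calls
        await asyncio.sleep(0.01)  # stand-in for network latency
        return "4"

    flight = SingleFlight()
    answers = await asyncio.gather(
        *(flight.do("What is 2+2?", fake_model_call) for _ in range(10))
    )
    print(len(answers), len(calls))  # 10 1


if __name__ == "__main__":
    asyncio.run(demo())
```

Ten callers get an answer but only one underlying call is made. Note that nothing here is deferred or discounted: every caller still waits synchronously for the result, which is the key difference from the Batch API.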
| Feature | llm-batch-coalesce | Anthropic Batch API |
|---|---|---|
| Latency | Synchronous (seconds) | Async (up to 24h) |
| Use case | Dedup concurrent calls | High-volume offline work |
| Cost benefit | No discount | 50% discount |
| Max requests | N/A (per-request) | 10,000 per batch |
| Streaming | Supported | Not supported |
| Result order | Immediate | Batch completion |
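If your workload exceeds the per-batch request limit, you need to split it into several batches before submitting. Here is a small sketch, assuming the 10,000-request figure from the table as the default. The chunk_requests helper is hypothetical, and the limit is a parameter so you can adjust it to whatever the current API documentation states.

```python
from typing import Iterator, List


def chunk_requests(requests: List[dict], max_per_batch: int = 10_000) -> Iterator[List[dict]]:
    """Yield consecutive slices of requests, each at most max_per_batch long."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]
```

Each chunk can then be submitted as its own batch. Keep custom_id values unique across all chunks so the results can be merged back into one dict afterwards.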
Tradeoffs to Know
Debugging is harder. When a synchronous call fails, you find out immediately. When a batch request fails, you find out when the batch completes. If you submitted 1,000 requests and 50 failed, you need to identify which ones and why. Build retry logic to handle partial batch failures.
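One way to handle partial failures is to rebuild a request list containing only the items that did not succeed. This sketch assumes the results dict has the shape shown earlier (keyed by custom_id, each value carrying a "type" field). The build_retry_requests helper is hypothetical and not part of anthropic-batch-kit.

```python
from typing import Dict, List


def build_retry_requests(original_requests: List[dict], results: Dict[str, dict]) -> List[dict]:
    """Return the original requests whose result is missing or not 'succeeded'."""
    retry = []
    for request in original_requests:
        result = results.get(request["custom_id"])
        if result is None or result["type"] != "succeeded":
            retry.append(request)
    return retry
```

The returned list can be submitted as a new batch. Before retrying blindly, look at the error payloads: a malformed request will fail again however many times you resubmit it, so cap the number of retry rounds.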
No SLA below 24 hours. Anthropic processes batches on a best-effort basis. Most small batches complete within an hour, but the only firm bound is the 24-hour processing window, after which unfinished requests expire. Do not use the Batch API when you need a result within the hour.
Results need parsing. The Batch API returns JSONL, not a simple list. Each line is a JSON object with a custom_id, a result type, and either content or an error. anthropic-batch-kit handles this parsing for you, but if you are rolling your own client, budget time for it.
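If you are rolling your own client, the parsing step might look roughly like this. The sketch assumes each JSONL line has a custom_id and a nested result object with a type plus either a message (holding a list of content blocks) or an error. That line shape is an assumption here, so verify it against the API reference before relying on it. The parse_batch_results function is a hypothetical name.

```python
import json
from typing import Dict


def parse_batch_results(jsonl_text: str) -> Dict[str, dict]:
    """Turn raw JSONL into a dict keyed by custom_id (line shape is assumed)."""
    results: Dict[str, dict] = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        result = record["result"]
        entry = {"type": result["type"]}
        if result["type"] == "succeeded":
            # Join the text of all text blocks into a single string.
            blocks = result["message"]["content"]
            entry["content"] = "".join(
                block["text"] for block in blocks if block.get("type") == "text"
            )
        else:
            entry["error"] = result.get("error")
        results[record["custom_id"]] = entry
    return results
```

The output mirrors the dict shape used in the retrieval example earlier, so downstream code does not need to care whether the results came from anthropic-batch-kit or from a hand-rolled parser.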
Context windows still apply. Each request in the batch is still subject to the model's context window limit. You cannot use the Batch API to send a larger context than the model supports.
Quick Start
pip install anthropic-batch-kit
from anthropic_batch_kit import BatchKit

kit = BatchKit()  # reads ANTHROPIC_API_KEY from env

# Submit
batch_id = kit.submit([
    {
        "custom_id": "item-1",
        "model": "claude-sonnet-4-6",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "What is 2+2?"}]
    }
])

# Later: retrieve
results = kit.poll_and_retrieve(batch_id)
print(results["item-1"]["content"])
Related Tools
| Tool | Purpose |
|---|---|
| anthropic-batch-kit | Submit, poll, retrieve Anthropic batches |
| llm-batch-coalesce | Single-flight dedup for concurrent sync calls |
| llm-cost-cap | Pre-flight USD gate for synchronous calls |
| agenttrace | Per-run cost tracking for synchronous agents |
| llm-fallback-router | Provider failover for synchronous calls |
What Is Next
The Batch API is one of the more underused cost levers available. Most teams default to synchronous calls everywhere even when the use case is async-friendly. If you have eval runs or bulk processing jobs, measure your current monthly spend, apply the 50% discount, and decide if the async complexity is worth it.
Source and examples are at MukundaKatta/anthropic-batch-kit on GitHub.