When and How to Use the Anthropic Batch API in Your Agent

DEV Community

```python
from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

# `documents` is your list of input texts (defined elsewhere)
# Build one request per document; custom_id lets you match results later
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "model": "claude-sonnet-4-6",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": f"Summarize in one sentence: {doc}"}
        ],
    })

# Submit the batch and keep the ID; you need it to fetch results later
batch_id = kit.submit(requests)
print(f"Submitted batch {batch_id}")
```
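
Because the retrieval step runs in a separate process, the batch ID has to survive between runs. Below is a minimal sketch, assuming a local JSON file is an acceptable store and that only one process writes it at a time (there is no locking). The file name and the `save_batch_id`, `load_pending_batch_ids`, and `mark_done` helpers are hypothetical and not part of anthropic-batch-kit.

```python
import json
from pathlib import Path

# Hypothetical state file shared by the submit script and the later job
STATE_FILE = Path("pending_batches.json")


def load_pending_batch_ids(state_file: Path = STATE_FILE) -> list[str]:
    """Return batch IDs that were submitted but not yet handled."""
    if not state_file.exists():
        return []
    return json.loads(state_file.read_text())


def save_batch_id(batch_id: str, state_file: Path = STATE_FILE) -> None:
    """Record a newly submitted batch ID (no-op if already recorded)."""
    ids = load_pending_batch_ids(state_file)
    if batch_id not in ids:
        ids.append(batch_id)
        state_file.write_text(json.dumps(ids))


def mark_done(batch_id: str, state_file: Path = STATE_FILE) -> None:
    """Drop a batch ID once its results have been processed."""
    ids = [b for b in load_pending_batch_ids(state_file) if b != batch_id]
    state_file.write_text(json.dumps(ids))
```

In the submit script, call `save_batch_id(batch_id)` right after `kit.submit(...)`; the scheduled job then loops over `load_pending_batch_ids()`, retrieves each batch, and calls `mark_done` once the results are stored. If the two jobs run on different machines, the same pattern works with a database table or object store instead of a local file.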

Then, separately (in a cron job, a scheduled task, or a later script run):

```python
from anthropic_batch_kit import BatchKit

kit = BatchKit(api_key="your-key")

# batch_id is the ID returned by kit.submit() in the earlier run
# Poll until complete (blocks with sleep intervals)
results = kit.poll_and_retrieve(batch_id, poll_interval_seconds=60)

# results is a dict keyed by custom_id
for doc_id, result in results.items():
    if result["type"] == "succeeded":
        print(f"{doc_id}: {result['content']}")
    else:
        # Any non-success result; fall back to the type if there is no error detail
        print(f"{doc_id}: FAILED - {result.get('error', result['type'])}")
```

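The poll_and_retrieve method handles the polling loop internally. It checks the batch status every poll_interval_seconds seconds and returns once the batch has finished processing (in the Message Batches API, the batch's processing_status becomes "ended"). A finished batch can still contain failures: each individual result has a type of succeeded, errored, canceled, or expired, which is why the loop above checks result["type"].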
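
The general shape of such a polling loop can be sketched without any SDK. This is not anthropic-batch-kit's actual implementation: it is a minimal sketch that takes a caller-supplied `get_status` function and adds a timeout so a stuck job cannot block forever. `poll_until_ended` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def poll_until_ended(
    get_status: Callable[[], str],
    poll_interval_seconds: float = 60,
    timeout_seconds: float = 25 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call get_status() until it returns "ended"; return how many checks it took.

    timeout_seconds defaults to a little over 24 hours, because a batch should
    end (with unprocessed requests expired) by then. The wait is tracked by
    summing sleep intervals, so it slightly undercounts real elapsed time.
    Raises TimeoutError if the batch has not ended within the timeout.
    """
    waited = 0.0
    checks = 0
    while True:
        checks += 1
        status = get_status()
        if status == "ended":
            return checks
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch still {status!r} after ~{waited:.0f}s")
        sleep(poll_interval_seconds)
        waited += poll_interval_seconds
```

With the official `anthropic` Python SDK, `get_status` could be `lambda: client.messages.batches.retrieve(batch_id).processing_status`. Injecting `sleep` as a parameter lets tests run the loop instantly.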

Cost Calculation

Here is how to think about the cost difference for a typical eval run.

Assume 500 prompts, each with 1,200 input tokens and 400 output tokens.

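At standard Sonnet pricing (3ドル per million input tokens, 15ドル per million output tokens), the run's 600,000 input tokens cost 1ドル.80 and its 200,000 output tokens cost 3ドル.00, for 4ドル.80 total.
At batch pricing (50% off), the same run costs 2ドル.40. That is a 2ドル.40 saving per run.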

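Run that eval twice a day and you save 1,752ドル per year (2ドル.40 × 2 runs × 365 days) on eval costs alone. Because the discount is a flat 50%, the savings grow in direct proportion to volume, so teams running large eval suites save correspondingly more.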
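
The arithmetic above can be checked in a few lines. This is a minimal sketch using `Decimal` so the dollar amounts stay exact; the prices are the Sonnet rates used in this section (3ドル and 15ドル per million input and output tokens), the discount is the flat 50% batch discount, and the twice-daily schedule is the one from the example.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)
INPUT_PRICE = Decimal("3.00")    # USD per million input tokens (standard)
OUTPUT_PRICE = Decimal("15.00")  # USD per million output tokens (standard)
BATCH_DISCOUNT = Decimal("0.50")

prompts, input_tokens, output_tokens = 500, 1_200, 400

input_cost = prompts * input_tokens / MILLION * INPUT_PRICE
output_cost = prompts * output_tokens / MILLION * OUTPUT_PRICE
standard = input_cost + output_cost
batch = standard * (1 - BATCH_DISCOUNT)
savings_per_run = standard - batch
annual_savings = savings_per_run * 2 * 365  # two runs per day

print(f"standard ${standard:.2f}, batch ${batch:.2f}, "
      f"saved ${savings_per_run:.2f}/run, ${annual_savings:,.2f}/year")
# standard 4ドル.80, batch 2ドル.40, saved 2ドル.40/run, 1,752ドル.00/year
```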

```python
def estimate_batch_cost(
    num_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-6",
) -> dict:
    """Estimate standard vs. Batch API cost in USD for a set of requests."""
    # Standard (non-batch) pricing in USD per million tokens
    pricing = {
        "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    }
    # Unknown models fall back to Sonnet pricing
    p = pricing.get(model, {"input": 3.00, "output": 15.00})

    total_input = num_requests * avg_input_tokens
    total_output = num_requests * avg_output_tokens
    standard_cost = (total_input / 1_000_000 * p["input"]
                     + total_output / 1_000_000 * p["output"])
    # The Batch API bills at 50% of standard pricing
    batch_cost = standard_cost * 0.50

    return {
        "standard_usd": round(standard_cost, 4),
        "batch_usd": round(batch_cost, 4),
        "savings_usd": round(standard_cost - batch_cost, 4),
    }
```

Comparing With llm-batch-coalesce

These are two different tools that are easy to confuse.

llm-batch-coalesce is about single-flighting concurrent requests to the same prompt. If ten different parts of your code call the model with the same prompt at the same time, llm-batch-coalesce detects the duplicate and makes one API call, sharing the result with all ten callers. This is a synchronous optimization, not the Batch API.
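
Single-flighting can be sketched with asyncio: the first caller for a given prompt starts the real call, and every concurrent caller with the same prompt awaits that same task. This is a generic illustration of the technique, not llm-batch-coalesce's API; `SingleFlight` and the `fake_model` coroutine are hypothetical stand-ins for your own wrapper and model call.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Share one in-flight call among concurrent callers that use the same prompt."""

    def __init__(self, fn: Callable[[str], Awaitable[str]]):
        self._fn = fn
        self._in_flight: dict[str, asyncio.Task] = {}

    def _forget(self, prompt: str, task: asyncio.Task) -> None:
        # Only remove the entry if it still points at this finished task
        if self._in_flight.get(prompt) is task:
            del self._in_flight[prompt]

    async def call(self, prompt: str) -> str:
        task = self._in_flight.get(prompt)
        if task is None:
            # First caller for this prompt: start the real request
            task = asyncio.create_task(self._fn(prompt))
            self._in_flight[prompt] = task
            task.add_done_callback(lambda t: self._forget(prompt, t))
        # shield() keeps one caller's cancellation from cancelling the shared call
        return await asyncio.shield(task)


# Usage with a stand-in for the real model call
calls: list[str] = []


async def fake_model(prompt: str) -> str:
    calls.append(prompt)
    await asyncio.sleep(0.01)  # simulate network latency
    return prompt.upper()


async def main() -> list[str]:
    sf = SingleFlight(fake_model)
    return await asyncio.gather(*(sf.call("hello") for _ in range(10)))


results = asyncio.run(main())
print(len(results), "results from", len(calls), "model call")
```

Ten concurrent callers receive the same answer while the model is called once. Once the shared call finishes, its entry is dropped, so a later call with the same prompt goes to the API again; this is deduplication of concurrent work, not a cache.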

The Anthropic Message Batches API is about submitting many independent requests in one HTTP call and getting results back asynchronously. Different mechanism, different use case.

Use llm-batch-coalesce when you have concurrent code making duplicate synchronous calls. Use the Batch API when you have independent work that does not need to complete within seconds.

Feature	llm-batch-coalesce	Anthropic Batch API
Latency	Synchronous (seconds)	Async (up to 24h)
Use case	Dedup concurrent calls	High-volume offline work
Cost benefit	Fewer duplicate calls (no per-token discount)	50% discount
Max requests	N/A (per-request)	100,000 requests or 256 MB per batch
Streaming	Supported	Not supported
Results available	Immediately	When the batch ends
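
On the Batch API side, the flat request dicts used with anthropic-batch-kit in this article differ from the raw Message Batches API format, where each request is `{"custom_id": ..., "params": {...}}` and model, max_tokens, and messages sit inside `params`. Below is a minimal sketch, assuming you are calling the API yourself, that converts flat dicts to that shape, rejects duplicate IDs, and splits a large workload into batches. `to_api_request` and `to_batches` are hypothetical helpers; the per-batch cap is a parameter because Anthropic's limits (request count and total size) can change, so check the current docs.

```python
from typing import Any, Iterator


def to_api_request(flat: dict[str, Any]) -> dict[str, Any]:
    """Nest everything except custom_id under "params", the raw API request shape."""
    params = {k: v for k, v in flat.items() if k != "custom_id"}
    return {"custom_id": flat["custom_id"], "params": params}


def to_batches(
    flat_requests: list[dict[str, Any]], max_per_batch: int
) -> Iterator[list[dict[str, Any]]]:
    """Yield API-shaped request lists with at most max_per_batch items each.

    custom_ids are checked for uniqueness across the whole workload (not just
    per batch), so every result can be matched back unambiguously.
    The total-size limit per batch is NOT checked here.
    """
    seen: set[str] = set()
    batch: list[dict[str, Any]] = []
    for flat in flat_requests:
        cid = flat["custom_id"]
        if cid in seen:
            raise ValueError(f"duplicate custom_id: {cid}")
        seen.add(cid)
        batch.append(to_api_request(flat))
        if len(batch) == max_per_batch:
            yield batch
            batch = []
    if batch:
        yield batch
```

With the official `anthropic` Python SDK, each yielded list could be submitted with `client.messages.batches.create(requests=batch)`. Because the generator validates lazily, a duplicate late in the input is only detected after earlier batches have been yielded; wrap the call in `list(...)` first if you want all validation done before anything is sent.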

Tradeoffs to Know

Debugging is harder. When a synchronous call fails, you find out immediately. When a batch request fails, you find out when the batch completes. If you submitted 1,000 requests and 50 failed, you need to identify which ones and why. Build retry logic to handle partial batch failures.
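
Below is a minimal sketch of that retry step. It assumes results shaped like the dict returned by `poll_and_retrieve` in the polling example (keyed by `custom_id`, each with a `"type"` field) and that you kept the original requests keyed the same way. `requests_to_retry` and `default_is_retryable` are hypothetical helpers. The retry policy is a parameter because not every failure is worth resending: a request that errored because its body was invalid will fail again unchanged, while expired requests and transient errors can often succeed on resubmission.

```python
from typing import Any, Callable

Result = dict[str, Any]


def default_is_retryable(result: Result) -> bool:
    """Retry expired and errored requests; skip successes and cancellations."""
    # Canceled requests are assumed to have been canceled on purpose
    return result["type"] in ("expired", "errored")


def requests_to_retry(
    original: dict[str, dict[str, Any]],
    results: dict[str, Result],
    is_retryable: Callable[[Result], bool] = default_is_retryable,
) -> list[dict[str, Any]]:
    """Return the original requests whose results should be resubmitted.

    original: custom_id -> request dict as originally submitted
    results:  custom_id -> result dict with at least a "type" key
    """
    retry = []
    for custom_id, request in original.items():
        result = results.get(custom_id)
        # A missing result is treated as retryable: the request never came back
        if result is None or is_retryable(result):
            retry.append(request)
    return retry
```

Pair this with a cap on retry rounds (for example, two or three) so a request that keeps failing does not loop forever. Because retried requests keep their custom_ids, results from the follow-up batch can be merged into the first by key.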

No SLA below 24 hours. Anthropic processes batches on a best-effort basis. Most small batches complete within an hour, but processing can take up to 24 hours, and any request still unprocessed at that point expires and must be resubmitted. Do not use the Batch API when you need a result within the hour.

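Results need parsing. The Batch API returns results as JSONL (one JSON object per line), not a simple list. Each line carries a custom_id and a result object whose type is succeeded, errored, canceled, or expired; a succeeded result includes the full model message, and an errored result includes an error. Results are not guaranteed to come back in submission order, so always match them by custom_id. anthropic-batch-kit handles this parsing for you, but if you are rolling your own client, budget time for it.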
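
If you are rolling your own client, the parsing can look like the minimal sketch below. It assumes the raw line shape used by the Message Batches API, `{"custom_id": ..., "result": {"type": ..., ...}}`, where a succeeded result carries the full `message` (whose `content` is a list of blocks) and an errored result carries an `error`. `parse_batch_results` is a hypothetical helper that flattens text blocks into one string; it is not anthropic-batch-kit's code.

```python
import json
from typing import Any, Iterable


def parse_batch_results(lines: Iterable[str]) -> dict[str, dict[str, Any]]:
    """Parse JSONL batch results into {custom_id: {"type", "content" | "error"}}."""
    parsed: dict[str, dict[str, Any]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        result = record["result"]
        entry: dict[str, Any] = {"type": result["type"]}
        if result["type"] == "succeeded":
            # Join the text blocks of the assistant message into one string
            blocks = result["message"]["content"]
            entry["content"] = "".join(
                b["text"] for b in blocks if b.get("type") == "text"
            )
        elif result["type"] == "errored":
            entry["error"] = result.get("error")
        # canceled / expired results carry no payload beyond their type
        parsed[record["custom_id"]] = entry
    return parsed
```

Since `lines` is any iterable of strings, you can pass an open file object directly, which keeps memory use flat for large batches.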

Context windows still apply. Each request in the batch is still subject to the model's context window limit. You cannot use the Batch API to send a larger context than the model supports.

Quick Start

```bash
pip install anthropic-batch-kit
```

```python
from anthropic_batch_kit import BatchKit

kit = BatchKit()  # reads ANTHROPIC_API_KEY from env

# Submit
batch_id = kit.submit([
    {
        "custom_id": "item-1",
        "model": "claude-sonnet-4-6",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    }
])

# Later: retrieve
results = kit.poll_and_retrieve(batch_id)
print(results["item-1"]["content"])
```

Related Tools

Tool	Purpose
anthropic-batch-kit	Submit, poll, retrieve Anthropic batches
llm-batch-coalesce	Single-flight dedup for concurrent sync calls
llm-cost-cap	Pre-flight USD gate for synchronous calls
agenttrace	Per-run cost tracking for synchronous agents
llm-fallback-router	Provider failover for synchronous calls

What Is Next

The Batch API is one of the more underused cost levers available. Most teams default to synchronous calls everywhere even when the use case is async-friendly. If you have eval runs or bulk processing jobs, measure your current monthly spend, apply the 50% discount, and decide if the async complexity is worth it.

Source and examples are at MukundaKatta/anthropic-batch-kit on GitHub.