AWS Bedrock prompt caching has a hidden cost most people miss

DEV Community

Over a few days of running this, I saw that my system prompts were getting evicted faster than I expected on Bedrock. The 5-minute TTL was the problem. My traffic was bursty and a lot of my prompts were arriving 6-7 minutes apart.

I batched my requests into tighter windows and the hit ratio went from 14% to 71%. Bill dropped to about 5% above where it was on the Anthropic API. That I could live with.
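To see why spacing matters so much, here is a minimal sketch that replays a list of arrival times against a 5-minute TTL. It assumes the first request always writes the cache and that every request restarts the TTL clock; the function name and the sample timings are made up for illustration, not taken from my logs.

```python
TTL_SECONDS = 300  # the 5-minute TTL discussed above


def cache_hit_ratio(arrival_times, ttl=TTL_SECONDS):
    """Fraction of requests that arrive while the cache is still warm.

    Assumption: every request (hit or miss) restarts the TTL clock.
    """
    hits = 0
    last_seen = None
    for t in sorted(arrival_times):
        if last_seen is not None and t - last_seen <= ttl:
            hits += 1
        last_seen = t
    return hits / len(arrival_times) if arrival_times else 0.0


# Ten requests spaced 6.5 minutes apart: each one finds the cache expired.
spread_out = [i * 390 for i in range(10)]

# The same ten requests batched into two tight windows, 30 s apart inside each.
batched = [i * 30 for i in range(5)] + [3600 + i * 30 for i in range(5)]

print(cache_hit_ratio(spread_out))
print(cache_hit_ratio(batched))
```

Only the first request of each window pays for a cache write in the batched case, which is the effect I was after.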

The JSON repair part

The other thing in this library is JSON repair. Bedrock will sometimes return Claude output wrapped in fenced code blocks even when you ask for JSON only. Or with a trailing comma. Or with the JSON object truncated because you hit max_tokens.

result = client.complete(
    messages=[...],
    response_format="json",
)
data = result.json  # repaired, parsed dict

The repair runs three passes:

Strip markdown fences (triple backticks, with or without a json language tag).
Find the largest balanced {...} or [...] substring.
Remove trailing commas inside objects/arrays.

If all three fail, you get the raw text and a JsonRepairError you can handle. About 95% of the malformed outputs I have seen are recoverable with these three passes. The remaining 5% are usually a truncated response, which is a real bug, not a parse bug, and you want to know about it.
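A rough sketch of what those three passes can look like, using only the standard library. This is not the library's actual code: the function names are hypothetical, the JsonRepairError class here is a stand-in, and I am assuming the passes are applied in sequence before a single parse attempt.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the listing readable


class JsonRepairError(ValueError):
    """Stand-in error type (assumption); keeps the raw text for the caller."""

    def __init__(self, raw):
        super().__init__("could not repair JSON")
        self.raw = raw


def strip_fences(text):
    # Pass 1: drop fence lines, with or without a language tag.
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith(FENCE)]
    return "\n".join(lines)


def largest_balanced(text):
    # Pass 2: longest {...} or [...] span whose brackets balance,
    # ignoring brackets that sit inside string literals.
    pairs = {"{": "}", "[": "]"}
    best = ""
    for start, ch in enumerate(text):
        if ch not in pairs:
            continue
        stack, in_str, esc = [], False, False
        for end in range(start, len(text)):
            c = text[end]
            if in_str:
                if esc:
                    esc = False
                elif c == "\\":
                    esc = True
                elif c == '"':
                    in_str = False
            elif c == '"':
                in_str = True
            elif c in pairs:
                stack.append(pairs[c])
            elif c in "}]":
                if not stack or stack.pop() != c:
                    break  # mismatched closer, give up on this start
                if not stack:
                    if end + 1 - start > len(best):
                        best = text[start:end + 1]
                    break
    return best


def remove_trailing_commas(text):
    # Pass 3: a comma followed only by whitespace and a closing bracket.
    # Naive: it would also touch that sequence inside a string value.
    return re.sub(r",\s*([}\]])", r"\1", text)


def repair_json(raw):
    candidate = largest_balanced(strip_fences(raw))
    if not candidate:
        raise JsonRepairError(raw)
    try:
        return json.loads(remove_trailing_commas(candidate))
    except json.JSONDecodeError as exc:
        raise JsonRepairError(raw) from exc
```

A response cut off by max_tokens never closes its outer brace, so the second pass finds nothing and the error surfaces instead of a silently partial dict.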

Throttle handling

Bedrock throttles aggressively in some regions. The library has built-in exponential backoff with jitter for ThrottlingException. The defaults are the ones that work for me: 5 retries, base 1s, cap 30s, full jitter.

client = BedrockClient(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    retry_throttle=True,  # default
    max_retries=5,
)

If you want to handle it yourself, set retry_throttle=False and you get the raw boto3 error.
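If you do roll your own, full jitter is only a few lines. This is a generic sketch of the technique with the same defaults, not the library's internals; the helper names and the is_throttle callback are my own for illustration.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    # Full jitter: uniform draw between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def call_with_backoff(fn, is_throttle, max_retries=5, base=1.0, cap=30.0,
                      sleep=time.sleep):
    """Call fn(); on a throttle error, sleep and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a throttle, or the final failure.
            if not is_throttle(exc) or attempt == max_retries:
                raise
            sleep(full_jitter_delay(attempt, base, cap))
```

With boto3, is_throttle would typically check that the exception is a botocore ClientError whose response["Error"]["Code"] equals "ThrottlingException". Passing sleep as a parameter makes the retry loop easy to test without waiting.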

What this is not

This is not a full SDK. It is a thin wrapper around boto3 with a few quality-of-life things. If you want streaming, raw byte access, or anything past the basic complete-and-cost flow, drop down to boto3 directly. The client exposes the underlying session so you can mix.

It also does not call out the cache cost story automatically. You have to log the cache_stats field and look at it. I would like to add a background warning when status is COLD for more than N requests in a row. That is on the list.
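Until that lands, a small counter on your side gets you most of the way. This sketch assumes you can pull a status string out of cache_stats for each request and that "COLD" is the value for a miss; the class name, the "WARM" value in the usage loop, and the threshold are all hypothetical.

```python
import logging

logger = logging.getLogger("cache_watch")


class ColdStreakMonitor:
    """Warn when more than `threshold` consecutive requests report COLD."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.streak = 0

    def record(self, status):
        # Returns True while the current COLD streak is above the threshold.
        if status == "COLD":
            self.streak += 1
        else:
            self.streak = 0
        if self.streak > self.threshold:
            logger.warning("prompt cache COLD for %d requests in a row", self.streak)
            return True
        return False


monitor = ColdStreakMonitor(threshold=2)
for status in ["COLD", "COLD", "COLD", "WARM", "COLD"]:
    monitor.record(status)  # warns on the third consecutive COLD only
```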

Repo

GitHub: https://github.com/MukundaKatta/bedrock-kit
PyPI: pip install bedrock-kit

If you are on Bedrock and you turned on prompt caching because it was free on the Anthropic side, check your hit ratio. The breakeven on Bedrock is higher than you think.