How to Add Retry Logic to LLM Calls in 5 Min

If you're wrapping LLM calls in try/except with time.sleep(), you're doing it the hard way. Here's the fix in one decorator.

The Code

import logging

import openai
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@retry(
    # Retry only on transient failures; everything else raises immediately.
    retry=retry_if_exception_type((
        openai.RateLimitError,
        openai.APITimeoutError,
        openai.APIConnectionError,
    )),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    stop=stop_after_attempt(3),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_llm(prompt: str) -> str:
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


result = call_llm("Explain retry logic in one sentence.")
print(result)

Install the dependency:

pip install tenacity openai

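Run it. If OpenAI returns a rate limit error, the call waits (starting at the 2-second minimum and backing off exponentially) and tries again, for at most 3 attempts in total: the original call plus 2 retries. If the third attempt also fails, tenacity raises a RetryError that wraps the last exception. Pass reraise=True to @retry if you'd rather get the original exception back.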

How It Works

retry_if_exception_type tells tenacity which errors to retry. We target three specific OpenAI exceptions:

RateLimitError (429) -- you hit the token or request limit
APITimeoutError -- the request took too long
APIConnectionError -- network issues between you and OpenAI

All other errors (like AuthenticationError or BadRequestError) raise immediately. You don't want to retry a bad API key three times.

wait_exponential(multiplier=1, min=2, max=30) sets the backoff schedule. The wait starts at the 2-second minimum, doubles as failed attempts accumulate, and caps at 30 seconds (the exact delay before the first couple of retries depends on your tenacity version). This is critical for rate limits -- hammering the API with instant retries makes the problem worse.
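To see which delays a given set of parameters produces, the schedule can be computed by hand. The sketch below assumes the formula used by recent tenacity releases, multiplier * exp_base ** (attempt - 1) clamped to the [min, max] range; other versions may compute the exponent differently, so treat the numbers as illustrative. backoff_schedule is a hypothetical helper, not part of tenacity.

```python
def backoff_schedule(
    attempts: int,
    multiplier: float = 1,
    min_wait: float = 2,
    max_wait: float = 30,
    exp_base: float = 2,
) -> list[float]:
    """Delay after each failed attempt, under the assumed formula."""
    delays = []
    for attempt in range(1, attempts + 1):
        raw = multiplier * exp_base ** (attempt - 1)
        # Clamp the raw value between the floor and the cap.
        delays.append(max(min_wait, min(raw, max_wait)))
    return delays


print(backoff_schedule(6))
```

Under that assumption the first six delays come out as 2, 2, 4, 8, 16, 30: the floor dominates early on, and the cap takes over once the raw value passes 30.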

stop_after_attempt(3) caps the total attempts, counting the first call, so you get at most two retries. Three attempts is the sweet spot for most LLM calls. More than that usually means the issue isn't transient.

before_sleep_log logs a warning before each retry so you know exactly when and why retries happen. No silent failures.

What You're Replacing

Here's what most codebases have instead:

# Don't do this
import time

import openai

client = openai.OpenAI()


def call_llm_bad(prompt: str) -> str:
    for attempt in range(3):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception:  # swallows everything, including auth errors
            time.sleep(2)  # fixed delay, no backoff
    raise Exception("LLM call failed after 3 attempts")

This catches every exception (including auth errors that will never self-resolve), uses a fixed sleep (no backoff), and gives you zero visibility into what failed. The tenacity version handles all three correctly and keeps the retry policy in one declarative decorator, separate from the call itself.

Quick Customizations

Retry more aggressively for batch jobs:

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=2, min=4, max=60),
    stop=stop_after_attempt(5),
)

Add a fallback when all retries fail:

from tenacity import RetryError

try:
    result = call_llm("Your prompt here")
except RetryError:
    # Raised once the attempts are exhausted (unless reraise=True is set).
    result = "Fallback: LLM unavailable. Using cached response."

The pattern works with any LLM provider -- swap openai.RateLimitError for the equivalent exception from Anthropic, Google, or your provider's SDK.

Next Steps

Retry logic is one piece of production-ready LLM code. For the full picture of what breaks in production agents and how to fix each failure mode, check out 5 AI Agent Failures in Production.

If you're building agents that chain multiple LLM calls and tool actions, Nebula handles retry and fallback logic automatically for every tool call -- no decorators needed.

Part of the AI Agent Quick Tips series.