How to Add LLM Model Fallbacks in Python in 5 Min

DEV Community

The timeout=10 on each call prevents a slow provider from blocking the entire chain. If all models fail, the function raises a RuntimeError with a summary of every error.

Expected Output

When your primary model is rate-limited:

[FAIL] gpt-4o: Error code: 429 - Rate limit exceeded
[OK] Response from claude-3-5-sonnet-20241022
Answer: DNS translates human-readable domain names into IP addresses so your browser knows which server to contact.

The user never knows anything went wrong. Your agent stays alive.

Tips for Production

Order by cost. Put your cheapest model first and your most expensive last. Fallbacks should escalate cost, not the other way around.

Add jitter between retries. If the first model failed due to a rate limit, hitting it again immediately will fail again. A time.sleep(0.5) between attempts helps.

Log every fallback. That print statement should be a structured log in production. If you are falling back 40% of the time, your primary provider has a problem.

This is part of the AI Agent Quick Tips series -- short, copy-paste patterns for building production agents. Previous tips cover retry logic, streaming responses, and guardrails.

If managing provider keys and fallback chains sounds like plumbing you would rather skip, Nebula handles model routing for you -- but the pattern above works anywhere.