LLM API Router: Groq, Together AI & OpenRouter
Build a multi-provider LLM router in Python with cost-based routing, latency tracking, and automatic fallbacks across Groq, Together AI, and OpenRouter.
Build a smart router that picks the cheapest or fastest LLM provider for each request — and falls back automatically when one goes down.
⚡ This post has interactive code — click ▶ Run or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
You’ve got three LLM providers. Groq is blazing fast. Together AI runs open-source models at solid prices. OpenRouter gives you a single gateway to hundreds of models.
But what happens when Groq hits its rate limit at 2 AM and your app goes silent?
That’s the problem a multi-provider router solves. You define a priority chain. The router tries the first provider. If it fails, it moves to the next. Your users never notice. In this tutorial, you’ll build that router from scratch using raw HTTP requests.
What Is a Multi-Provider LLM Router?
A multi-provider LLM router sits between your application and multiple LLM APIs. It decides which provider handles each request.
Think of it like a load balancer for AI. A web load balancer distributes traffic across servers. An LLM router distributes inference requests across providers like Groq, Together AI, and OpenRouter.
Why would you need one? Three reasons:
| Reason | What It Solves |
|---|---|
| Reliability | If Provider A goes down, Provider B takes over |
| Cost optimization | Route cheap requests to the cheapest provider |
| Latency optimization | Route time-sensitive requests to the fastest one |
The router doesn’t change what you send. It translates your prompt into each provider’s API format, tries them in order, and returns the first successful response.
Key Insight: A multi-provider router turns unreliable individual providers into a reliable system. No single provider guarantees 100% uptime — but three providers with automatic fallback come close.
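To put a rough number on that claim, here's a back-of-the-envelope calculation. The 99% uptime figure is an assumption for illustration, and it treats provider outages as independent, which they aren't always:

```python
# Rough availability math: probability that at least one of n
# independent providers is up, assuming each is up 99% of the time.
def combined_uptime(per_provider_uptime, n_providers):
    downtime_all = (1 - per_provider_uptime) ** n_providers
    return 1 - downtime_all

print(f"One provider:    {combined_uptime(0.99, 1):.6f}")   # 0.990000
print(f"Three providers: {combined_uptime(0.99, 3):.6f}")   # 0.999999
```

Going from one provider to three takes you from roughly 3.7 days of downtime per year to under a minute, at least on paper.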
The Three Providers
Before we write code, you need to understand what makes each provider different. They’re not interchangeable — each has a sweet spot.
| Provider | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Groq | Ultra-low latency (0.13s TTFT), free tier | Tight rate limits, fewer models | Speed-critical requests |
| Together AI | Fixed pricing, 200+ open models | No proprietary models (no GPT/Claude) | Cost-predictable open-source inference |
| OpenRouter | 500+ models, built-in fallback | 5-10% markup, 25-40ms routing overhead | Universal fallback, model variety |
Groq runs inference on custom LPU (Language Processing Unit) hardware. It’s the speed king. First-token latency hits 0.13 seconds on short prompts. The free tier works without a credit card, though rate limits are tight.
Together AI hosts 200+ open-source models on its own GPU clusters. No routing markup — you pay fixed rates per model. It’s your best bet for a specific open-source model like Llama 3 at a predictable price.
OpenRouter is a gateway, not a provider. It doesn’t run models itself. It routes your request to whichever provider offers the best deal — and adds a 5-10% markup. Think of it as the fallback of last resort.
Prerequisites
- Python version: 3.10+
- Required library: `requests` (`pip install requests`)
- API keys: One from each provider (setup below)
- Time to complete: ~25 minutes
- Cost: Under $0.01 total (all three have free tiers or credits)
Pyodide note: The `requests` library doesn’t work natively in browser-based Python (Pyodide). You’d need the `pyodide-http` patch. The code in this tutorial targets standard Python environments.
Get Your API Keys
Each provider needs its own API key. Here’s where to get them.
Groq: Go to console.groq.com → API Keys → Create API Key. No credit card needed for the free tier.
Together AI: Go to api.together.xyz → Settings → API Keys → Generate. You’ll get free credits on signup.
OpenRouter: Go to openrouter.ai/keys → Create a new key. Add credits or use the free tier.
Store all three keys as environment variables. Never hardcode them in scripts.
# Browser (Pyodide) setup: installs requests and prompts for your keys
import micropip
await micropip.install(['requests'])
import os
from js import prompt
GROQ_API_KEY = prompt("Enter your Groq API key:")
os.environ["GROQ_API_KEY"] = GROQ_API_KEY
TOGETHER_API_KEY = prompt("Enter your Together AI API key:")
os.environ["TOGETHER_API_KEY"] = TOGETHER_API_KEY
OPENROUTER_API_KEY = prompt("Enter your OpenRouter API key:")
os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
import os
import requests
import time
import json
# Load API keys from environment variables
# Create a .env file with: GROQ_API_KEY, TOGETHER_API_KEY, OPENROUTER_API_KEY
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "your_groq_key_here")
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY", "your_together_key_here")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "your_openrouter_key_here")
print("Keys loaded successfully")
Keys loaded successfully
Tip: Use `python-dotenv` for local development. Install it with `pip install python-dotenv`, create a `.env` file in your project root, and add `from dotenv import load_dotenv; load_dotenv()` at the top of your script.
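If you're curious what `load_dotenv` does under the hood, a stripped-down version is only a few lines. This hand-rolled loader is a sketch for illustration (the `demo.env` filename and key are made up); use `python-dotenv` in real projects:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blanks and '#' comments skipped.
    Existing environment variables win (setdefault never overwrites)."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway file and a made-up key name
with open("demo.env", "w") as f:
    f.write("# demo file\nDEMO_ROUTER_KEY=abc123\n")
load_env_file("demo.env")
print(os.environ.get("DEMO_ROUTER_KEY"))  # abc123
```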
Call Each Provider with Raw HTTP
Every LLM provider exposes a REST API. You send a POST request with your prompt. You get back JSON with the response. All three providers use the same format — OpenAI-compatible chat completions. Same structure. Different endpoints and model names.
Here’s the pattern each call follows:
- Build the URL, headers, and JSON payload
- Time the request with `time.time()`
- Check the status code
- Return a standardized result dictionary
Calling Groq
Groq’s endpoint is https://api.groq.com/openai/v1/chat/completions. We’ll use llama-3.1-8b-instant — one of the fastest models on their free tier. The function sends the prompt, measures latency, and returns a clean result dict.
def call_groq(prompt, model="llama-3.1-8b-instant", max_tokens=256):
    """Send a chat completion request to Groq."""
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {GROQ_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "groq", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"Groq error {response.status_code}: {response.text}")
Quick check — what does this function return on success? A dictionary with five keys: provider, model, content, latency_ms, and tokens_used. That consistent shape matters. Every provider returns the same structure. So the router can treat them all the same way.
Warning: Groq’s free tier limits you to 30 requests/minute and 14,400 tokens/minute. Exceed these and you’ll get a 429 status code. Our router will catch this and fall back automatically.
Calling Together AI and OpenRouter
Together AI and OpenRouter follow the same pattern. Only the URL, headers, and model name change. Together AI lives at https://api.together.xyz/v1/chat/completions. OpenRouter lives at https://openrouter.ai/api/v1/chat/completions and needs two extra headers: HTTP-Referer and X-Title.
def call_together(prompt, model="meta-llama/Llama-3.1-8B-Instruct-Turbo", max_tokens=256):
    """Send a chat completion request to Together AI."""
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {TOGETHER_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens, "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "together", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"Together error {response.status_code}: {response.text}")
def call_openrouter(prompt, model="meta-llama/llama-3.1-8b-instruct", max_tokens=256):
    """Send a chat completion request to OpenRouter."""
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://machinelearningplus.com",
        "X-Title": "MLPlus LLM Router Tutorial"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens, "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "openrouter", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"OpenRouter error {response.status_code}: {response.text}")

print("All three provider functions defined")
All three provider functions defined
See the pattern? Same payload. Same response parsing. Same result dictionary. The only differences are the URL, the API key, and the model name. That’s the beauty of the OpenAI-compatible format. Most providers adopted it.
Build the Basic Router with Fallbacks
Here’s where it all comes together. The router tries providers in order. If one fails — timeout, rate limit, server error — it catches the error and moves to the next. If all fail, it raises a clear exception.
The route_request function loops through provider functions. The first successful response wins. Every failure gets logged so you can debug later.
def route_request(prompt, providers=None, max_tokens=256):
    """
    Try each provider in order. Return the first successful response.
    Falls back automatically on any error.
    """
    if providers is None:
        providers = [
            ("groq", call_groq),
            ("together", call_together),
            ("openrouter", call_openrouter)
        ]
    errors = []
    for name, call_fn in providers:
        try:
            result = call_fn(prompt, max_tokens=max_tokens)
            result["fallback_errors"] = errors
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
            print(f"[FALLBACK] {name} failed: {e}")
    raise Exception(f"All providers failed: {errors}")
Let’s test it. If your Groq key is valid, it should respond first. If not, the router falls back.
result = route_request("Explain gradient descent in two sentences.")
print(f"Provider used: {result['provider']}")
print(f"Latency: {result['latency_ms']}ms")
print(f"Response: {result['content'][:200]}")
if result["fallback_errors"]:
    print(f"Fallbacks triggered: {len(result['fallback_errors'])}")
That’s the entire fallback mechanism in under 20 lines. No external libraries. No complex configuration. Just a try/except loop with logging.
Key Insight: The fallback order defines your cost-speed tradeoff. Groq first means you optimize for speed. Together AI first means you optimize for cost. OpenRouter last means you always have a safety net.
Predict the output: if Groq returns a 429 error and Together AI succeeds, what will result["provider"] be? It’ll be "together". And result["fallback_errors"] will contain one entry with the Groq error.
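You can verify that prediction without burning API calls by swapping in stub providers. This sketch reuses the same try/except loop with two hypothetical stubs, one that always raises a 429 and one that always succeeds:

```python
# Stub providers: hypothetical stand-ins so the fallback path runs offline
def flaky_groq(prompt, max_tokens=256):
    raise Exception("Groq error 429: rate limit exceeded")  # simulated 429

def stub_together(prompt, max_tokens=256):
    return {"provider": "together", "content": f"echo: {prompt}"}

def route_request_demo(prompt, providers, max_tokens=256):
    """Same loop as route_request above, renamed to avoid clobbering it."""
    errors = []
    for name, call_fn in providers:
        try:
            result = call_fn(prompt, max_tokens=max_tokens)
            result["fallback_errors"] = errors
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
    raise Exception(f"All providers failed: {errors}")

result = route_request_demo("hello", [("groq", flaky_groq), ("together", stub_together)])
print(result["provider"])              # together
print(len(result["fallback_errors"]))  # 1
```

Stubbing like this is also a cheap way to unit-test your router logic before wiring in real keys.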
Add Cost-Based Routing
The basic router always tries providers in the same order. But what if you want the cheapest available option? Cost-based routing sorts providers by price before trying them.
We’ll define a price table with per-token costs. The router sorts by cost and tries the cheapest first. Here are rough rates as of early 2026. Check each provider’s pricing page for current numbers.
PROVIDER_PRICING = {
    "groq": {
        "model": "llama-3.1-8b-instant",
        "input_per_million": 0.05,
        "output_per_million": 0.08,
        "call_fn": call_groq
    },
    "together": {
        "model": "meta-llama/Llama-3.1-8B-Instruct-Turbo",
        "input_per_million": 0.18,
        "output_per_million": 0.18,
        "call_fn": call_together
    },
    "openrouter": {
        # Note: OpenRouter adds 5-10% markup on top of base provider pricing
        "model": "meta-llama/llama-3.1-8b-instruct",
        "input_per_million": 0.06,
        "output_per_million": 0.06,
        "call_fn": call_openrouter
    }
}

def estimate_cost(provider_name, input_tokens=100, output_tokens=256):
    """Estimate the cost for a request given token counts."""
    pricing = PROVIDER_PRICING[provider_name]
    input_cost = (input_tokens / 1_000_000) * pricing["input_per_million"]
    output_cost = (output_tokens / 1_000_000) * pricing["output_per_million"]
    return input_cost + output_cost

for name in sorted(PROVIDER_PRICING, key=lambda n: estimate_cost(n)):
    cost = estimate_cost(name)
    print(f"{name:12s}: ${cost:.8f} per request (est. 100 in + 256 out)")
openrouter : $0.00002136 per request (est. 100 in + 256 out)
groq : $0.00002548 per request (est. 100 in + 256 out)
together : $0.00006408 per request (est. 100 in + 256 out)
OpenRouter’s listed rate is lowest here. But the 5-10% markup isn’t fully reflected. In practice, Groq’s free tier makes it zero cost until you exceed rate limits.
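To see how the markup shifts the picture, apply it to the listed rate. The 5.5% figure below is an assumed mid-range of the 5-10% band, not OpenRouter's actual fee:

```python
# Apply an assumed 5.5% markup to OpenRouter's listed rate
listed_per_million = 0.06          # $ per million tokens (input and output)
markup = 0.055                     # assumption: mid-range of 5-10%
effective_per_million = listed_per_million * (1 + markup)

tokens = 100 + 256                 # same request size as the table above
print(f"Listed:    ${tokens / 1_000_000 * listed_per_million:.8f}")     # $0.00002136
print(f"Effective: ${tokens / 1_000_000 * effective_per_million:.8f}")  # $0.00002253
```

Even with the markup applied, OpenRouter stays below Groq's listed rate in this example; the point is to budget with effective rates, not listed ones.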
Now the cost-based router. It sorts providers by estimated cost, then calls route_request with that ordering.
def route_by_cost(prompt, input_tokens=100, output_tokens=256, max_tokens=256):
    """Route to the cheapest provider first. Fall back on failure."""
    ranked = sorted(
        PROVIDER_PRICING.items(),
        key=lambda item: estimate_cost(item[0], input_tokens, output_tokens)
    )
    providers = [(name, info["call_fn"]) for name, info in ranked]
    print(f"Cost-based order: {[p[0] for p in providers]}")
    return route_request(prompt, providers=providers, max_tokens=max_tokens)

result = route_by_cost("What is backpropagation in one sentence?")
print(f"Cheapest available: {result['provider']}")
Tip: Track actual costs, not estimates. Each provider returns token usage in the response. Log it. After a few hundred requests, you’ll have real cost data per provider — and it’ll differ from published rates.
Exercise 1: Build a Latency-Based Router
You’ve seen cost-based routing. Now build a latency-based version. Sort providers by recent average latency and try the fastest first.
Compare Providers Side by Side
Before you deploy a router in production, benchmark your providers. Let’s send the same prompt to all three and compare latency, cost, and response quality in a single table.
The benchmark_providers function calls every provider with the same input. It shows four columns: provider name, latency in ms, estimated cost, and a response preview. Latency will vary based on your location and time of day.
def benchmark_providers(prompt, max_tokens=256):
    """Call all providers with the same prompt and compare."""
    results = []
    for name, info in PROVIDER_PRICING.items():
        try:
            result = info["call_fn"](prompt, max_tokens=max_tokens)
            cost = estimate_cost(name,
                result["tokens_used"].get("prompt_tokens", 100),
                result["tokens_used"].get("completion_tokens", 256))
            result["est_cost"] = cost
            results.append(result)
        except Exception as e:
            results.append({"provider": name, "error": str(e),
                            "latency_ms": -1, "est_cost": 0})
    print(f"\n{'Provider':<12} {'Latency':<10} {'Est Cost':<15} {'Preview'}")
    print("-" * 70)
    for r in results:
        if "error" in r:
            print(f"{r['provider']:<12} {'FAILED':<10} {'-':<15} {r['error'][:35]}")
        else:
            preview = r["content"][:35].replace("\n", " ")
            print(f"{r['provider']:<12} {r['latency_ms']:<10}ms ${r['est_cost']:<14.8f} {preview}...")
    return results

results = benchmark_providers("Explain what a neural network is in 2 sentences.")
You’ll likely see Groq winning on latency by a wide margin. Together AI should be competitive on cost. OpenRouter sits in the middle — it adds routing overhead but gives you the widest model selection.
Handle Rate Limits and Errors Gracefully
Ever seen a 429 status code at 3 AM with users waiting? Rate limits are the most common failure mode in multi-provider setups. Every provider has them. When you hit one, you get that 429 HTTP status code — often with a Retry-After header.
A good router tells two error types apart. Transient errors (rate limits, timeouts, 503s) deserve a fallback. Permanent errors (bad API key, model not found) should fail fast. Let’s define custom exceptions for each.
class RateLimitError(Exception):
    """Raised when a provider returns 429 Too Many Requests."""
    def __init__(self, provider, retry_after=None):
        self.provider = provider
        self.retry_after = retry_after
        super().__init__(f"{provider} rate limited. Retry after: {retry_after}s")

class ProviderError(Exception):
    """Raised for non-retryable errors (auth, model not found)."""
    def __init__(self, provider, status_code, message):
        self.provider = provider
        self.status_code = status_code
        super().__init__(f"{provider} error {status_code}: {message}")
Now a wrapper function that classifies errors. It calls any provider function and translates HTTP status codes into our custom exceptions.
def call_provider_safe(name, call_fn, prompt, max_tokens=256):
    """Wrap a provider call with structured error handling."""
    try:
        return call_fn(prompt, max_tokens=max_tokens)
    except Exception as e:
        error_str = str(e)
        if "429" in error_str:
            raise RateLimitError(name, retry_after=60)
        elif "401" in error_str or "403" in error_str:
            raise ProviderError(name, 401, "Invalid API key")
        elif "404" in error_str:
            raise ProviderError(name, 404, "Model not found")
        else:
            raise

print("Error classes and safe wrapper defined")
Error classes and safe wrapper defined
The enhanced router uses these error classes. Rate-limited providers get a fallback. Permanently broken providers get disabled for the session. Unknown errors trigger a fallback too.
def route_with_error_handling(prompt, max_tokens=256):
    """Smart router with error classification and fallback."""
    providers = [
        ("groq", call_groq),
        ("together", call_together),
        ("openrouter", call_openrouter)
    ]
    disabled = set()
    errors = []
    for name, call_fn in providers:
        if name in disabled:
            continue
        try:
            result = call_provider_safe(name, call_fn, prompt, max_tokens)
            result["fallback_errors"] = errors
            return result
        except RateLimitError as e:
            errors.append({"provider": name, "type": "rate_limit"})
            print(f"[RATE LIMITED] {name} — falling back")
        except ProviderError as e:
            errors.append({"provider": name, "type": "permanent"})
            disabled.add(name)
            print(f"[DISABLED] {name} — {e}")
        except Exception as e:
            errors.append({"provider": name, "type": "unknown"})
            print(f"[ERROR] {name} — {e}")
    raise Exception(f"All providers failed: {errors}")

result = route_with_error_handling("What is fine-tuning?")
print(f"Provider: {result['provider']}")
Warning: Don’t retry rate-limited providers immediately. If Groq returns 429, hammering it again makes things worse. Fall back instead. In production, add a cooldown timer — skip the rate-limited provider for 60 seconds.
Exercise 2: Add a Cooldown Timer
The current router has no cooldown for rate limits. Add a time-based cooldown so rate-limited providers get re-enabled automatically after a waiting period.
Build a Production-Ready Router Class
Let’s combine everything into one class. The LLMRouter supports multiple routing strategies, automatic fallbacks, cooldown timers, and request logging.
The constructor takes a list of provider configs and a default strategy. Each config includes the name, call function, and pricing.
class LLMRouter:
    """Multi-provider LLM router with fallbacks and routing strategies."""
    def __init__(self, providers, default_strategy="cost"):
        self.providers = providers
        self.default_strategy = default_strategy
        self.cooldowns = {}
        self.latency_history = {p["name"]: [] for p in providers}
        self.request_log = []
        self.cooldown_seconds = 60
The helper methods handle cooldowns, latency tracking, and ranking. _rank_providers sorts available providers by the chosen strategy.
def _is_available(self, name):
    if name not in self.cooldowns:
        return True
    if time.time() >= self.cooldowns[name]:
        del self.cooldowns[name]
        return True
    return False

def _set_cooldown(self, name):
    self.cooldowns[name] = time.time() + self.cooldown_seconds

def _get_avg_latency(self, name):
    history = self.latency_history[name]
    if not history:
        return 9999
    recent = history[-10:]  # Last 10 requests
    return sum(recent) / len(recent)

def _rank_providers(self, strategy):
    available = [p for p in self.providers if self._is_available(p["name"])]
    if strategy == "cost":
        return sorted(available, key=lambda p: p["cost_per_million_output"])
    elif strategy == "latency":
        return sorted(available, key=lambda p: self._get_avg_latency(p["name"]))
    return available  # "priority" — use defined order
The route method is the main entry point. It ranks providers, tries each one, and handles fallbacks. Success logs the latency. A 429 error sets a cooldown.
def route(self, prompt, strategy=None, max_tokens=256):
    strategy = strategy or self.default_strategy
    ranked = self._rank_providers(strategy)
    if not ranked:
        raise Exception("No providers available (all in cooldown)")
    errors = []
    for provider in ranked:
        name = provider["name"]
        try:
            result = provider["call_fn"](prompt, max_tokens=max_tokens)
            self.latency_history[name].append(result["latency_ms"])
            self.request_log.append({
                "provider": name, "strategy": strategy,
                "latency_ms": result["latency_ms"], "success": True
            })
            result["fallback_errors"] = errors
            result["strategy_used"] = strategy
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
            if "429" in str(e):
                self._set_cooldown(name)
            self.request_log.append({
                "provider": name, "strategy": strategy,
                "error": str(e), "success": False
            })
    raise Exception(f"All providers failed: {errors}")

def get_stats(self):
    """Return routing statistics."""
    total = len(self.request_log)
    successes = sum(1 for r in self.request_log if r["success"])
    by_provider = {}
    for r in self.request_log:
        name = r["provider"]
        if name not in by_provider:
            by_provider[name] = {"success": 0, "fail": 0}
        key = "success" if r["success"] else "fail"
        by_provider[name][key] += 1
    return {
        "total_requests": total,
        "success_rate": f"{successes/total*100:.1f}%" if total else "N/A",
        "by_provider": by_provider,
        "avg_latency": {
            n: f"{self._get_avg_latency(n):.0f}ms"
            for n in self.latency_history
        }
    }

print("LLMRouter class defined")
LLMRouter class defined
Note: In a real project, you’d put this class in a single file. We split it here for readability — each block covers one responsibility.
Now let’s put it to work. We’ll configure all three providers and test different routing strategies.
router = LLMRouter(
    providers=[
        {"name": "groq", "call_fn": call_groq, "cost_per_million_output": 0.08},
        {"name": "together", "call_fn": call_together, "cost_per_million_output": 0.18},
        {"name": "openrouter", "call_fn": call_openrouter, "cost_per_million_output": 0.06}
    ],
    default_strategy="cost"
)

# Cost-based routing
result = router.route("What is transfer learning?", strategy="cost")
print(f"Cost routing -> {result['provider']} ({result['latency_ms']}ms)")

# Latency-based routing
result = router.route("What is a loss function?", strategy="latency")
print(f"Latency routing -> {result['provider']} ({result['latency_ms']}ms)")

# Priority-based routing (use defined order)
result = router.route("What is batch normalization?", strategy="priority")
print(f"Priority routing -> {result['provider']} ({result['latency_ms']}ms)")

print("\nRouter stats:")
print(json.dumps(router.get_stats(), indent=2))
You can swap strategies per request. Use cost routing for batch jobs. Use latency routing for user-facing responses. Use priority for maximum control.
Exercise 3: Add Weighted Random Routing
Add a "weighted" strategy where providers are chosen randomly but weighted by a reliability score. Higher weight means more likely to be chosen first.
What Is LiteLLM? A Higher-Level Alternative
Everything we’ve built uses raw HTTP. That’s great for learning and for full control. But there’s a popular library that handles multi-provider routing out of the box.
LiteLLM is an open-source Python SDK. It wraps 100+ providers behind one completion() call. Routing, fallbacks, load balancing, cost tracking — it does all of it. Here’s what our router looks like with LiteLLM. This is a preview, not runnable code.
# Preview only — requires: pip install litellm
# LiteLLM doesn't work in Pyodide (needs httpx)
# from litellm import Router
#
# router = Router(
#     model_list=[
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "groq/llama-3.1-8b-instant", "api_key": GROQ_API_KEY}},
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "together_ai/meta-llama/Llama-3.1-8B-Instruct-Turbo",
#             "api_key": TOGETHER_API_KEY}},
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "openrouter/meta-llama/llama-3.1-8b-instruct",
#             "api_key": OPENROUTER_API_KEY}}
#     ],
#     routing_strategy="cost-based-routing",
#     fallbacks=[{"fast-llm": ["fast-llm"]}]
# )

print("LiteLLM preview — see docs.litellm.ai/docs/routing")
LiteLLM preview — see docs.litellm.ai/docs/routing
LiteLLM groups all three providers under one virtual model name (fast-llm). It supports six strategies: simple-shuffle, least-busy, usage-based-routing, usage-based-routing-v2, latency-based-routing, and cost-based-routing. It also supports async calls via router.acompletion(). Useful if you’re building with asyncio or FastAPI.
Should you use LiteLLM or build your own? Here’s how I think about it:
| Scenario | Best Approach |
|---|---|
| Learning how routing works | Build your own (this tutorial) |
| Quick prototype, < 10K req/day | LiteLLM — saves time |
| Production with custom logic | Build your own — full control |
| Production with standard routing | LiteLLM — battle-tested |
Other Routing Frameworks Worth Knowing
LiteLLM isn’t the only game in town. Two other frameworks take different approaches:
- RouteLLM (by LMSYS) routes between a strong model and a weak model based on query complexity. It claims 85% cost savings while keeping 95% of GPT-4 quality. Send easy questions to a cheap model. Send hard ones to an expensive one.
- NVIDIA LLM Router is an enterprise blueprint that routes requests based on task difficulty. It targets production deployments with GPU infrastructure.
Both solve a different problem than our router. We route across providers (Groq vs Together vs OpenRouter). They route across model tiers (cheap vs expensive). You could combine both. Use our provider router inside a model-tier router.
When NOT to Build Your Own Router
Not every project needs a router. Here are cases where simpler approaches work better.
Use OpenRouter alone when you don’t care which provider serves the request. OpenRouter already IS a router — it picks the cheapest provider internally. Adding your own layer on top is double-routing.
Use a single provider when your traffic is predictable. If you’re making 100 requests a day on Groq’s free tier, you don’t need fallbacks. Keep it simple.
Use LiteLLM when you need production routing quickly but don’t have custom requirements. It handles rate limits, retries, cost tracking, and 100+ providers out of the box.
Tip: Start simple, add complexity when you need it. Most projects don’t need a router on day one. Start with one provider. Add a second when you hit rate limits. Build a full router only when simpler approaches fall short.
Common Mistakes and Troubleshooting
Here are the errors you’ll hit when working with multi-provider setups.
Mistake 1: Not handling timeouts separately from other errors
# Bad — everything gets the same treatment
try:
    result = call_groq(prompt)
except Exception:
    result = call_together(prompt)

# Better — timeouts get a shorter retry window
try:
    result = call_groq(prompt)
except requests.exceptions.Timeout:
    print("Groq timed out — falling back fast")
    result = call_together(prompt)
except Exception as e:
    print(f"Groq error: {e} — falling back")
    result = call_together(prompt)
Mistake 2: Ignoring the Retry-After header
When a provider returns 429, it often includes a Retry-After header. That header tells you exactly how long to wait. Ignore it and you keep hammering a rate-limited endpoint.
response = requests.post(url, headers=headers, json=payload)
if response.status_code == 429:
    retry_after = int(response.headers.get("Retry-After", 60))
    print(f"Rate limited — server says wait {retry_after}s")
Mistake 3: Using wrong model names across providers
Each provider has its own naming scheme. llama-3.1-8b-instant works on Groq but not Together AI. Your router needs a model name mapping.
MODEL_MAP = {
    "llama-3.1-8b": {
        "groq": "llama-3.1-8b-instant",
        "together": "meta-llama/Llama-3.1-8B-Instruct-Turbo",
        "openrouter": "meta-llama/llama-3.1-8b-instruct"
    }
}

def get_model_name(logical_name, provider):
    return MODEL_MAP.get(logical_name, {}).get(provider, logical_name)

print(get_model_name("llama-3.1-8b", "groq"))
print(get_model_name("llama-3.1-8b", "together"))
print(get_model_name("llama-3.1-8b", "openrouter"))
llama-3.1-8b-instant
meta-llama/Llama-3.1-8B-Instruct-Turbo
meta-llama/llama-3.1-8b-instruct
Mistake 4: Not logging fallback events
If your router silently falls back, you’ll never know there’s a problem. Always log which provider handled each request. Without logs, you’re flying blind.
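A minimal version with the stdlib logging module might look like this (the logger name and message format are just suggestions):

```python
import logging

# Structured-ish logging for fallback events
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("llm_router")

def log_request(provider, latency_ms, fallback_errors):
    """Log which provider served the request and whether fallbacks fired."""
    if fallback_errors:
        log.warning("served by %s after %d fallback(s): %s",
                    provider, len(fallback_errors),
                    [e["provider"] for e in fallback_errors])
    else:
        log.info("served by %s in %dms", provider, latency_ms)

# Call this after every route_request() result
log_request("together", 840, [{"provider": "groq", "error": "429"}])
```

A warning-level entry on every fallback means your log aggregator can alert you the first time Groq starts rate-limiting you, not three days later.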
Summary
You built a multi-provider LLM router from scratch. Here’s what you covered:
- Three providers: Groq (speed), Together AI (cost-predictable open-source), OpenRouter (universal gateway)
- Raw HTTP calls: Each uses the OpenAI-compatible chat completions format
- Basic fallback: Try providers in order, catch errors, move to the next
- Cost-based routing: Sort by price, cheapest first
- Latency-based routing: Track response times, route to the fastest
- Error classification: Separate rate limits from permanent errors
- Cooldown timers: Skip rate-limited providers temporarily
- Production router class: Combines all strategies with logging and stats
FAQ
Can I mix proprietary and open-source models in one router?
Yes, and it’s one of the best use cases. Route cheap tasks (summarization, classification) to open-source models on Together AI. Route complex tasks (reasoning, code generation) to GPT-4o via OpenRouter.
def route_by_task(prompt, task_type="general"):
    if task_type == "simple":
        return route_request(prompt, providers=[
            ("together", call_together), ("groq", call_groq)])
    return route_request(prompt, providers=[
        ("openrouter", call_openrouter), ("together", call_together)])
What’s the latency overhead of a router layer?
Almost zero. The routing logic (sorting, checking cooldowns) takes microseconds. The real latency comes from HTTP requests. A failed attempt adds one round-trip before fallback. That’s why short timeouts (5-10 seconds) matter.
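One way to encode those short timeouts is a per-provider table. requests accepts a (connect, read) tuple, and the specific values below are assumptions for illustration, not recommendations from any provider:

```python
# Hypothetical per-provider timeouts: keep the connect timeout short so the
# router fails over quickly when a provider is unreachable.
PROVIDER_TIMEOUTS = {
    "groq": (3.05, 8),         # fast provider, give up early
    "together": (3.05, 15),
    "openrouter": (3.05, 20),  # gateway adds routing overhead
}

# Usage inside a provider call:
# requests.post(url, headers=headers, json=payload,
#               timeout=PROVIDER_TIMEOUTS["groq"])
print(PROVIDER_TIMEOUTS["groq"])
```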
Do I need all three providers?
Two is plenty for most cases. Groq + OpenRouter gives you speed plus universal fallback. Together AI + Groq gives you cost optimization plus speed. Add the third when you need it.
How do I handle streaming with a router?
Streaming adds complexity. You don’t know if a provider failed until you’re partway through. The safest approach: send a non-streaming “ping” first (1 token), then stream from the same provider. Or accept that mid-stream fallback means restarting.
References
- Groq API documentation — Chat Completions
- Together AI documentation — Inference API
- OpenRouter documentation — Provider Routing
- LiteLLM documentation — Router and Load Balancing
- LiteLLM documentation — Fallbacks and Reliability
- OpenRouter documentation — Quickstart Guide
- Groq pricing — Tokens as a Service
- Together AI — Serverless Inference
- RouteLLM — A Framework for Serving and Evaluating LLM Routers
- NVIDIA LLM Router Blueprint — Route LLM Requests to the Best Model