LLM API Router: Groq, Together AI & OpenRouter

Build a multi-provider LLM router in Python with cost-based routing, latency tracking, and automatic fallbacks across Groq, Together AI, and OpenRouter.

Written by Selva Prabhakaran | 31 min read

Build a smart router that picks the cheapest or fastest LLM provider for each request — and falls back automatically when one goes down.

You’ve got three LLM providers. Groq is blazing fast. Together AI runs open-source models at solid prices. OpenRouter gives you a single gateway to hundreds of models.

But what happens when Groq hits its rate limit at 2 AM and your app goes silent?

That’s the problem a multi-provider router solves. You define a priority chain. The router tries the first provider. If it fails, it moves to the next. Your users never notice. In this tutorial, you’ll build that router from scratch using raw HTTP requests.

What Is a Multi-Provider LLM Router?

A multi-provider LLM router sits between your application and multiple LLM APIs. It decides which provider handles each request.

Think of it like a load balancer for AI. A web load balancer distributes traffic across servers. An LLM router distributes inference requests across providers like Groq, Together AI, and OpenRouter.

Why would you need one? Three reasons:

| Reason | What It Solves |
| --- | --- |
| Reliability | If Provider A goes down, Provider B takes over |
| Cost optimization | Route cheap requests to the cheapest provider |
| Latency optimization | Route time-sensitive requests to the fastest one |

The router doesn’t change what you send. It translates your prompt into each provider’s API format, tries them in order, and returns the first successful response.

Key Insight: A multi-provider router turns unreliable individual providers into a reliable system. No single provider guarantees 100% uptime — but three providers with automatic fallback come close.
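To put a rough number on that intuition: if each provider is up 99% of the time and failures are independent (a simplifying assumption — real outages can correlate, e.g. a shared upstream incident), the chance all three are down at once is tiny:

```python
# Availability of a 3-provider fallback chain, assuming independent
# failures and 99% uptime per provider (illustrative numbers only).
per_provider_uptime = 0.99
p_all_down = (1 - per_provider_uptime) ** 3   # 0.01^3 = 1e-06
combined_uptime = 1 - p_all_down

print(f"Single provider uptime: {per_provider_uptime:.2%}")
print(f"Chance all three down:  {p_all_down:.6%}")
print(f"Combined uptime:        {combined_uptime:.4%}")
```

Under these assumptions the chain's theoretical uptime is 99.9999% — "six nines" from three ordinary providers.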

The Three Providers

Before we write code, you need to understand what makes each provider different. They’re not interchangeable — each has a sweet spot.

| Provider | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Groq | Ultra-low latency (0.13s TTFT), free tier | Tight rate limits, fewer models | Speed-critical requests |
| Together AI | Fixed pricing, 200+ open models | No proprietary models (no GPT/Claude) | Cost-predictable open-source inference |
| OpenRouter | 500+ models, built-in fallback | 5-10% markup, 25-40ms routing overhead | Universal fallback, model variety |

Groq runs inference on custom LPU (Language Processing Unit) hardware. It’s the speed king. First-token latency hits 0.13 seconds on short prompts. The free tier works without a credit card, though rate limits are tight.

Together AI hosts 200+ open-source models on its own GPU clusters. No routing markup — you pay fixed rates per model. It’s your best bet for a specific open-source model like Llama 3 at a predictable price.

OpenRouter is a gateway, not a provider. It doesn’t run models itself. It routes your request to whichever provider offers the best deal — and adds a 5-10% markup. Think of it as the fallback of last resort.

Prerequisites

  • Python version: 3.10+
  • Required library: requests (pip install requests)
  • API keys: One from each provider (setup below)
  • Time to complete: ~25 minutes
  • Cost: Under $0.01 total (all three have free tiers or credits)

Pyodide note: The requests library doesn’t work natively in browser-based Python (Pyodide). You’d need the pyodide-http patch. The code in this tutorial targets standard Python environments.

Get Your API Keys

Each provider needs its own API key. Here’s where to get them.

Groq: Go to console.groq.com → API Keys → Create API Key. No credit card needed for the free tier.

Together AI: Go to api.together.xyz → Settings → API Keys → Generate. You’ll get free credits on signup.

OpenRouter: Go to openrouter.ai/keys → Create a new key. Add credits or use the free tier.

Store all three keys as environment variables. Never hardcode them in scripts.

import os
import requests
import time
import json

# Load API keys from environment variables
# Create a .env file with: GROQ_API_KEY, TOGETHER_API_KEY, OPENROUTER_API_KEY
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "your_groq_key_here")
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY", "your_together_key_here")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "your_openrouter_key_here")

print("Keys loaded successfully")
Keys loaded successfully
Tip: Use `python-dotenv` for local development. Install it with `pip install python-dotenv`, create a `.env` file in your project root, and add `from dotenv import load_dotenv; load_dotenv()` at the top of your script.

Call Each Provider with Raw HTTP

Every LLM provider exposes a REST API. You send a POST request with your prompt. You get back JSON with the response. All three providers use the same format — OpenAI-compatible chat completions. Same structure. Different endpoints and model names.

Here’s the pattern each call follows:

  1. Build the URL, headers, and JSON payload
  2. Time the request with time.time()
  3. Check the status code
  4. Return a standardized result dictionary

Calling Groq

Groq’s endpoint is https://api.groq.com/openai/v1/chat/completions. We’ll use llama-3.1-8b-instant — one of the fastest models on their free tier. The function sends the prompt, measures latency, and returns a clean result dict.

def call_groq(prompt, model="llama-3.1-8b-instant", max_tokens=256):
    """Send a chat completion request to Groq."""
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {GROQ_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start

    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "groq", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"Groq error {response.status_code}: {response.text}")

Quick check — what does this function return on success? A dictionary with five keys: provider, model, content, latency_ms, and tokens_used. That consistent shape matters. Every provider returns the same structure. So the router can treat them all the same way.

Warning: Groq’s free tier limits you to 30 requests/minute and 14,400 tokens/minute. Exceed these and you’ll get a 429 status code. Our router will catch this and fall back automatically.
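If you'd rather not hit the 429 at all, you can throttle yourself client-side. Here's a minimal sliding-window limiter sketch — `SlidingWindowLimiter` is a name introduced here, and the 30-requests-per-minute figure mirrors Groq's stated free-tier limit; check your account's actual limits before relying on it:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Client-side limiter: allow at most `max_requests` calls
    in any rolling `window_seconds` window."""

    def __init__(self, max_requests=30, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()  # times of recent requests

    def acquire(self):
        """Block until a request slot is free, then record it."""
        now = time.time()
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Wait until the oldest request ages out, then retry
            sleep_for = self.window_seconds - (now - self.timestamps[0])
            time.sleep(max(sleep_for, 0))
            return self.acquire()
        self.timestamps.append(time.time())

limiter = SlidingWindowLimiter(max_requests=30, window_seconds=60)
limiter.acquire()  # call before each Groq request
print(f"Requests recorded in window: {len(limiter.timestamps)}")
```

Calling `limiter.acquire()` before each request keeps you under the per-minute cap; it doesn't help with the tokens-per-minute limit, which you'd track separately from the usage data each response returns.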

Calling Together AI and OpenRouter

Together AI and OpenRouter follow the same pattern. Only the URL, headers, and model name change. Together AI lives at https://api.together.xyz/v1/chat/completions. OpenRouter lives at https://openrouter.ai/api/v1/chat/completions and needs two extra headers: HTTP-Referer and X-Title.

def call_together(prompt, model="meta-llama/Llama-3.1-8B-Instruct-Turbo", max_tokens=256):
    """Send a chat completion request to Together AI."""
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {TOGETHER_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens, "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start

    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "together", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"Together error {response.status_code}: {response.text}")

def call_openrouter(prompt, model="meta-llama/llama-3.1-8b-instruct", max_tokens=256):
    """Send a chat completion request to OpenRouter."""
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://machinelearningplus.com",
        "X-Title": "MLPlus LLM Router Tutorial"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens, "temperature": 0.7
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start

    if response.status_code == 200:
        data = response.json()
        return {
            "provider": "openrouter", "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"OpenRouter error {response.status_code}: {response.text}")

print("All three provider functions defined")
All three provider functions defined

See the pattern? Same payload. Same response parsing. Same result dictionary. The only differences are the URL, the API key, and the model name. That’s the beauty of the OpenAI-compatible format. Most providers adopted it.
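Since the three functions differ only in endpoint, key, headers, and model, you could factor the shared logic into one generic caller. This is an optional refactor sketch — `build_chat_request` and `call_chat` are names introduced here, not used elsewhere in the tutorial:

```python
import time

def build_chat_request(url, api_key, model, prompt, max_tokens=256,
                       extra_headers=None):
    """Build the (url, headers, payload) triple shared by all three APIs."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        **(extra_headers or {})
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7
    }
    return url, headers, payload

def call_chat(provider_name, url, api_key, model, prompt, **kwargs):
    """Generic OpenAI-compatible chat call returning the standard result dict."""
    import requests  # imported here so the builder above works without network deps
    url, headers, payload = build_chat_request(url, api_key, model, prompt, **kwargs)
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {
            "provider": provider_name, "model": model,
            "content": data["choices"][0]["message"]["content"],
            "latency_ms": round(elapsed * 1000),
            "tokens_used": data.get("usage", {})
        }
    raise Exception(f"{provider_name} error {response.status_code}: {response.text}")

# call_groq then collapses to a thin wrapper:
# def call_groq(prompt, **kw):
#     return call_chat("groq", "https://api.groq.com/openai/v1/chat/completions",
#                      GROQ_API_KEY, "llama-3.1-8b-instant", prompt, **kw)
```

The tutorial keeps the three explicit functions for clarity, but in a real codebase this kind of factoring cuts the duplication to one line per provider.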

Build the Basic Router with Fallbacks

Here’s where it all comes together. The router tries providers in order. If one fails — timeout, rate limit, server error — it catches the error and moves to the next. If all fail, it raises a clear exception.

The route_request function loops through provider functions. The first successful response wins. Every failure gets logged so you can debug later.

def route_request(prompt, providers=None, max_tokens=256):
    """
    Try each provider in order. Return the first successful response.
    Falls back automatically on any error.
    """
    if providers is None:
        providers = [
            ("groq", call_groq),
            ("together", call_together),
            ("openrouter", call_openrouter)
        ]
    errors = []
    for name, call_fn in providers:
        try:
            result = call_fn(prompt, max_tokens=max_tokens)
            result["fallback_errors"] = errors
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
            print(f"[FALLBACK] {name} failed: {e}")

    raise Exception(f"All providers failed: {errors}")

Let’s test it. If your Groq key is valid, it should respond first. If not, the router falls back.

result = route_request("Explain gradient descent in two sentences.")
print(f"Provider used: {result['provider']}")
print(f"Latency: {result['latency_ms']}ms")
print(f"Response: {result['content'][:200]}")
if result["fallback_errors"]:
    print(f"Fallbacks triggered: {len(result['fallback_errors'])}")

That’s the entire fallback mechanism in about 20 lines. No external libraries. No complex configuration. Just a try/except loop with logging.

Key Insight: The fallback order defines your cost-speed tradeoff. Groq first means you optimize for speed. Together AI first means you optimize for cost. OpenRouter last means you always have a safety net.

Predict the output: if Groq returns a 429 error and Together AI succeeds, what will result["provider"] be? It’ll be "together". And result["fallback_errors"] will contain one entry with the Groq error.
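You can verify that prediction without touching the network by feeding `route_request` stub provider functions (the stubs below are illustrative; the real functions hit the APIs):

```python
def route_request(prompt, providers, max_tokens=256):
    # A trimmed copy of route_request from above, repeated so this
    # block runs standalone without network access.
    errors = []
    for name, call_fn in providers:
        try:
            result = call_fn(prompt, max_tokens=max_tokens)
            result["fallback_errors"] = errors
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
            print(f"[FALLBACK] {name} failed: {e}")
    raise Exception(f"All providers failed: {errors}")

def fake_groq(prompt, max_tokens=256):
    raise Exception("Groq error 429: rate limit exceeded")  # simulated 429

def fake_together(prompt, max_tokens=256):
    return {"provider": "together", "model": "stub",
            "content": "stub response", "latency_ms": 1, "tokens_used": {}}

result = route_request("test", providers=[("groq", fake_groq),
                                          ("together", fake_together)])
print(result["provider"])               # together
print(len(result["fallback_errors"]))   # 1
```

Stubbing providers like this is also a handy way to unit-test routing logic without burning API credits.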

Add Cost-Based Routing

The basic router always tries providers in the same order. But what if you want the cheapest available option? Cost-based routing sorts providers by price before trying them.

We’ll define a price table with per-token costs. The router sorts by cost and tries the cheapest first. Here are rough rates as of early 2026. Check each provider’s pricing page for current numbers.

PROVIDER_PRICING = {
    "groq": {
        "model": "llama-3.1-8b-instant",
        "input_per_million": 0.05,
        "output_per_million": 0.08,
        "call_fn": call_groq
    },
    "together": {
        "model": "meta-llama/Llama-3.1-8B-Instruct-Turbo",
        "input_per_million": 0.18,
        "output_per_million": 0.18,
        "call_fn": call_together
    },
    "openrouter": {
        # Note: OpenRouter adds 5-10% markup on top of base provider pricing
        "model": "meta-llama/llama-3.1-8b-instruct",
        "input_per_million": 0.06,
        "output_per_million": 0.06,
        "call_fn": call_openrouter
    }
}

def estimate_cost(provider_name, input_tokens=100, output_tokens=256):
    """Estimate the cost for a request given token counts."""
    pricing = PROVIDER_PRICING[provider_name]
    input_cost = (input_tokens / 1_000_000) * pricing["input_per_million"]
    output_cost = (output_tokens / 1_000_000) * pricing["output_per_million"]
    return input_cost + output_cost

for name in sorted(PROVIDER_PRICING, key=lambda n: estimate_cost(n)):
    cost = estimate_cost(name)
    print(f"{name:12s}: ${cost:.8f} per request (est. 100 in + 256 out)")
openrouter  : $0.00002136 per request (est. 100 in + 256 out)
groq        : $0.00002548 per request (est. 100 in + 256 out)
together    : $0.00006408 per request (est. 100 in + 256 out)

OpenRouter’s listed rate is lowest here. But the 5-10% markup isn’t fully reflected. In practice, Groq’s free tier makes it zero cost until you exceed rate limits.
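If you want the estimate to account for the markup, you can fold a multiplier into the rate math. A quick sketch — the 1.10 factor reflects the worst case of OpenRouter's stated 5-10% markup, and `estimate_cost_with_markup` is a name introduced here:

```python
def estimate_cost_with_markup(input_per_million, output_per_million,
                              input_tokens=100, output_tokens=256,
                              markup=1.10):
    """Cost estimate with a gateway markup multiplier applied."""
    base = (input_tokens / 1_000_000) * input_per_million \
         + (output_tokens / 1_000_000) * output_per_million
    return base * markup

# OpenRouter's listed $0.06/M rates with a worst-case 10% markup
cost = estimate_cost_with_markup(0.06, 0.06, markup=1.10)
print(f"${cost:.8f}")  # ~$0.00002350 — still below Groq's $0.00002548 estimate
```

Even with the markup applied, the ranking barely moves at these token counts — which is why logging actual spend matters more than tweaking estimates.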

Now the cost-based router. It sorts providers by estimated cost, then calls route_request with that ordering.

def route_by_cost(prompt, input_tokens=100, output_tokens=256, max_tokens=256):
    """Route to the cheapest provider first. Fall back on failure."""
    ranked = sorted(
        PROVIDER_PRICING.items(),
        key=lambda item: estimate_cost(item[0], input_tokens, output_tokens)
    )
    providers = [(name, info["call_fn"]) for name, info in ranked]
    print(f"Cost-based order: {[p[0] for p in providers]}")
    return route_request(prompt, providers=providers, max_tokens=max_tokens)

result = route_by_cost("What is backpropagation in one sentence?")
print(f"Cheapest available: {result['provider']}")
Tip: Track actual costs, not estimates. Each provider returns token usage in the response. Log it. After a few hundred requests, you’ll have real cost data per provider — and it’ll differ from published rates.
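A minimal sketch of that logging, assuming the usage dict follows the OpenAI shape (`prompt_tokens` / `completion_tokens`) and reusing the illustrative rates from the pricing table above — `log_actual_cost` is a name introduced here:

```python
# Accumulate real per-provider spend from the usage data each response returns.
# Rates mirror the illustrative PROVIDER_PRICING numbers above.
RATES = {
    "groq": {"input_per_million": 0.05, "output_per_million": 0.08},
    "together": {"input_per_million": 0.18, "output_per_million": 0.18},
}

ACTUAL_COSTS = {name: 0.0 for name in RATES}

def log_actual_cost(provider, usage):
    """Add the real cost of one response, given its usage dict."""
    r = RATES[provider]
    cost = (usage.get("prompt_tokens", 0) / 1_000_000) * r["input_per_million"] \
         + (usage.get("completion_tokens", 0) / 1_000_000) * r["output_per_million"]
    ACTUAL_COSTS[provider] += cost
    return cost

# e.g. after result = route_request(...):
#     log_actual_cost(result["provider"], result["tokens_used"])
log_actual_cost("groq", {"prompt_tokens": 120, "completion_tokens": 300})
print({k: f"${v:.8f}" for k, v in ACTUAL_COSTS.items()})
```

After a few hundred requests, `ACTUAL_COSTS` gives you the real per-provider spend the Tip above is talking about.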

Exercise 1: Build a Latency-Based Router

You’ve seen cost-based routing. Now build a latency-based version. Sort providers by recent average latency and try the fastest first.

**Instructions:** Complete the `route_by_latency` function below. It should track each provider’s average latency and route to the fastest one first.

**Starter Code:**

LATENCY_TRACKER = {
    "groq": {"total_ms": 0, "count": 0},
    "together": {"total_ms": 0, "count": 0},
    "openrouter": {"total_ms": 0, "count": 0}
}

def update_latency(provider_name, latency_ms):
    tracker = LATENCY_TRACKER[provider_name]
    tracker["total_ms"] += latency_ms
    tracker["count"] += 1

def get_avg_latency(provider_name):
    tracker = LATENCY_TRACKER[provider_name]
    if tracker["count"] == 0:
        return 9999
    return tracker["total_ms"] / tracker["count"]

def route_by_latency(prompt, max_tokens=256):
    # YOUR CODE HERE
    # 1. Sort PROVIDER_PRICING keys by get_avg_latency
    # 2. Build providers list of (name, call_fn) tuples
    # 3. Call route_request
    # 4. Update latency tracker with result
    # 5. Return result
    pass

**Hints:**

1. Use `sorted(PROVIDER_PRICING.keys(), key=get_avg_latency)` to rank providers.
2. After getting the result, call `update_latency(result["provider"], result["latency_ms"])`.

**Solution:**

def route_by_latency(prompt, max_tokens=256):
    ranked = sorted(PROVIDER_PRICING.keys(), key=get_avg_latency)
    providers = [(name, PROVIDER_PRICING[name]["call_fn"]) for name in ranked]
    print(f"Latency-based order: {[p[0] for p in providers]}")
    result = route_request(prompt, providers=providers, max_tokens=max_tokens)
    update_latency(result["provider"], result["latency_ms"])
    return result

# First call — all averages default to 9999ms, so the stable sort keeps dict insertion order
result = route_by_latency("What is attention in transformers?")
print(f"Fastest: {result['provider']} ({result['latency_ms']}ms)")

# Second call — uses real data from first call
result = route_by_latency("What is self-attention?")
print(f"Fastest: {result['provider']} ({result['latency_ms']}ms)")

**Explanation:** On the first call, all providers default to 9999ms, so Python’s stable sort keeps them in dictionary insertion order (groq, together, openrouter). After one call, the winner’s latency gets recorded. Future calls route to the fastest provider first.

Compare Providers Side by Side

Before you deploy a router in production, benchmark your providers. Let’s send the same prompt to all three and compare latency, cost, and response quality in a single table.

The benchmark_providers function calls every provider with the same input. It shows four columns: provider name, latency in ms, estimated cost, and a response preview. Latency will vary based on your location and time of day.

def benchmark_providers(prompt, max_tokens=256):
    """Call all providers with the same prompt and compare."""
    results = []
    for name, info in PROVIDER_PRICING.items():
        try:
            result = info["call_fn"](prompt, max_tokens=max_tokens)
            cost = estimate_cost(name,
                result["tokens_used"].get("prompt_tokens", 100),
                result["tokens_used"].get("completion_tokens", 256))
            result["est_cost"] = cost
            results.append(result)
        except Exception as e:
            results.append({"provider": name, "error": str(e),
                           "latency_ms": -1, "est_cost": 0})

    print(f"\n{'Provider':<12} {'Latency':<10} {'Est Cost':<15} {'Preview'}")
    print("-" * 70)
    for r in results:
        if "error" in r:
            print(f"{r['provider']:<12} {'FAILED':<10} {'-':<15} {r['error'][:35]}")
        else:
            preview = r["content"][:35].replace("\n", " ")
            print(f"{r['provider']:<12} {r['latency_ms']:<10}ms ${r['est_cost']:<14.8f} {preview}...")
    return results

results = benchmark_providers("Explain what a neural network is in 2 sentences.")

You’ll likely see Groq winning on latency by a wide margin. Together AI should be competitive on cost. OpenRouter sits in the middle — it adds routing overhead but gives you the widest model selection.

Handle Rate Limits and Errors Gracefully

Ever seen a 429 status code at 3 AM with users waiting? Rate limits are the most common failure mode in multi-provider setups. Every provider has them. When you hit one, you get that 429 HTTP status code — often with a Retry-After header.

A good router tells two error types apart. Transient errors (rate limits, timeouts, 503s) deserve a fallback. Permanent errors (bad API key, model not found) should fail fast. Let’s define custom exceptions for each.

class RateLimitError(Exception):
    """Raised when a provider returns 429 Too Many Requests."""
    def __init__(self, provider, retry_after=None):
        self.provider = provider
        self.retry_after = retry_after
        super().__init__(f"{provider} rate limited. Retry after: {retry_after}s")

class ProviderError(Exception):
    """Raised for non-retryable errors (auth, model not found)."""
    def __init__(self, provider, status_code, message):
        self.provider = provider
        self.status_code = status_code
        super().__init__(f"{provider} error {status_code}: {message}")

Now a wrapper function that classifies errors. It calls any provider function and translates HTTP status codes into our custom exceptions.

def call_provider_safe(name, call_fn, prompt, max_tokens=256):
    """Wrap a provider call with structured error handling."""
    try:
        return call_fn(prompt, max_tokens=max_tokens)
    except Exception as e:
        error_str = str(e)
        if "429" in error_str:
            raise RateLimitError(name, retry_after=60)
        elif "401" in error_str or "403" in error_str:
            raise ProviderError(name, 401, "Invalid API key")
        elif "404" in error_str:
            raise ProviderError(name, 404, "Model not found")
        else:
            raise

print("Error classes and safe wrapper defined")
Error classes and safe wrapper defined
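The wrapper above hardcodes a 60-second retry hint. If you keep the raw response around, you can read the actual `Retry-After` header instead — it's either an integer number of seconds or an HTTP date. A parsing sketch, with `parse_retry_after` as a name introduced here:

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(headers, default=60):
    """Return seconds to wait from a Retry-After header.
    Handles both delta-seconds and HTTP-date forms."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    if value.isdigit():
        return int(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        dt = parsedate_to_datetime(value)
        return max(0, int((dt - datetime.now(timezone.utc)).total_seconds()))
    except (TypeError, ValueError):
        return default

print(parse_retry_after({"Retry-After": "30"}))  # 30
print(parse_retry_after({}))                     # 60
```

With `requests`, you'd pass `response.headers` straight in; the parsed value can then feed the cooldown timer instead of a fixed 60 seconds.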

The enhanced router uses these error classes. Rate-limited providers get a fallback. Permanently broken providers get disabled for the session. Unknown errors trigger a fallback too.

DISABLED_PROVIDERS = set()  # providers with permanent errors, skipped for the session

def route_with_error_handling(prompt, max_tokens=256):
    """Smart router with error classification and fallback."""
    providers = [
        ("groq", call_groq),
        ("together", call_together),
        ("openrouter", call_openrouter)
    ]
    errors = []

    for name, call_fn in providers:
        if name in DISABLED_PROVIDERS:
            continue
        try:
            result = call_provider_safe(name, call_fn, prompt, max_tokens)
            result["fallback_errors"] = errors
            return result
        except RateLimitError:
            errors.append({"provider": name, "type": "rate_limit"})
            print(f"[RATE LIMITED] {name} — falling back")
        except ProviderError as e:
            errors.append({"provider": name, "type": "permanent"})
            DISABLED_PROVIDERS.add(name)
            print(f"[DISABLED] {name} — {e}")
        except Exception as e:
            errors.append({"provider": name, "type": "unknown"})
            print(f"[ERROR] {name} — {e}")

    raise Exception(f"All providers failed: {errors}")

result = route_with_error_handling("What is fine-tuning?")
print(f"Provider: {result['provider']}")
Warning: Don’t retry rate-limited providers immediately. If Groq returns 429, hammering it again makes things worse. Fall back instead. In production, add a cooldown timer — skip the rate-limited provider for 60 seconds.

Exercise 2: Add a Cooldown Timer

The current router has no cooldown for rate limits. Add a time-based cooldown so rate-limited providers get re-enabled automatically after a waiting period.

**Instructions:** Complete the `is_provider_available` function. When a provider gets rate-limited, record the timestamp. Don’t try it again until the cooldown expires.

**Starter Code:**

PROVIDER_COOLDOWNS = {}
COOLDOWN_SECONDS = 60

def set_cooldown(provider_name):
    PROVIDER_COOLDOWNS[provider_name] = time.time() + COOLDOWN_SECONDS
    print(f"[COOLDOWN] {provider_name} cooling down for {COOLDOWN_SECONDS}s")

def is_provider_available(provider_name):
    # YOUR CODE HERE
    # Return True if no cooldown or cooldown expired
    # Clean up expired entries
    pass

**Hints:**

1. If `provider_name` isn’t in `PROVIDER_COOLDOWNS`, it’s available — return `True`.
2. Compare `PROVIDER_COOLDOWNS[provider_name]` with `time.time()`. If current time is past the cooldown, delete the entry and return `True`.

**Solution:**

def is_provider_available(provider_name):
    if provider_name not in PROVIDER_COOLDOWNS:
        return True
    if time.time() >= PROVIDER_COOLDOWNS[provider_name]:
        del PROVIDER_COOLDOWNS[provider_name]
        return True
    remaining = PROVIDER_COOLDOWNS[provider_name] - time.time()
    print(f"[SKIP] {provider_name} cooling down ({remaining:.0f}s left)")
    return False

set_cooldown("groq")
print(f"Groq available? {is_provider_available('groq')}")
print(f"Together available? {is_provider_available('together')}")
[COOLDOWN] groq cooling down for 60s
Groq available? False
Together available? True

**Explanation:** The function checks the cooldown dictionary. No entry means available. Expired entry gets cleaned up and returns True. Active cooldown prints the remaining time and returns False.

Build a Production-Ready Router Class

Let’s combine everything into one class. The LLMRouter supports multiple routing strategies, automatic fallbacks, cooldown timers, and request logging.

The constructor takes a list of provider configs and a default strategy. Each config includes the name, call function, and pricing.

class LLMRouter:
    """Multi-provider LLM router with fallbacks and routing strategies."""

    def __init__(self, providers, default_strategy="cost"):
        self.providers = providers
        self.default_strategy = default_strategy
        self.cooldowns = {}
        self.latency_history = {p["name"]: [] for p in providers}
        self.request_log = []
        self.cooldown_seconds = 60

The helper methods handle cooldowns, latency tracking, and ranking. _rank_providers sorts available providers by the chosen strategy.

    def _is_available(self, name):
        if name not in self.cooldowns:
            return True
        if time.time() >= self.cooldowns[name]:
            del self.cooldowns[name]
            return True
        return False

    def _set_cooldown(self, name):
        self.cooldowns[name] = time.time() + self.cooldown_seconds

    def _get_avg_latency(self, name):
        history = self.latency_history[name]
        if not history:
            return 9999
        recent = history[-10:]  # Last 10 requests
        return sum(recent) / len(recent)

    def _rank_providers(self, strategy):
        available = [p for p in self.providers if self._is_available(p["name"])]
        if strategy == "cost":
            return sorted(available, key=lambda p: p["cost_per_million_output"])
        elif strategy == "latency":
            return sorted(available, key=lambda p: self._get_avg_latency(p["name"]))
        return available  # "priority" — use defined order

The route method is the main entry point. It ranks providers, tries each one, and handles fallbacks. Success logs the latency. A 429 error sets a cooldown.

    def route(self, prompt, strategy=None, max_tokens=256):
        strategy = strategy or self.default_strategy
        ranked = self._rank_providers(strategy)
        if not ranked:
            raise Exception("No providers available (all in cooldown)")

        errors = []
        for provider in ranked:
            name = provider["name"]
            try:
                result = provider["call_fn"](prompt, max_tokens=max_tokens)
                self.latency_history[name].append(result["latency_ms"])
                self.request_log.append({
                    "provider": name, "strategy": strategy,
                    "latency_ms": result["latency_ms"], "success": True
                })
                result["fallback_errors"] = errors
                result["strategy_used"] = strategy
                return result
            except Exception as e:
                errors.append({"provider": name, "error": str(e)})
                if "429" in str(e):
                    self._set_cooldown(name)
                self.request_log.append({
                    "provider": name, "strategy": strategy,
                    "error": str(e), "success": False
                })
        raise Exception(f"All providers failed: {errors}")

    def get_stats(self):
        """Return routing statistics."""
        total = len(self.request_log)
        successes = sum(1 for r in self.request_log if r["success"])
        by_provider = {}
        for r in self.request_log:
            name = r["provider"]
            if name not in by_provider:
                by_provider[name] = {"success": 0, "fail": 0}
            key = "success" if r["success"] else "fail"
            by_provider[name][key] += 1
        return {
            "total_requests": total,
            "success_rate": f"{successes/total*100:.1f}%" if total else "N/A",
            "by_provider": by_provider,
            "avg_latency": {
                n: f"{self._get_avg_latency(n):.0f}ms"
                for n in self.latency_history
            }
        }

print("LLMRouter class defined")
LLMRouter class defined

Note: In a real project, you’d put this class in a single file. We split it here for readability — each block covers one responsibility.

Now let’s put it to work. We’ll configure all three providers and test different routing strategies.

router = LLMRouter(
    providers=[
        {"name": "groq", "call_fn": call_groq, "cost_per_million_output": 0.08},
        {"name": "together", "call_fn": call_together, "cost_per_million_output": 0.18},
        {"name": "openrouter", "call_fn": call_openrouter, "cost_per_million_output": 0.06}
    ],
    default_strategy="cost"
)

# Cost-based routing
result = router.route("What is transfer learning?", strategy="cost")
print(f"Cost routing -> {result['provider']} ({result['latency_ms']}ms)")

# Latency-based routing
result = router.route("What is a loss function?", strategy="latency")
print(f"Latency routing -> {result['provider']} ({result['latency_ms']}ms)")

# Priority-based routing (use defined order)
result = router.route("What is batch normalization?", strategy="priority")
print(f"Priority routing -> {result['provider']} ({result['latency_ms']}ms)")

print(f"\nRouter stats:")
print(json.dumps(router.get_stats(), indent=2))

You can swap strategies per request. Use cost routing for batch jobs. Use latency routing for user-facing responses. Use priority for maximum control.
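One simple way to wire that into an application is a small dispatch helper that maps request context to a strategy — `pick_strategy` is a hypothetical helper, not part of the LLMRouter class:

```python
def pick_strategy(user_facing=False, batch=False):
    """Map request context to a routing strategy.
    Illustrative policy: users get speed, batch jobs get savings."""
    if user_facing:
        return "latency"
    if batch:
        return "cost"
    return "priority"

print(pick_strategy(user_facing=True))  # latency
print(pick_strategy(batch=True))        # cost
# Then: router.route(prompt, strategy=pick_strategy(user_facing=True))
```

Centralizing the policy in one function keeps the routing decision out of your request handlers.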

Exercise 3: Add Weighted Random Routing

Add a "weighted" strategy where providers are chosen randomly but weighted by a reliability score. Higher weight means more likely to be chosen first.

**Instructions:** Write a `weighted_random_order` function that returns providers in random order weighted by a `weight` field. Then add it as a strategy in `_rank_providers`.

**Starter Code:**

import random

def weighted_random_order(providers):
    # YOUR CODE HERE
    # Return providers in weighted random order (no duplicates)
    # Higher weight = more likely to be first
    pass

**Hints:**

1. Use `random.choices(remaining, weights=…, k=1)` to pick one weighted item at a time.
2. Remove the chosen item from the remaining list. Repeat until empty.

**Solution:**

import random

def weighted_random_order(providers):
    """Return providers in weighted random order (no duplicates)."""
    remaining = list(providers)
    result = []
    while remaining:
        weights = [p.get("weight", 1) for p in remaining]
        chosen = random.choices(remaining, weights=weights, k=1)[0]
        result.append(chosen)
        remaining.remove(chosen)
    return result

# Test distribution over 1000 runs
test_providers = [
    {"name": "groq", "weight": 5},
    {"name": "together", "weight": 3},
    {"name": "openrouter", "weight": 2}
]
first_place = {"groq": 0, "together": 0, "openrouter": 0}
for _ in range(1000):
    order = weighted_random_order(test_providers)
    first_place[order[0]["name"]] += 1

print("First-place distribution (1000 runs):")
for name, count in sorted(first_place.items(), key=lambda x: -x[1]):
    print(f"  {name}: {count/10:.1f}%")
First-place distribution (1000 runs):
  groq: 50.2%
  together: 30.1%
  openrouter: 19.7%

**Explanation:** The function picks one weighted random item at a time from the remaining list. Groq (weight 5) gets picked first ~50% of the time. Together (weight 3) gets ~30%. This distributes load while still favoring reliable providers.

What Is LiteLLM? A Higher-Level Alternative

Everything we’ve built uses raw HTTP. That’s great for learning and for full control. But there’s a popular library that handles multi-provider routing out of the box.

LiteLLM is an open-source Python SDK. It wraps 100+ providers behind one completion() call. Routing, fallbacks, load balancing, cost tracking — it does all of it. Here’s what our router looks like with LiteLLM. This is a preview, not runnable code.

# Preview only — requires: pip install litellm
# LiteLLM doesn't work in Pyodide (needs httpx)

# from litellm import Router
#
# router = Router(
#     model_list=[
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "groq/llama-3.1-8b-instant", "api_key": GROQ_API_KEY}},
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "together_ai/meta-llama/Llama-3.1-8B-Instruct-Turbo",
#             "api_key": TOGETHER_API_KEY}},
#         {"model_name": "fast-llm", "litellm_params": {
#             "model": "openrouter/meta-llama/llama-3.1-8b-instruct",
#             "api_key": OPENROUTER_API_KEY}}
#     ],
#     routing_strategy="cost-based-routing",
#     fallbacks=[{"fast-llm": ["fast-llm"]}]
# )

print("LiteLLM preview — see docs.litellm.ai/docs/routing")
LiteLLM preview — see docs.litellm.ai/docs/routing

LiteLLM groups all three providers under one virtual model name (fast-llm). It supports six strategies: simple-shuffle, least-busy, usage-based-routing, usage-based-routing-v2, latency-based-routing, and cost-based-routing. It also supports async calls via router.acompletion(). Useful if you’re building with asyncio or FastAPI.

Should you use LiteLLM or build your own? Here’s how I think about it:

| Scenario | Best Approach |
|---|---|
| Learning how routing works | Build your own (this tutorial) |
| Quick prototype, < 10K req/day | LiteLLM — saves time |
| Production with custom logic | Build your own — full control |
| Production with standard routing | LiteLLM — battle-tested |

Other Routing Frameworks Worth Knowing

LiteLLM isn’t the only game in town. Two other frameworks take different approaches:

  • RouteLLM (by LMSYS) routes between a strong model and a weak model based on query complexity. It claims 85% cost savings while keeping 95% of GPT-4 quality. Send easy questions to a cheap model. Send hard ones to an expensive one.
  • NVIDIA LLM Router is an enterprise blueprint that routes requests based on task difficulty. It targets production deployments with GPU infrastructure.

Both solve a different problem than our router. We route across providers (Groq vs Together vs OpenRouter). They route across model tiers (cheap vs expensive). You could combine both. Use our provider router inside a model-tier router.
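Here is a minimal sketch of that combination. The complexity heuristic and the provider chains are illustrative assumptions (real tier routers like RouteLLM use a trained classifier), and `route_request` is stubbed so the snippet runs offline — in the tutorial it makes real HTTP calls.

```python
# Sketch: a model-tier router wrapped around a provider router.
# `route_request` is stubbed here; the real one tries providers in order.
def route_request(prompt, providers):
    name, _ = providers[0]
    return {"provider": name, "content": f"[{name}] answer"}

CHEAP_CHAIN = [("together", None), ("groq", None)]
STRONG_CHAIN = [("openrouter", None), ("together", None)]

def is_complex(prompt):
    # Toy heuristic — a stand-in for a real complexity classifier
    return len(prompt.split()) > 30 or "prove" in prompt.lower()

def tiered_route(prompt):
    chain = STRONG_CHAIN if is_complex(prompt) else CHEAP_CHAIN
    return route_request(prompt, providers=chain)

print(tiered_route("Summarize this sentence.")["provider"])   # cheap tier
print(tiered_route("Prove that the halting problem is undecidable.")["provider"])
```

The tier decision happens first, then the provider router handles fallback within the chosen chain.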

When NOT to Build Your Own Router

Not every project needs a router. Here are cases where simpler approaches work better.

Use OpenRouter alone when you don’t care which provider serves the request. OpenRouter already IS a router — it picks the cheapest provider internally. Adding your own layer on top is double-routing.

Use a single provider when your traffic is predictable. If you’re making 100 requests a day on Groq’s free tier, you don’t need fallbacks. Keep it simple.

Use LiteLLM when you need production routing quickly but don’t have custom requirements. It handles rate limits, retries, cost tracking, and 100+ providers out of the box.

Tip: Start simple, add complexity when you need it. Most projects don’t need a router on day one. Start with one provider. Add a second when you hit rate limits. Build a full router only when simpler approaches fall short.

Common Mistakes and Troubleshooting

Here are the errors you’ll hit when working with multi-provider setups.

Mistake 1: Not handling timeouts separately from other errors

# Bad — everything gets the same treatment
try:
    result = call_groq(prompt)
except Exception:
    result = call_together(prompt)

# Better — timeouts get a shorter retry window
try:
    result = call_groq(prompt)
except requests.exceptions.Timeout:
    print("Groq timed out — falling back fast")
    result = call_together(prompt)
except Exception as e:
    print(f"Groq error: {e} — falling back")
    result = call_together(prompt)

Mistake 2: Ignoring the Retry-After header

When a provider returns 429, it often includes a Retry-After header. That header tells you exactly how long to wait. Ignore it and you keep hammering a rate-limited endpoint.

response = requests.post(url, headers=headers, json=payload)
if response.status_code == 429:
    # Retry-After is usually seconds, but may be an HTTP date — guard int()
    header = response.headers.get("Retry-After", "60")
    retry_after = int(header) if header.isdigit() else 60
    print(f"Rate limited — server says wait {retry_after}s")
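A fuller pattern waits out the server-specified delay and retries, capped at a few attempts. This is a sketch with a fake request function so it runs offline; in practice you would swap in `requests.post` and read the real headers.

```python
import time

def fake_post(attempt_log, fail_times=2):
    """Simulate an endpoint that returns 429 twice, then 200."""
    attempt_log.append(1)
    if len(attempt_log) <= fail_times:
        return {"status": 429, "headers": {"Retry-After": "0"}}
    return {"status": 200, "headers": {}}

def post_with_retry(max_attempts=4):
    attempts = []
    for _ in range(max_attempts):
        resp = fake_post(attempts)
        if resp["status"] != 429:
            return resp, len(attempts)
        # Honor the server's requested delay before the next attempt
        wait = int(resp["headers"].get("Retry-After", "1"))
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")

resp, n = post_with_retry()
print(f"Succeeded with status {resp['status']} after {n} attempt(s)")
```

Capping `max_attempts` matters: in a router you usually want to fail over to the next provider rather than wait out a long rate-limit window.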

Mistake 3: Using wrong model names across providers

Each provider has its own naming scheme. llama-3.1-8b-instant works on Groq but not Together AI. Your router needs a model name mapping.

MODEL_MAP = {
    "llama-3.1-8b": {
        "groq": "llama-3.1-8b-instant",
        "together": "meta-llama/Llama-3.1-8B-Instruct-Turbo",
        "openrouter": "meta-llama/llama-3.1-8b-instruct"
    }
}

def get_model_name(logical_name, provider):
    return MODEL_MAP.get(logical_name, {}).get(provider, logical_name)

print(get_model_name("llama-3.1-8b", "groq"))
print(get_model_name("llama-3.1-8b", "together"))
print(get_model_name("llama-3.1-8b", "openrouter"))
llama-3.1-8b-instant
meta-llama/Llama-3.1-8B-Instruct-Turbo
meta-llama/llama-3.1-8b-instruct

Mistake 4: Not logging fallback events

If your router silently falls back, you’ll never know there’s a problem. Always log which provider handled each request. Without logs, you’re flying blind.
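One lightweight way to do this — a sketch, not part of the router class above, with `log_fallback` and `fallback_events` as hypothetical names — is to record every fallback as a structured log event:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_router")

fallback_events = []  # in-memory record; swap for your metrics backend

def log_fallback(failed_provider, error, next_provider):
    """Record a fallback event so silent failovers show up in monitoring."""
    event = {
        "ts": time.time(),
        "failed": failed_provider,
        "error": str(error)[:200],  # truncate long tracebacks
        "fell_back_to": next_provider,
    }
    fallback_events.append(event)
    logger.info("FALLBACK %s", json.dumps(event))

log_fallback("groq", "429 rate limited", "together")
print(f"Recorded {len(fallback_events)} fallback event(s)")
```

Structured JSON events are easy to count and alert on later — e.g. "page me if fallbacks exceed 5% of requests in an hour."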

Summary

You built a multi-provider LLM router from scratch. Here’s what you covered:

  • Three providers: Groq (speed), Together AI (cost-predictable open-source), OpenRouter (universal gateway)
  • Raw HTTP calls: Each uses the OpenAI-compatible chat completions format
  • Basic fallback: Try providers in order, catch errors, move to the next
  • Cost-based routing: Sort by price, cheapest first
  • Latency-based routing: Track response times, route to the fastest
  • Error classification: Separate rate limits from permanent errors
  • Cooldown timers: Skip rate-limited providers temporarily
  • Production router class: Combines all strategies with logging and stats

Practice Exercise

Practice: Build a Router with Circuit Breaker

**Challenge:** Extend `LLMRouter` with a circuit breaker. After 3 consecutive failures, “open” the circuit for 5 minutes. After 5 minutes, allow one probe request. If it succeeds, close the circuit. If it fails, re-open.

**Hint:** Track `failure_count` and `circuit_state` (“closed”, “open”, “half-open”) per provider.

**Solution:**

class CircuitBreakerRouter(LLMRouter):
    def __init__(self, providers, default_strategy="cost"):
        super().__init__(providers, default_strategy)
        self.failure_count = {p["name"]: 0 for p in providers}
        self.circuit_state = {p["name"]: "closed" for p in providers}
        self.circuit_open_until = {}
        self.max_failures = 3
        self.circuit_timeout = 300  # 5 minutes

    def _is_available(self, name):
        state = self.circuit_state[name]
        if state == "closed":
            return True
        if state == "open":
            if time.time() >= self.circuit_open_until.get(name, 0):
                self.circuit_state[name] = "half-open"
                return True
            return False
        return True  # half-open allows one probe

    def _record_success(self, name):
        self.failure_count[name] = 0
        self.circuit_state[name] = "closed"

    def _record_failure(self, name):
        self.failure_count[name] += 1
        if self.failure_count[name] >= self.max_failures:
            self.circuit_state[name] = "open"
            self.circuit_open_until[name] = time.time() + self.circuit_timeout
            print(f"[CIRCUIT OPEN] {name} disabled for {self.circuit_timeout}s")

print("CircuitBreakerRouter defined — extend it for your project")
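To sanity-check the state machine without the full `LLMRouter` dependency, the same transitions can be exercised with a standalone version (the timeout is shortened here so the half-open transition is observable in a quick run):

```python
import time

class Circuit:
    """Minimal circuit-breaker state machine: closed -> open -> half-open."""
    def __init__(self, max_failures=3, timeout=0.1):
        self.failures, self.state = 0, "closed"
        self.open_until = 0.0
        self.max_failures, self.timeout = max_failures, timeout

    def available(self):
        if self.state == "open":
            if time.time() < self.open_until:
                return False
            self.state = "half-open"  # timeout elapsed: allow one probe
        return True

    def record(self, success):
        if success:
            self.failures, self.state = 0, "closed"
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.state = "open"
                self.open_until = time.time() + self.timeout

c = Circuit()
for _ in range(3):
    c.record(success=False)
print(c.state, c.available())   # open, no requests allowed yet
time.sleep(0.15)
print(c.available(), c.state)   # probe allowed, circuit now half-open
c.record(success=True)
print(c.state)                  # probe succeeded, circuit closed again
```

Walking the three states in isolation like this makes it easy to unit-test the breaker before wiring it into real provider calls.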

FAQ

Can I mix proprietary and open-source models in one router?

Yes, and it’s one of the best use cases. Route cheap tasks (summarization, classification) to open-source models on Together AI. Route complex tasks (reasoning, code generation) to GPT-4o via OpenRouter.

def route_by_task(prompt, task_type="general"):
    if task_type == "simple":
        return route_request(prompt, providers=[
            ("together", call_together), ("groq", call_groq)])
    return route_request(prompt, providers=[
        ("openrouter", call_openrouter), ("together", call_together)])

What’s the latency overhead of a router layer?

Almost zero. The routing logic (sorting, checking cooldowns) takes microseconds. The real latency comes from HTTP requests. A failed attempt adds one round-trip before fallback. That’s why short timeouts (5-10 seconds) matter.
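You can verify the "microseconds" claim yourself by timing the ranking step in isolation. Absolute numbers vary by machine, but even with a deliberately oversized list of 100 providers the per-call cost stays far below a single network round-trip:

```python
import time

# Oversized provider list — real routers have 2-5 entries
providers = [{"name": f"p{i}", "cost": i % 7} for i in range(100)]

start = time.perf_counter()
for _ in range(10_000):
    sorted(providers, key=lambda p: p["cost"])
elapsed = time.perf_counter() - start

per_call_us = elapsed / 10_000 * 1e6
print(f"Ranking 100 providers: ~{per_call_us:.1f} microseconds per call")
print("Under one millisecond:", per_call_us < 1000)
```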

Do I need all three providers?

Two is plenty for most cases. Groq + OpenRouter gives you speed plus universal fallback. Together AI + Groq gives you cost optimization plus speed. Add the third when you need it.

How do I handle streaming with a router?

Streaming adds complexity. You don’t know if a provider failed until you’re partway through. The safest approach: send a non-streaming “ping” first (1 token), then stream from the same provider. Or accept that mid-stream fallback means restarting.
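The ping-then-stream idea can be sketched with stubbed providers so it runs without network access. A real probe would send the same request with `max_tokens=1` and no streaming, and a real stream would iterate SSE chunks; here both are simulated:

```python
def ping(provider):
    """Simulated 1-token health probe; pretend groq is currently down."""
    return provider != "groq"

def stream(provider, prompt):
    # Simulated token stream — a real one yields SSE delta chunks
    for token in ["Routers ", "are ", "useful."]:
        yield token

def stream_with_probe(prompt, providers=("groq", "together", "openrouter")):
    for p in providers:
        if ping(p):  # cheap probe before committing to a stream
            print(f"Streaming from {p}")
            return "".join(stream(p, prompt))
    raise RuntimeError("no provider passed the probe")

print(stream_with_probe("What is an LLM router?"))
```

The probe adds one extra round-trip per request, so it only pays off when a broken mid-stream response is more expensive than the added latency.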

References

  1. Groq API documentation — Chat Completions. Link
  2. Together AI documentation — Inference API. Link
  3. OpenRouter documentation — Provider Routing. Link
  4. LiteLLM documentation — Router and Load Balancing. Link
  5. LiteLLM documentation — Fallbacks and Reliability. Link
  6. OpenRouter documentation — Quickstart Guide. Link
  7. Groq pricing — Tokens as a Service. Link
  8. Together AI — Serverless Inference. Link
  9. RouteLLM — A Framework for Serving and Evaluating LLM Routers. Link
  10. NVIDIA LLM Router Blueprint — Route LLM Requests to the Best Model. Link

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: Multi-Provider LLM Router
# Requires: pip install requests python-dotenv
# Python 3.10+

import os
import time
import json
import requests

# --- Configuration ---
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "your_groq_key_here")
TOGETHER_API_KEY = os.environ.get("TOGETHER_API_KEY", "your_together_key_here")
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "your_openrouter_key_here")

# --- Provider Functions ---
def call_groq(prompt, model="llama-3.1-8b-instant", max_tokens=256):
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {GROQ_API_KEY}", "Content-Type": "application/json"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}],
               "max_tokens": max_tokens, "temperature": 0.7}
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {"provider": "groq", "model": model,
                "content": data["choices"][0]["message"]["content"],
                "latency_ms": round(elapsed * 1000), "tokens_used": data.get("usage", {})}
    raise Exception(f"Groq error {response.status_code}: {response.text}")

def call_together(prompt, model="meta-llama/Llama-3.1-8B-Instruct-Turbo", max_tokens=256):
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {"Authorization": f"Bearer {TOGETHER_API_KEY}", "Content-Type": "application/json"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}],
               "max_tokens": max_tokens, "temperature": 0.7}
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {"provider": "together", "model": model,
                "content": data["choices"][0]["message"]["content"],
                "latency_ms": round(elapsed * 1000), "tokens_used": data.get("usage", {})}
    raise Exception(f"Together error {response.status_code}: {response.text}")

def call_openrouter(prompt, model="meta-llama/llama-3.1-8b-instruct", max_tokens=256):
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}", "Content-Type": "application/json",
               "HTTP-Referer": "https://machinelearningplus.com", "X-Title": "MLPlus LLM Router"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}],
               "max_tokens": max_tokens, "temperature": 0.7}
    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    elapsed = time.time() - start
    if response.status_code == 200:
        data = response.json()
        return {"provider": "openrouter", "model": model,
                "content": data["choices"][0]["message"]["content"],
                "latency_ms": round(elapsed * 1000), "tokens_used": data.get("usage", {})}
    raise Exception(f"OpenRouter error {response.status_code}: {response.text}")

# --- Router ---
def route_request(prompt, providers=None, max_tokens=256):
    if providers is None:
        providers = [("groq", call_groq), ("together", call_together),
                     ("openrouter", call_openrouter)]
    errors = []
    for name, call_fn in providers:
        try:
            result = call_fn(prompt, max_tokens=max_tokens)
            result["fallback_errors"] = errors
            return result
        except Exception as e:
            errors.append({"provider": name, "error": str(e)})
            print(f"[FALLBACK] {name} failed: {e}")
    raise Exception(f"All providers failed: {errors}")

# --- Test ---
result = route_request("What is a multi-provider LLM router?")
print(f"Provider: {result['provider']}")
print(f"Latency: {result['latency_ms']}ms")
print(f"Response: {result['content'][:200]}")
print("\nScript completed successfully.")