
Groq vs Fireworks vs Together AI: Speed Benchmark

Build a Python benchmark harness comparing Groq, Fireworks, Together AI, and Replicate on latency, throughput, and cost with runnable code.

Written by Selva Prabhakaran | 28 min read

Four providers, one model, raw HTTP calls. Build a benchmark harness that measures latency, throughput, and cost — then let the numbers pick the right backend for your workload.

Interactive Code Blocks — The Python code blocks in this article are runnable. Click the Run button to execute them right in your browser.

You pick a model — say, Llama 3.1 8B. Four providers will serve it. Same weights, same architecture. But one returns tokens in 200ms. Another takes 3 seconds. A third charges half the price.

How do you pick?

Most teams guess. They read a blog post, pick a provider, and ship. Months later, they realize they’re overpaying or watching users abandon slow responses. That’s fixable — you run a benchmark on your own prompts and let the numbers decide.

Key Insight: Same model, different host, wildly different speed. Where you run your model shapes cost, speed, and throughput more than the model itself.

That’s what we’ll build. A test harness that sends the same prompt to all four using raw HTTP. It tracks total time, time to first word, tokens per second, and cost per million tokens. You’ll also get a scoring tool that ranks them by what matters to you.

What Are Inference Providers and Why Do They Differ?

Think of it this way. You send a prompt to a cloud service. That service runs the model on its own machines. You get tokens back.

So why does speed differ for the same model? The chips doing the work are different. Groq built custom LPU chips just for fast token output. Fireworks uses GPUs with a custom speed engine called FireAttention. Together AI spreads work across many GPUs at once. Replicate boots up fresh machines on demand.

These gaps in speed and price are real — and large.

Provider | Hardware | Key Strength | API Style
---|---|---|---
Groq | Custom LPU silicon | Ultra-low latency | OpenAI-compatible
Fireworks AI | Optimized GPU (FireAttention) | High throughput, serverless | OpenAI-compatible
Together AI | GPU cluster | 200+ open models, fine-tuning | OpenAI-compatible
Replicate | On-demand GPU containers | Model marketplace, any modality | Custom REST API

Tip: Three of the four use the same API format as OpenAI. Switch between Groq, Fireworks, and Together by changing just the base URL and API key. Replicate uses its own REST format.

Prerequisites

  • Python version: 3.9+
  • Required libraries: requests, tabulate, matplotlib
  • Install: pip install requests tabulate matplotlib
  • API keys: Free-tier keys from each provider (links below)
  • Time to complete: 20-25 minutes

You’ll need API keys from each provider:
– Groq: console.groq.com — free tier with generous limits (~30 RPM)
– Fireworks AI: fireworks.ai — pay-per-token, free credits on signup
– Together AI: api.together.xyz — free tier available (~60 RPM)
– Replicate: replicate.com — pay-per-second billing

Note: Don’t have all four keys yet? No problem. The harness handles missing providers gracefully. Run it with whichever providers you have and add more later.

How to Call Each Provider with Raw HTTP

Why raw HTTP and not SDKs? I like it for speed tests because you see what’s really going on. No SDK tricks hiding slow calls. You measure the service itself, not the wrapper.

We’ll use Python’s requests library for all four. Each call sends the same prompt and times the round trip. Here’s the setup.

import micropip
await micropip.install(["requests"])

import requests
import time
import json
import os

# Shared test prompt — identical for all providers
PROMPT = "Explain gradient descent in 3 sentences for a beginner."

# Store results
results = []

Calling Groq

Groq’s API looks just like OpenAI’s. The base URL is https://api.groq.com/openai/v1/chat/completions. You send a model name, a list of messages, and your API key in the header.

Their Llama 3.1 8B model is called llama-3.1-8b-instant. The instant tag means it’s tuned for fast output.

import micropip
await micropip.install(["requests"])

def call_groq(prompt, api_key):
    """Call Groq's OpenAI-compatible endpoint."""
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "llama-3.1-8b-instant",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0
    }

    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    total_time = time.time() - start
    response.raise_for_status()  # surface HTTP errors (e.g. 429) instead of a confusing KeyError below

    data = response.json()
    text = data["choices"][0]["message"]["content"]
    tokens = data["usage"]["completion_tokens"]

    return {
        "provider": "Groq",
        "text": text,
        "total_time_s": round(total_time, 3),
        "completion_tokens": tokens,
        "tokens_per_second": round(tokens / total_time, 1)
    }

The flow: send JSON, start a clock, read the reply. Every call function follows this same shape. Only the URL, model name, and return label change.

Calling Fireworks AI

Fireworks uses the same format as OpenAI too. The base URL is https://api.fireworks.ai/inference/v1/chat/completions. Their model name has a full path: accounts/fireworks/models/llama-v3p1-8b-instruct.

I like that Fireworks puts the account name right in the model path. You always know where it lives.

import micropip
await micropip.install(["requests"])

def call_fireworks(prompt, api_key):
    """Call Fireworks AI's OpenAI-compatible endpoint."""
    url = "https://api.fireworks.ai/inference/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0
    }

    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    total_time = time.time() - start
    response.raise_for_status()  # surface HTTP errors (e.g. 429) instead of a confusing KeyError below

    data = response.json()
    text = data["choices"][0]["message"]["content"]
    tokens = data["usage"]["completion_tokens"]

    return {
        "provider": "Fireworks AI",
        "text": text,
        "total_time_s": round(total_time, 3),
        "completion_tokens": tokens,
        "tokens_per_second": round(tokens / total_time, 1)
    }

See how close the code is? That’s the nice part about these shared API formats. Write one call function, then copy-paste it with three tiny changes.
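In fact, since only the URL, model name, and label differ, you can generate the per-provider functions from one factory. Here's a sketch under the same payload assumptions as above (`make_caller` and `call_together_v2` are illustrative names, not part of any SDK):

```python
import time
import requests

def make_caller(provider_name, url, model):
    """Build a benchmark call function for any OpenAI-compatible endpoint.

    The name, URL, and model are the only per-provider inputs; the payload
    and timing logic mirror the hand-written functions above.
    """
    def call(prompt, api_key):
        headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
            "temperature": 0.0,
        }
        start = time.time()
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        total_time = time.time() - start
        response.raise_for_status()  # surface HTTP errors such as 429
        data = response.json()
        tokens = data["usage"]["completion_tokens"]
        return {
            "provider": provider_name,
            "text": data["choices"][0]["message"]["content"],
            "total_time_s": round(total_time, 3),
            "completion_tokens": tokens,
            "tokens_per_second": round(tokens / total_time, 1),
        }
    return call

# Equivalent to call_together below, built from the three changing parts
call_together_v2 = make_caller(
    "Together AI",
    "https://api.together.xyz/v1/chat/completions",
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
)
```

Either style works for the benchmark; the explicit functions are easier to read in an article, the factory is easier to maintain in real code.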

Calling Together AI

Together AI works the same way. Base URL: https://api.together.xyz/v1/chat/completions. Their Llama 3.1 8B model is meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo.

The Turbo tag means Together tuned this version for speed. They offer both normal and turbo modes for popular models.

import micropip
await micropip.install(["requests"])

def call_together(prompt, api_key):
    """Call Together AI's OpenAI-compatible endpoint."""
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0
    }

    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    total_time = time.time() - start
    response.raise_for_status()  # surface HTTP errors (e.g. 429) instead of a confusing KeyError below

    data = response.json()
    text = data["choices"][0]["message"]["content"]
    tokens = data["usage"]["completion_tokens"]

    return {
        "provider": "Together AI",
        "text": text,
        "total_time_s": round(total_time, 3),
        "completion_tokens": tokens,
        "tokens_per_second": round(tokens / total_time, 1)
    }

Quick check: You’ve now seen three provider functions. What are the only three things that change between them? (The URL, the model name, and the provider label in the return dict.)

Calling Replicate

Replicate is the odd one out. It doesn’t use the OpenAI format. You POST to https://api.replicate.com/v1/predictions with an input object. The API gives back a URL. You check that URL in a loop until the result is done.

This poll loop adds time. Replicate works with any model type — images, video, audio, text — so they use a broad async API. The cost is extra wait time from polling.

import micropip
await micropip.install(["requests"])

def call_replicate(prompt, api_key):
    """Call Replicate's prediction API (poll-based)."""
    url = "https://api.replicate.com/v1/predictions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "meta/meta-llama-3.1-8b-instruct",
        "input": {
            "prompt": prompt,
            "max_tokens": 256,
            "temperature": 0.0
        }
    }

    start = time.time()
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # surface HTTP errors (e.g. 429) before parsing
    prediction = response.json()

    # Poll until the prediction completes
    poll_url = prediction["urls"]["get"]
    while True:
        poll_resp = requests.get(poll_url, headers=headers)
        status = poll_resp.json()
        if status["status"] == "succeeded":
            break
        elif status["status"] == "failed":
            raise RuntimeError(
                f"Replicate failed: {status.get('error')}"
            )
        time.sleep(0.5)

    total_time = time.time() - start
    output_text = "".join(status["output"])

    # Replicate doesn't always return token counts
    word_count = len(output_text.split())
    est_tokens = int(word_count * 1.3)

    return {
        "provider": "Replicate",
        "text": output_text,
        "total_time_s": round(total_time, 3),
        "completion_tokens": est_tokens,
        "tokens_per_second": round(est_tokens / total_time, 1)
    }

The token count here is a rough guess. Replicate’s API doesn’t always report tokens used, so we guess with word_count * 1.3. Close enough for testing.

Warning: Replicate’s first call has cold-start lag. Booting up a new box adds 5-30 seconds. Always do a warm-up call first. Cloudflare bought Replicate in 2025, which helped — but cold starts still lag behind always-on hosts like Groq.

Adding Error Handling and Retries

Real tests hit rate limits and timeouts. You’ll want retry logic before running a full test. Here’s a wrapper that handles HTTP 429 (rate limit) and lost connections with rising wait times.

How it works: wait 1 second after the first fail, then 2 seconds, then 4. After three tries, it stops. Most rate limits clear in a few seconds.

def call_with_retry(call_fn, prompt, api_key, max_retries=3):
    """Wrap any provider call with retry logic."""
    for attempt in range(max_retries + 1):
        try:
            result = call_fn(prompt, api_key)
            return result
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                if attempt < max_retries:
                    wait = 2 ** attempt
                    print(f"  Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                else:
                    raise
            else:
                raise
        except requests.exceptions.ConnectionError:
            if attempt < max_retries:
                wait = 2 ** attempt
                print(f"  Connection error. Retry in {wait}s...")
                time.sleep(wait)
            else:
                raise
    return None

Tip: Free tiers have tight rate limits. Groq allows ~30 requests per minute. Together AI offers ~60 RPM. If you hit limits during benchmarking, reduce `num_trials` or add longer pauses between calls.

Here’s a quick reference for free-tier limits:

Provider | Free Tier RPM | Billing Model
---|---|---
Groq | ~30 | Generous free tier
Fireworks AI | Varies | Pay-per-token
Together AI | ~60 | Free tier + paid
Replicate | Varies | Pay-per-second
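Those pauses are easy to automate. Here's a minimal pacing sketch; `RateLimiter` is a hypothetical helper, and the RPM figures are the approximate limits above, not official numbers:

```python
import time

class RateLimiter:
    """Enforce a minimum gap between successive calls to one provider."""

    def __init__(self, rpm):
        self.min_interval_s = 60.0 / rpm
        self._last_call = float("-inf")  # first call never waits

    def wait(self):
        # Sleep just long enough to stay under the provider's RPM cap
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last_call = time.monotonic()

# ~30 RPM on Groq's free tier means at most one call every 2 seconds
groq_limiter = RateLimiter(rpm=30)
```

Call `groq_limiter.wait()` right before each `call_groq(...)` and the benchmark stays under the cap without hand-tuned sleeps.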

Building the Benchmark Harness

One API call tells you nothing. Network noise, cold starts, and server load all muddy the result. You need many trials and a clean summary.

I’ve found that 5 trials with the median works well. The median throws out flukes — one slow call from a blip won’t wreck your data. The harness does a warm-up call first, then runs 5 timed tries.

def run_benchmark(providers, prompt, num_trials=5):
    """Run benchmark: warm-up + N timed trials per provider."""
    all_results = {}

    for provider in providers:
        name = provider["name"]
        call_fn = provider["call_fn"]
        api_key = provider["api_key"]
        trials = []

        print(f"\nBenchmarking {name}...")

        # Warm-up call (result discarded)
        try:
            call_with_retry(call_fn, prompt, api_key)
        except Exception as e:
            print(f"  Warm-up failed: {e}")
            continue

        for i in range(num_trials):
            try:
                result = call_with_retry(
                    call_fn, prompt, api_key
                )
                if result:
                    trials.append(result)
                    print(f"  Trial {i+1}: "
                          f"{result['total_time_s']}s, "
                          f"{result['tokens_per_second']} tok/s")
            except Exception as e:
                print(f"  Trial {i+1} failed: {e}")

        if trials:
            trials.sort(key=lambda x: x["total_time_s"])
            median_idx = len(trials) // 2
            median = trials[median_idx]
            times = [t["total_time_s"] for t in trials]
            speeds = [t["tokens_per_second"] for t in trials]

            all_results[name] = {
                "median_time_s": median["total_time_s"],
                "median_tps": median["tokens_per_second"],
                "min_time_s": min(times),
                "max_time_s": max(times),
                "p95_time_s": times[int(len(times) * 0.95)]
                              if len(times) >= 5
                              else times[-1],
                "avg_tps": round(
                    sum(speeds) / len(speeds), 1
                ),
                "sample_output": median["text"][:200],
                "trials": len(trials)
            }

    return all_results

Key Insight: Use the median, not the mean, for speed tests. One slow call from a network blip blows up the mean. The median shows what you’ll get most of the time.
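A quick illustration with made-up numbers shows why: four normal calls plus one network blip.

```python
import statistics

# Five hypothetical trial latencies (seconds); the last call hit a network blip
times = [0.42, 0.40, 0.45, 0.41, 3.90]

mean = statistics.mean(times)      # dragged up by the outlier
median = statistics.median(times)  # unaffected by it

print(f"mean:   {mean:.2f}s")    # prints 1.12s, which looks slow
print(f"median: {median:.2f}s")  # prints 0.42s, the typical call
```

One bad call shifted the mean by almost 3x while the median didn't move.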

Running the Benchmark

Time to wire everything together. Configure the provider list and run the harness. Replace the placeholder keys with your actual API keys.

providers = [
    {"name": "Groq", "call_fn": call_groq,
     "api_key": os.environ.get("GROQ_API_KEY", "your-key")},
    {"name": "Fireworks AI", "call_fn": call_fireworks,
     "api_key": os.environ.get("FIREWORKS_API_KEY", "your-key")},
    {"name": "Together AI", "call_fn": call_together,
     "api_key": os.environ.get("TOGETHER_API_KEY", "your-key")},
    {"name": "Replicate", "call_fn": call_replicate,
     "api_key": os.environ.get("REPLICATE_API_TOKEN", "your-key")},
]

benchmark_results = run_benchmark(providers, PROMPT, num_trials=5)

The output shows each trial’s latency and tokens/sec. Your numbers depend on network conditions and server load.

Displaying and Visualizing Results

Raw numbers need shape. Let’s put them in a table, then draw a bar chart so the gaps pop out fast.

from tabulate import tabulate

def display_results(results):
    """Format benchmark results as a comparison table."""
    headers = [
        "Provider", "Median (s)", "p95 (s)",
        "Median TPS", "Avg TPS", "Trials"
    ]
    rows = []

    for name, stats in results.items():
        rows.append([
            name,
            stats["median_time_s"],
            stats.get("p95_time_s", "-"),
            stats["median_tps"],
            stats["avg_tps"],
            stats["trials"]
        ])

    rows.sort(key=lambda x: x[1])

    print("\n=== Benchmark Results ===\n")
    print(tabulate(rows, headers=headers, tablefmt="grid"))

display_results(benchmark_results)

Your table will look something like this (numbers vary per run):

+---------------+-------------+---------+-------------+-----------+---------+
| Provider      | Median (s)  | p95 (s) | Median TPS  | Avg TPS   | Trials  |
+===============+=============+=========+=============+===========+=========+
| Groq          | 0.4         | 0.6     | 580.0       | 550.2     | 5       |
| Fireworks AI  | 0.8         | 1.0     | 320.0       | 310.5     | 5       |
| Together AI   | 1.0         | 1.3     | 250.0       | 245.8     | 5       |
| Replicate     | 2.5         | 4.0     | 95.0        | 88.3      | 5       |
+---------------+-------------+---------+-------------+-----------+---------+

A bar chart makes the gaps jump off the screen. The green bar marks the winner in each test.

import matplotlib.pyplot as plt

def plot_benchmark(results):
    """Draw grouped bar chart of latency and throughput."""
    names = list(results.keys())
    latencies = [results[n]["median_time_s"] for n in names]
    tps_values = [results[n]["median_tps"] for n in names]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Latency (lower is better)
    colors_lat = ["#2ecc71" if l == min(latencies)
                   else "#3498db" for l in latencies]
    ax1.barh(names, latencies, color=colors_lat)
    ax1.set_xlabel("Median Latency (seconds)")
    ax1.set_title("Latency (lower is better)")
    ax1.invert_yaxis()
    for i, v in enumerate(latencies):
        ax1.text(v + 0.05, i, f"{v:.2f}s", va="center")

    # Throughput (higher is better)
    colors_tps = ["#2ecc71" if t == max(tps_values)
                   else "#3498db" for t in tps_values]
    ax2.barh(names, tps_values, color=colors_tps)
    ax2.set_xlabel("Tokens per Second")
    ax2.set_title("Throughput (higher is better)")
    ax2.invert_yaxis()
    for i, v in enumerate(tps_values):
        ax2.text(v + 5, i, f"{v:.0f}", va="center")

    plt.tight_layout()
    plt.savefig("benchmark_results.png", dpi=150,
                bbox_inches="tight")
    plt.show()

plot_benchmark(benchmark_results)

Note: These numbers are just examples. Real speed depends on time of day, server load, your network, and model version. Run the test yourself for numbers that matter to you.

Adding Cost Comparison

Speed without cost is half the story. A host that’s 10x faster but 10x pricier might lose for batch jobs.

These prices are March 2026 list rates for Llama 3.1 8B. Prices change often. Check each host’s pricing page before you decide.

# Cost per million tokens (March 2026 rates)
PRICING = {
    "Groq": {"input_per_m": 0.05, "output_per_m": 0.08},
    "Fireworks AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Together AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Replicate": {"input_per_m": 0.05, "output_per_m": 0.25},
}

def add_cost_analysis(results, pricing):
    """Add cost comparison to benchmark results."""
    print("\n=== Cost per 1M Tokens ===\n")

    headers = [
        "Provider", "Input $/1M", "Output $/1M",
        "Blended $/1M*", "Speed Rank", "Cost Rank"
    ]
    rows = []

    speed_ranked = sorted(
        results.items(),
        key=lambda x: x[1]["median_time_s"]
    )
    speed_ranks = {
        name: i + 1
        for i, (name, _) in enumerate(speed_ranked)
    }

    cost_data = []
    for name, prices in pricing.items():
        blended = (prices["input_per_m"] + prices["output_per_m"]) / 2
        cost_data.append((name, blended))
    cost_ranked = sorted(cost_data, key=lambda x: x[1])
    cost_ranks = {
        name: i + 1
        for i, (name, _) in enumerate(cost_ranked)
    }

    for name, prices in pricing.items():
        blended = (prices["input_per_m"] + prices["output_per_m"]) / 2
        rows.append([
            name,
            f"${prices['input_per_m']:.2f}",
            f"${prices['output_per_m']:.2f}",
            f"${blended:.3f}",
            speed_ranks.get(name, "-"),
            cost_ranks.get(name, "-")
        ])

    rows.sort(key=lambda x: float(x[3].replace("$", "")))
    print(tabulate(rows, headers=headers, tablefmt="grid"))
    print("\n* Blended = (input + output) / 2")

add_cost_analysis(benchmark_results, PRICING)

Key Insight: No one host wins on every front. Groq leads on speed. Fireworks and Together fight it out on throughput and model choice. Replicate shines for non-text models. Pick based on your task, not one number.

Building a Decision Function

Tables help humans compare. For live routing, you want a function that scores each service based on what matters to you. Pass weights for speed, throughput, and cost. It scales each metric from 0 to 1 and ranks the results.

Why scale the values? Raw time (in seconds) and cost (in dollars) aren’t on the same axis. Scaling fixes that so the weights do their job.

def recommend_provider(results, pricing,
                       w_latency=0.4,
                       w_throughput=0.3,
                       w_cost=0.3):
    """Score and rank providers by weighted priorities."""
    latencies = {
        n: r["median_time_s"] for n, r in results.items()
    }
    throughputs = {
        n: r["median_tps"] for n, r in results.items()
    }
    costs = {
        n: (p["input_per_m"] + p["output_per_m"]) / 2
        for n, p in pricing.items() if n in results
    }

    def norm_lower_better(vals):
        lo, hi = min(vals.values()), max(vals.values())
        if hi == lo:
            return {k: 1.0 for k in vals}
        return {
            k: round(1 - (v - lo) / (hi - lo), 3)
            for k, v in vals.items()
        }

    def norm_higher_better(vals):
        lo, hi = min(vals.values()), max(vals.values())
        if hi == lo:
            return {k: 1.0 for k in vals}
        return {
            k: round((v - lo) / (hi - lo), 3)
            for k, v in vals.items()
        }

    nl = norm_lower_better(latencies)
    nt = norm_higher_better(throughputs)
    nc = norm_lower_better(costs)

    scores = {}
    for name in results:
        if name in nc:
            score = (w_latency * nl[name] +
                     w_throughput * nt[name] +
                     w_cost * nc[name])
            scores[name] = round(score, 3)

    ranked = sorted(
        scores.items(), key=lambda x: x[1], reverse=True
    )

    print(f"\n=== Recommendation (lat={w_latency}, "
          f"tps={w_throughput}, cost={w_cost}) ===\n")
    for rank, (name, score) in enumerate(ranked, 1):
        bar = "#" * int(score * 30)
        print(f"  {rank}. {name:15s} {score:.3f} {bar}")
    print(f"\n  Winner: {ranked[0][0]}")

    return ranked

# Balanced priorities
recommend_provider(benchmark_results, PRICING)

Now try different scenarios. The weights shift the winner dramatically.

# Real-time chat: latency is king
print("=== Scenario: Real-time Chat ===")
recommend_provider(benchmark_results, PRICING,
                   w_latency=0.7, w_throughput=0.2, w_cost=0.1)

# Batch processing: cost matters most
print("\n=== Scenario: Batch Processing ===")
recommend_provider(benchmark_results, PRICING,
                   w_latency=0.1, w_throughput=0.3, w_cost=0.6)

For live chat, Groq wins because speed matters most. For batch jobs, the cheapest per-token host takes the lead. That’s the whole point — your task picks the winner.

Try It Yourself

Exercise 1: Add a Fifth Provider

Add Cerebras to the benchmark. Cerebras uses an OpenAI-compatible endpoint at https://api.cerebras.ai/v1/chat/completions. Their Llama 3.1 8B model is llama3.1-8b. Write a call_cerebras() function, add it to the providers list, and re-run.

Hints

1. Copy the `call_groq` function. Change the URL, model name, and provider label.
2. Add `{"name": "Cerebras", "call_fn": call_cerebras, "api_key": os.environ.get("CEREBRAS_API_KEY", "your-key")}` to the providers list.

Solution
import micropip
await micropip.install(["requests"])

def call_cerebras(prompt, api_key):
    """Call Cerebras OpenAI-compatible endpoint."""
    url = "https://api.cerebras.ai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "llama3.1-8b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0
    }
    start = time.time()
    response = requests.post(url, headers=headers, json=payload)
    total_time = time.time() - start
    data = response.json()
    text = data["choices"][0]["message"]["content"]
    tokens = data["usage"]["completion_tokens"]
    return {
        "provider": "Cerebras",
        "text": text,
        "total_time_s": round(total_time, 3),
        "completion_tokens": tokens,
        "tokens_per_second": round(tokens / total_time, 1)
    }

These hosts swap in and out at the API level. Only the URL, model name, and key change. That’s why this setup works so well.

Measuring Time-to-First-Token with Streaming

Total time tells you how long the full reply took. But for chat apps, what counts is how fast the first word shows up. That’s time-to-first-token (TTFT).

To track TTFT, turn on streaming and note the time when the first data chunk lands. Here’s how it works with Groq. The same trick works for Fireworks and Together.

import micropip
await micropip.install(["requests"])

def call_groq_streaming(prompt, api_key):
    """Measure TTFT and total time with streaming."""
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "llama-3.1-8b-instant",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0,
        "stream": True
    }

    start = time.time()
    response = requests.post(
        url, headers=headers, json=payload, stream=True
    )

    ttft = None
    chunks = []
    for line in response.iter_lines():
        if line:
            if ttft is None:
                ttft = time.time() - start
            decoded = line.decode("utf-8")
            if decoded.startswith("data: "):
                chunk_data = decoded[6:]
                if chunk_data != "[DONE]":
                    data = json.loads(chunk_data)
                    delta = data["choices"][0]["delta"]
                    if "content" in delta:
                        chunks.append(delta["content"])

    total_time = time.time() - start
    text = "".join(chunks)

    return {
        "provider": "Groq (streaming)",
        "ttft_s": round(ttft, 3) if ttft else None,
        "total_time_s": round(total_time, 3),
        "text": text
    }

# Try it
result = call_groq_streaming(
    PROMPT,
    os.environ.get("GROQ_API_KEY", "your-key")
)
print(f"TTFT: {result['ttft_s']}s")
print(f"Total: {result['total_time_s']}s")

Groq’s TTFT is usually under 100ms. That’s why Groq chat apps feel instant — the first word shows up before the user finishes reading the prompt.

Key Insight: TTFT matters more than total time for chat apps. Users judge “fast” by when the first word shows up, not when the last word lands. A 2-second reply with 50ms TTFT feels faster than a 1-second reply with 500ms TTFT.

When to Use Which Provider

Benchmarks give you numbers. Real decisions depend on context. Here’s how I’d match providers to workloads.

Groq — Real-time, user-facing experiences.
Groq’s LPU delivers the lowest latency. Building a chatbot or voice assistant? Test Groq first. The tradeoff: fewer model options than Together or Fireworks.

Fireworks AI — Throughput-heavy and multi-modal workloads.
Fireworks shines at batch jobs and high traffic. Their FireAttention engine handles many requests at once. They also support tool calling, JSON output, and image models.

Together AI — Model variety and fine-tuning.
Together hosts 200+ open models and lets you fine-tune them. Need to try a bunch of models or serve a custom one? Together has the biggest menu. Their Turbo models are fast too.

Replicate — Non-text models and rapid prototyping.
Replicate’s model shop has image, video, audio, and text models. The poll-based API and cold starts slow it down for text work. But if your pipeline mixes Stable Diffusion, Whisper, and LLMs, Replicate keeps it all in one place.

Warning: Don’t test once and forget. These hosts update their gear all the time. Groq might double speed next month. Together might cut prices. Re-run your test every few months or when you spot changes.

Common Mistakes and How to Fix Them

Mistake 1: Benchmarking on cold starts

Wrong:

# First call includes container spin-up
result = call_replicate(prompt, api_key)
print(f"Latency: {result['total_time_s']}s")

Why it’s wrong: The first call to a cloud host can take 5-30 seconds to boot up. That blows up the time by 10x.

Correct:

# Warm-up call first (discard the result)
call_replicate(prompt, api_key)
result = call_replicate(prompt, api_key)
print(f"Latency: {result['total_time_s']}s")

Mistake 2: Running a single trial

Wrong:

result = call_groq(prompt, api_key)
print(f"Groq does {result['tokens_per_second']} tok/s!")

Why it’s wrong: Network noise causes 2-3x swings between calls. One trial is just noise, not a real signal.

Correct:

trials = [call_groq(prompt, api_key) for _ in range(5)]
median_tps = sorted(
    t["tokens_per_second"] for t in trials
)[2]
print(f"Groq median: {median_tps} tok/s")

Mistake 3: Comparing different model sizes

Wrong: Benchmarking Groq’s Llama 3.1 8B against Together’s Llama 3.1 70B and calling Groq faster. Different sizes have fundamentally different compute needs.

Correct: Always compare the same model — same parameter count, same quantization — across all providers.

Try It Yourself

Exercise 2: Compute Cost Per Batch

Write a function that estimates the total cost of running 10,000 prompts through each provider. Assume 50 input tokens and 150 output tokens per prompt. Use the PRICING dictionary.

Hints

1. Cost per prompt = `(input_tokens * input_price / 1_000_000) + (output_tokens * output_price / 1_000_000)`.
2. Multiply by the number of prompts for the batch total.

Solution
def cost_per_batch(pricing, num_prompts=10000,
                   input_tokens=50, output_tokens=150):
    """Estimate batch cost per provider."""
    print(f"\nCost for {num_prompts:,} prompts "
          f"({input_tokens} in + {output_tokens} out):\n")

    for name, prices in pricing.items():
        input_cost = (num_prompts * input_tokens
                      * prices["input_per_m"] / 1_000_000)
        output_cost = (num_prompts * output_tokens
                       * prices["output_per_m"] / 1_000_000)
        total = input_cost + output_cost
        print(f"  {name:15s}: ${total:.4f}")

cost_per_batch(PRICING)

At 10,000 prompts the differences are pennies. At 10 million, they become hundreds of dollars. Provider choice matters at scale.
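That scale effect is quick to check. A sketch using the March 2026 rates from earlier (`batch_cost` is an illustrative helper, and the pricing dict is repeated so the block runs on its own):

```python
# Repeats the pricing table so this block is self-contained
PRICING = {
    "Groq": {"input_per_m": 0.05, "output_per_m": 0.08},
    "Fireworks AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Together AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Replicate": {"input_per_m": 0.05, "output_per_m": 0.25},
}

def batch_cost(prices, num_prompts, in_tok=50, out_tok=150):
    """Total cost of a batch at the given per-million-token rates."""
    per_prompt = (in_tok * prices["input_per_m"]
                  + out_tok * prices["output_per_m"]) / 1_000_000
    return num_prompts * per_prompt

for n in (10_000, 10_000_000):
    costs = {name: batch_cost(p, n) for name, p in PRICING.items()}
    spread = max(costs.values()) - min(costs.values())
    print(f"{n:>12,} prompts: cheapest ${min(costs.values()):.2f}, "
          f"spread ${spread:.2f}")
```

At 10,000 prompts the spread between cheapest and priciest is about a quarter; at 10 million it's roughly $255 at these rates.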

Complete Code

Click to expand the full script (copy-paste and run)
import micropip
await micropip.install(["requests"])

# Complete code from: Inference Providers Compared
# Requires: pip install requests tabulate matplotlib
# Python 3.9+

import requests
import time
import json
import os
from tabulate import tabulate
import matplotlib.pyplot as plt

# --- Config ---
PROMPT = "Explain gradient descent in 3 sentences for a beginner."

PRICING = {
    "Groq": {"input_per_m": 0.05, "output_per_m": 0.08},
    "Fireworks AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Together AI": {"input_per_m": 0.10, "output_per_m": 0.10},
    "Replicate": {"input_per_m": 0.05, "output_per_m": 0.25},
}

# --- Provider Functions ---
def call_groq(prompt, api_key):
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {"model": "llama-3.1-8b-instant", "messages": [{"role": "user", "content": prompt}], "max_tokens": 256, "temperature": 0.0}
    start = time.time()
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()  # raise HTTPError so call_with_retry can see 429s
    t = time.time() - start
    d = resp.json()
    return {"provider": "Groq", "text": d["choices"][0]["message"]["content"], "total_time_s": round(t, 3), "completion_tokens": d["usage"]["completion_tokens"], "tokens_per_second": round(d["usage"]["completion_tokens"] / t, 1)}

def call_fireworks(prompt, api_key):
    url = "https://api.fireworks.ai/inference/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {"model": "accounts/fireworks/models/llama-v3p1-8b-instruct", "messages": [{"role": "user", "content": prompt}], "max_tokens": 256, "temperature": 0.0}
    start = time.time()
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()  # raise HTTPError so call_with_retry can see 429s
    t = time.time() - start
    d = resp.json()
    return {"provider": "Fireworks AI", "text": d["choices"][0]["message"]["content"], "total_time_s": round(t, 3), "completion_tokens": d["usage"]["completion_tokens"], "tokens_per_second": round(d["usage"]["completion_tokens"] / t, 1)}

def call_together(prompt, api_key):
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", "messages": [{"role": "user", "content": prompt}], "max_tokens": 256, "temperature": 0.0}
    start = time.time()
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()  # raise HTTPError so call_with_retry can see 429s
    t = time.time() - start
    d = resp.json()
    return {"provider": "Together AI", "text": d["choices"][0]["message"]["content"], "total_time_s": round(t, 3), "completion_tokens": d["usage"]["completion_tokens"], "tokens_per_second": round(d["usage"]["completion_tokens"] / t, 1)}

def call_replicate(prompt, api_key):
    url = "https://api.replicate.com/v1/predictions"
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {"model": "meta/meta-llama-3.1-8b-instruct", "input": {"prompt": prompt, "max_tokens": 256, "temperature": 0.0}}
    start = time.time()
    resp = requests.post(url, headers=headers, json=payload)
    resp.raise_for_status()  # raise HTTPError so call_with_retry can see 429s
    pred = resp.json()
    poll_url = pred["urls"]["get"]
    while True:
        s = requests.get(poll_url, headers=headers).json()
        if s["status"] == "succeeded":
            break
        if s["status"] == "failed":
            raise RuntimeError(f"Failed: {s.get('error')}")
        time.sleep(0.5)
    t = time.time() - start
    text = "".join(s["output"])
    est = int(len(text.split()) * 1.3)  # rough estimate: ~1.3 tokens per word
    return {"provider": "Replicate", "text": text, "total_time_s": round(t, 3), "completion_tokens": est, "tokens_per_second": round(est / t, 1)}

def call_with_retry(fn, prompt, key, max_retries=3):
    """Retry with exponential backoff on rate limits and connection drops."""
    for attempt in range(max_retries + 1):
        try:
            return fn(prompt, key)
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429 and attempt < max_retries:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s
            else:
                raise
        except requests.exceptions.ConnectionError:
            if attempt < max_retries:
                time.sleep(2 ** attempt)
            else:
                raise

def run_benchmark(providers, prompt, num_trials=5):
    results = {}
    for p in providers:
        name, fn, key = p["name"], p["call_fn"], p["api_key"]
        trials = []
        print(f"\nBenchmarking {name}...")
        # Warm-up call: the first request often pays a cold-start cost; skip
        # the provider entirely if even the warm-up fails after retries.
        try:
            call_with_retry(fn, prompt, key)
        except Exception:
            continue
        for i in range(num_trials):
            try:
                r = call_with_retry(fn, prompt, key)
                if r:
                    trials.append(r)
                    print(f"  Trial {i+1}: {r['total_time_s']}s, "
                          f"{r['tokens_per_second']} tok/s")
            except Exception as e:
                print(f"  Trial {i+1} failed: {e}")
        if trials:
            trials.sort(key=lambda x: x["total_time_s"])
            m = trials[len(trials)//2]
            ts = [t["total_time_s"] for t in trials]
            sp = [t["tokens_per_second"] for t in trials]
            results[name] = {
                "median_time_s": m["total_time_s"],
                "median_tps": m["tokens_per_second"],
                "min_time_s": min(ts),
                "max_time_s": max(ts),
                "p95_time_s": ts[-1],  # with few trials, max approximates p95
                "avg_tps": round(sum(sp) / len(sp), 1),
                "trials": len(trials),
            }
    return results

# --- Run ---
providers = [
    {"name": "Groq", "call_fn": call_groq, "api_key": os.environ.get("GROQ_API_KEY", "your-key")},
    {"name": "Fireworks AI", "call_fn": call_fireworks, "api_key": os.environ.get("FIREWORKS_API_KEY", "your-key")},
    {"name": "Together AI", "call_fn": call_together, "api_key": os.environ.get("TOGETHER_API_KEY", "your-key")},
    {"name": "Replicate", "call_fn": call_replicate, "api_key": os.environ.get("REPLICATE_API_TOKEN", "your-key")},
]
results = run_benchmark(providers, PROMPT, num_trials=5)

# Display
headers = ["Provider", "Median (s)", "Median TPS", "Avg TPS", "Trials"]
rows = [[n, s["median_time_s"], s["median_tps"], s["avg_tps"], s["trials"]] for n, s in results.items()]
rows.sort(key=lambda x: x[1])
print(tabulate(rows, headers=headers, tablefmt="grid"))

print("\nScript completed successfully.")

Summary

You now have a working benchmark harness that compares inference providers on latency, throughput, TTFT, and cost. Here’s what matters:

  1. Same model, different provider = different performance. Hardware and serving optimizations create real gaps.
  2. Use the median of multiple trials. Single calls are noise.
  3. Measure TTFT for user-facing apps. It matters more than total latency.
  4. Match the provider to your workload. Groq for real-time, Fireworks for throughput, Together for variety, Replicate for non-text.
  5. Re-benchmark regularly. The landscape shifts every quarter.

Practice exercise: Extend the benchmark to test longer prompts (500+ input tokens) and longer outputs (1000+ tokens). You’ll find the performance gaps change — some providers optimize for short exchanges while others handle long-form generation better.

Solution sketch
LONG_PROMPT = """You are a technical writer. Write a detailed
explanation of how transformers work in deep learning. Cover
the attention mechanism, positional encoding, and the
encoder-decoder architecture. Include mathematical notation
where appropriate."""

long_results = run_benchmark(
    providers, LONG_PROMPT, num_trials=3
)
display_results(long_results)

This sketch only lengthens the input; to test 1000+-token outputs, also raise max_tokens in the provider payloads (the harness caps it at 256). With longer inputs and outputs, throughput matters more than TTFT, and providers optimized for high throughput may close the gap with Groq.

Frequently Asked Questions

Can I use the OpenAI Python SDK with these providers?

Yes — for Groq, Fireworks, and Together. Set base_url to the provider’s endpoint and pass your API key. Replicate needs its own SDK or raw HTTP.

import micropip
await micropip.install(["openai"])

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-key"
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

Do these providers offer streaming responses?

All four can stream. For the OpenAI-style hosts, set "stream": True in the request payload and the server returns tokens as server-sent events, chunk by chunk. Replicate streams through its own API. Streaming makes an app feel fast because words appear as they are generated instead of all at once.
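Concretely, OpenAI-style streaming arrives as server-sent events: lines prefixed with `data: `, ending with a `[DONE]` sentinel. Here is a minimal sketch, assuming the same Groq endpoint and model names used in the benchmark above (the parser is split out so it works unchanged against any of the three OpenAI-compatible hosts):

```python
import json

def parse_sse_line(line: bytes):
    """Extract the text delta from one SSE line; None for keep-alives/[DONE]."""
    if not line.startswith(b"data: "):
        return None
    data = line[len(b"data: "):]
    if data == b"[DONE]":  # OpenAI-style end-of-stream sentinel
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content", "")

def stream_completion(prompt, api_key):
    """Print tokens as they arrive; the first delta's arrival is your TTFT."""
    import time
    import requests  # imported here so the parser above stays dependency-free
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"model": "llama-3.1-8b-instant",
               "messages": [{"role": "user", "content": prompt}],
               "stream": True, "max_tokens": 256}
    start = time.time()
    first = None
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            delta = parse_sse_line(line or b"")
            if delta:
                if first is None:
                    first = round(time.time() - start, 3)  # time to first token
                print(delta, end="", flush=True)
    print(f"\nTTFT: {first}s")
```

As a bonus, timing the first chunk this way is exactly how you measure TTFT, which the benchmark tracks for user-facing workloads.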

How do I handle rate limits in production?

Add retry logic with exponential backoff: wait 1s after the first 429 error, then 2s, then 4s — the call_with_retry helper in the complete script does exactly this. Most rate limits clear within seconds. For sustained high volume, upgrade to a paid tier or spread the load across providers.

Should I use one provider or multiple?

For live apps, route across more than one host: send real-time calls to the fastest and batch jobs to the cheapest. This “LLM routing” pattern avoids vendor lock-in and gives you a fallback when one host goes down.
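Once you have benchmark medians and list prices, a router can be a few lines. A minimal sketch, assuming result and pricing dictionaries shaped like the ones in this article — the medians below are illustrative stand-ins, not measured values:

```python
def pick_provider(workload, results, pricing,
                  input_tokens=50, output_tokens=150):
    """Route real-time traffic to the fastest provider, batch to the cheapest."""
    if workload == "realtime":
        # Fastest by median benchmark latency.
        return min(results, key=lambda n: results[n]["median_time_s"])
    # Batch: rank by blended cost of a typical prompt (input + output tokens).
    def blended_cost(n):
        p = pricing[n]
        return (input_tokens * p["input_per_m"]
                + output_tokens * p["output_per_m"]) / 1_000_000
    return min(pricing, key=blended_cost)

# Stand-in medians for illustration; plug in your own run_benchmark output.
results = {"Groq": {"median_time_s": 0.41},
           "Together AI": {"median_time_s": 0.92}}
pricing = {"Groq": {"input_per_m": 0.05, "output_per_m": 0.08},
           "Together AI": {"input_per_m": 0.10, "output_per_m": 0.10}}

print(pick_provider("realtime", results, pricing))
print(pick_provider("batch", results, pricing))
```

For failover, keep the full ranking instead of just the winner (e.g. `sorted(results, key=...)`) and fall through to the next provider on errors.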

