
OpenAI, Claude & Gemini API Tutorial in Python (2026)

Learn to call OpenAI, Claude, and Gemini APIs from Python in 15 minutes. Includes code examples, error handling, streaming, and a unified wrapper.

Written by Selva Prabhakaran | 27 min read

You’ve heard about GPT, Claude, and Gemini. But have you actually called one from your own Python script? It’s way easier than you think — and you’ll have all three running before your coffee gets cold.


What Is an LLM API?

Every time you type into ChatGPT, your browser sends an HTTP request to an API behind the scenes. The chat window? Just a wrapper. The real power sits behind it.

An API (Application Programming Interface) lets you talk to these models from code. You send a message, the model sends a reply. That’s the whole idea.

Why should you care? Because the API gives you control the chat window never will.

You pick the model. You set how creative or deterministic the output is. You define the system prompt. You parse the response however you want. You can process a thousand documents while you sleep.

Three providers dominate right now:

| Provider | Top Models | Known For |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | Largest ecosystem, widest adoption |
| Anthropic | Claude Sonnet 4, Claude Haiku 3.5 | Precise instruction-following, long context |
| Google | Gemini 2.0 Flash, Gemini 1.5 Pro | Multimodal tasks, generous free tier |

By the end of this tutorial, you’ll call all three from Python and compare their responses side-by-side.

Prerequisites

  • Python version: 3.10+
  • Required library: requests (not in the standard library; install with pip install requests if you don't already have it)
  • API keys: One from each provider (setup below)
  • Time to complete: ~15 minutes
  • Cost: Under $0.01 total

Get Your API Keys

Before any code runs, you need API keys. Each provider gives you one. It’s how they know who’s making the request and who to bill.

OpenAI:
1. Go to platform.openai.com/api-keys
2. Click “Create new secret key”
3. Copy it immediately — you won’t see it again

Anthropic (Claude):
1. Go to console.anthropic.com/settings/keys
2. Click “Create Key”
3. Copy and save it

Google (Gemini):
1. Go to aistudio.google.com/apikey
2. Click “Create API Key”
3. Select a project and copy the key

Free tier alert: Google gives a generous free tier for Gemini. OpenAI and Anthropic give small signup credits. This whole tutorial costs under $0.01.

Store your keys as environment variables. Never hardcode them in scripts you’ll share or commit.

The cleanest approach is a .env file with python-dotenv. Here’s the full setup:


# First: pip install python-dotenv

# Create a file called .env in your project folder:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# GOOGLE_API_KEY=AIza...

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env into environment variables

# Now these work:
print(os.environ.get("OPENAI_API_KEY", "")[:8] + "...")
Output:
sk-proj-...

If you’d rather skip the .env file for now, you can set keys directly in Python. It’s fine for learning — just don’t do it in production code.

import os

os.environ["OPENAI_API_KEY"] = "your-key-here"
os.environ["ANTHROPIC_API_KEY"] = "your-key-here"
os.environ["GOOGLE_API_KEY"] = "your-key-here"

Never commit API keys to git. Add .env to your .gitignore file. Leaked keys get abused within minutes.

Your First LLM API Call — OpenAI

Here’s the core idea behind every LLM API call in Python. You send a list of messages with roles. The model reads them and responds.

There are three roles:

  • system — tells the model HOW to behave (personality, rules, constraints)
  • user — that’s you, or your app’s user
  • assistant — the model’s previous responses (used for multi-turn chats)

Think of it like a script for a play. Each message has a speaker and their line. The model reads the whole script and writes the next line.

We’ll use the requests library for all API calls. No SDKs needed. This approach runs anywhere — even in the browser with Pyodide.

The code below sends one user message with a system prompt to GPT-4o-mini. Pay attention to two things: the messages array structure and the Authorization header format.

import requests
import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful Python tutor."},
            {"role": "user", "content": "Explain list comprehensions in one sentence."}
        ],
        "temperature": 0.7,
        "max_tokens": 150
    }
)

data = response.json()
print(data["choices"][0]["message"]["content"])
Output:
A list comprehension is a concise way to create a new list by applying an expression to each item in an iterable, optionally filtering items with a condition, all in a single readable line.

That’s it. You just called GPT-4o-mini from Python. Your first LLM API request is done.

Two parameters to know right away:

  • temperature controls randomness. Set it to 0 for deterministic output, 1.0 for creative output. I usually start at 0.7.
  • max_tokens caps the response length. Always set this — otherwise the model might generate thousands of tokens, and you pay for every one.
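
If you want to see the effect of temperature yourself, here's a small, optional experiment. It reuses the requests/os setup and API key from above and asks the same question twice at two temperatures. Treat it as a sketch rather than a rigorous test, since even temperature 0 isn't guaranteed to be perfectly deterministic.

def ask(prompt, temperature):
    # Minimal helper: same endpoint and headers as the call above.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": "application/json"
        },
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "max_tokens": 60
        },
        timeout=30
    )
    return resp.json()["choices"][0]["message"]["content"]

question = "Give me one fun fact about the Python language."
for temp in (0, 1.0):
    print(f"--- temperature={temp} ---")
    print(ask(question, temp))   # run twice at each temperature and compare
    print(ask(question, temp))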

What the Response JSON Looks Like

The response has useful metadata beyond just the text. Let’s look at the full structure so you know what’s available.

import json

print(json.dumps(data, indent=2))
Output:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A list comprehension is a concise way to..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 42,
    "total_tokens": 70
  }
}

Three fields worth bookmarking:

  • choices[0].message.content — the actual response text
  • choices[0].finish_reason — "stop" means it finished naturally; "length" means it hit your max_tokens limit
  • usage — token counts for billing (input + output)

You pay for every token, input AND output. A token is roughly 4 characters in English. The usage field tells you exactly how many tokens each request consumed.
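
To turn the usage field into an actual dollar figure, multiply the input and output counts by the per-token prices. The rates below are the GPT-4o-mini prices from the pricing table later in this post; check current pricing before relying on them.

# Rough cost of the single call above, using the usage field.
# Assumed GPT-4o-mini rates: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
usage = data["usage"]
cost = (usage["prompt_tokens"] * 0.15 + usage["completion_tokens"] * 0.60) / 1_000_000
print(f"Input tokens:   {usage['prompt_tokens']}")
print(f"Output tokens:  {usage['completion_tokens']}")
print(f"Estimated cost: ${cost:.6f}")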

Call the Claude API

Claude’s API follows the same concept, but Anthropic made a few design choices that’ll trip you up if you’re not ready for them. The biggest one: the system prompt lives outside the messages array.

Here are the four key differences from OpenAI you should watch for: the x-api-key header, the anthropic-version header, the top-level system field, and the response path (content[0].text instead of choices[0].message.content).

ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json"
    },
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 150,
        "system": "You are a helpful Python tutor.",
        "messages": [
            {"role": "user", "content": "Explain list comprehensions in one sentence."}
        ]
    }
)

data = response.json()
print(data["content"][0]["text"])
Output:
A list comprehension lets you build a new list by writing a for-loop and an optional if-filter inside square brackets, replacing what would otherwise take three or four lines of code with a single expressive line.

Here’s a quick summary of the differences:

| Feature | OpenAI | Claude |
|---|---|---|
| Auth header | Authorization: Bearer KEY | x-api-key: KEY |
| Version header | Not required | anthropic-version (required) |
| System prompt | Inside messages array | Top-level system field |
| Response path | choices[0].message.content | content[0].text |
| Token fields | total_tokens | input_tokens + output_tokens |

These differences are small, but they’ll bite you if you switch providers without checking the docs first.

Call the Gemini API

Google’s Gemini API differs the most from the other two. The model name goes in the URL itself, and each message carries a parts array instead of a content string.

Why parts? Because Gemini was built multimodal from the start. You can mix text, images, and audio in the same message. That parts array supports all of them. For plain text it feels verbose, but it shines when you start sending images.

Watch for these differences: API key in the URL (not headers), contents instead of messages, candidates instead of choices, and camelCase parameter names.

GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

response = requests.post(
    f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={GOOGLE_API_KEY}",
    headers={"Content-Type": "application/json"},
    json={
        "system_instruction": {
            "parts": [{"text": "You are a helpful Python tutor."}]
        },
        "contents": [
            {
                "role": "user",
                "parts": [{"text": "Explain list comprehensions in one sentence."}]
            }
        ],
        "generationConfig": {
            "temperature": 0.7,
            "maxOutputTokens": 150
        }
    }
)

data = response.json()
print(data["candidates"][0]["content"]["parts"][0]["text"])
Output:
List comprehensions provide a compact syntax for creating lists by applying an expression to each element of an iterable, with an optional filtering condition, all written within square brackets on a single line.
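
That parts structure is also what makes image input straightforward. The sketch below shows roughly how an image plus a question could be sent. The inline_data / mime_type field names follow Google's documented REST format, but treat this as illustrative and verify against the current Gemini docs before using it.

import base64

# Hypothetical local file -- replace with a real image path.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={GOOGLE_API_KEY}",
    headers={"Content-Type": "application/json"},
    json={
        "contents": [{
            "role": "user",
            "parts": [
                {"text": "What does this chart show?"},
                {"inline_data": {"mime_type": "image/png", "data": image_b64}}
            ]
        }]
    },
    timeout=30
)
print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])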

You’ve now called all three providers. Same question, three slightly different answers. Each has its own response format, but the core idea is identical: send messages, get text back.

Compare All Three Side-by-Side

This is where things get interesting. Let’s send the same prompt to all three and compare responses, speed, and token usage in one shot.

I’ll build a helper function for each provider. Each wraps the API call, times it, and returns a consistent dictionary. That way you can loop through providers without juggling format differences.

Here’s the OpenAI helper. It grabs the API key, posts the request, and packages the result into a clean dictionary with provider, response, tokens, and latency:

import time

def call_openai(prompt, system="You are a helpful assistant."):
    start = time.time()
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": "application/json"
        },
        json={
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.7, "max_tokens": 200
        }
    )
    elapsed = time.time() - start
    data = resp.json()
    return {
        "provider": "OpenAI (gpt-4o-mini)",
        "response": data["choices"][0]["message"]["content"],
        "tokens": data["usage"]["total_tokens"],
        "latency": round(elapsed, 2)
    }

The Claude helper follows the same pattern but swaps in the Anthropic header format and response path:

def call_claude(prompt, system="You are a helpful assistant."):
    start = time.time()
    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json"
        },
        json={
            "model": "claude-sonnet-4-20250514",
            "max_tokens": 200,
            "system": system,
            "messages": [{"role": "user", "content": prompt}]
        }
    )
    elapsed = time.time() - start
    data = resp.json()
    return {
        "provider": "Claude (claude-sonnet-4)",
        "response": data["content"][0]["text"],
        "tokens": data["usage"]["input_tokens"] + data["usage"]["output_tokens"],
        "latency": round(elapsed, 2)
    }

And the Gemini helper — note the API key in the URL and the nested parts structure:

def call_gemini(prompt, system="You are a helpful assistant."):
    start = time.time()
    resp = requests.post(
        f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={os.environ.get('GOOGLE_API_KEY', '')}",
        headers={"Content-Type": "application/json"},
        json={
            "system_instruction": {"parts": [{"text": system}]},
            "contents": [{"role": "user", "parts": [{"text": prompt}]}],
            "generationConfig": {"temperature": 0.7, "maxOutputTokens": 200}
        }
    )
    elapsed = time.time() - start
    data = resp.json()
    return {
        "provider": "Gemini (gemini-2.0-flash)",
        "response": data["candidates"][0]["content"]["parts"][0]["text"],
        "tokens": data.get("usageMetadata", {}).get("totalTokenCount", 0),
        "latency": round(elapsed, 2)
    }

With all three helpers ready, here’s the comparison. Same prompt, three providers, printed together:

prompt = "What are the top 3 tips for writing clean Python code? Be concise."

results = [call_openai(prompt), call_claude(prompt), call_gemini(prompt)]

for r in results:
    print(f"\n{'='*60}")
    print(f"Provider: {r['provider']}")
    print(f"Latency:  {r['latency']}s | Tokens: {r['tokens']}")
    print(f"{'='*60}")
    print(r["response"])

Your output will look something like this (exact responses and timings vary with each run):

============================================================
Provider: OpenAI (gpt-4o-mini)
Latency:  1.2s | Tokens: 90
============================================================
1. Use meaningful variable names that describe what they hold.
2. Follow PEP 8 style guidelines for consistent formatting.
3. Write small, focused functions that do one thing well.

============================================================
Provider: Claude (claude-sonnet-4)
Latency:  1.4s | Tokens: 95
============================================================
1. Use descriptive names -- variables, functions, and classes
   should reveal their purpose without needing comments.
2. Keep functions short and focused -- each function should
   do exactly one thing.
3. Follow PEP 8 -- consistent style makes code readable for
   everyone, including future you.

============================================================
Provider: Gemini (gemini-2.0-flash)
Latency:  0.9s | Tokens: 80
============================================================
1. Use descriptive variable and function names.
2. Follow PEP 8 for consistent code style.
3. Keep functions small and focused on a single task.

Same prompt, three different styles. Claude tends to give more detail. Gemini Flash is often the fastest. OpenAI lands in the middle. Your exact latencies will vary depending on network conditions and server load.

No single provider is “best” for everything. OpenAI has the largest ecosystem. Claude excels at instruction-following and long documents. Gemini offers the best price-to-performance with native multimodal support. I usually start with the cheapest model and only upgrade when it falls short.

Exercise 1: Call All Three Providers

Task: Change the prompt to ask: “What is the difference between a list and a tuple in Python? Answer in exactly 2 sentences.”

Compare the three responses. Which provider followed “exactly 2 sentences” most precisely?

Hints

1. Just change the `prompt` variable — the helper functions handle the rest.
2. Count sentences by splitting on `. ` (period + space).
3. Try it before checking the solution!

Solution
prompt = "What is the difference between a list and a tuple in Python? Answer in exactly 2 sentences."

results = [call_openai(prompt), call_claude(prompt), call_gemini(prompt)]

for r in results:
    text = r["response"]
    sentences = [s.strip() for s in text.replace(".\n", ". ").split(". ") if s.strip()]
    print(f"{r['provider']}: {len(sentences)} sentences")
    print(f"  {text}\n")

**What you’ll typically notice:** Claude tends to follow the “exactly 2 sentences” constraint most reliably. GPT-4o-mini sometimes adds a third. Gemini Flash sometimes merges into one long sentence. Instruction-following varies between providers — that’s a real difference worth knowing when you pick one for your project.


Handle Errors Like a Professional

API calls fail. Networks drop. Rate limits hit. Keys expire. If your code doesn’t handle these, it’ll crash at the worst possible time.

Here’s what production-quality error handling looks like for OpenAI. The key additions: check the API key before calling, set a timeout, and handle specific HTTP status codes with clear messages.

def safe_call_openai(prompt, system="You are a helpful assistant."):
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        return {"error": "OPENAI_API_KEY not set"}

    try:
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json"
            },
            json={
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt}
                ],
                "max_tokens": 200
            },
            timeout=30
        )

        if resp.status_code == 401:
            return {"error": "Invalid API key."}
        elif resp.status_code == 429:
            return {"error": "Rate limit hit. Wait and retry."}
        elif resp.status_code != 200:
            return {"error": f"HTTP {resp.status_code}: {resp.text[:200]}"}

        data = resp.json()
        return {
            "content": data["choices"][0]["message"]["content"],
            "tokens": data["usage"]["total_tokens"]
        }

    except requests.exceptions.Timeout:
        return {"error": "Request timed out after 30s."}
    except requests.exceptions.ConnectionError:
        return {"error": "No internet connection."}
    except Exception as e:
        return {"error": f"Unexpected: {str(e)}"}

Test it like this:

result = safe_call_openai("Say hello in 5 words.")
if "error" in result:
    print(f"Failed: {result['error']}")
else:
    print(f"Response: {result['content']}")
    print(f"Tokens used: {result['tokens']}")
Output:
Response: Hello there, how are you?
Tokens used: 35

The three most common errors you’ll hit:

  1. 401 Unauthorized — wrong or expired API key
  2. 429 Rate Limited — too many requests too fast
  3. 500/503 Server Error — provider is having issues (retry after a short wait)

Always set a timeout. Without it, a hung connection blocks your code forever. I’ve seen scripts freeze for hours because someone forgot this one parameter.
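
For rate limits and transient server errors, a simple retry loop with exponential backoff goes a long way. Here's a minimal sketch that wraps the safe_call_openai helper above. It retries on any error, which is fine for 429s and 5xx hiccups but wasteful for a bad API key, so refine the condition for real use.

def call_with_retries(fn, prompt, max_retries=3):
    """Retry a safe_call_* helper with exponential backoff (1s, 2s, 4s, ...)."""
    result = {"error": "not called"}
    for attempt in range(max_retries):
        result = fn(prompt)
        if "error" not in result:
            return result
        wait = 2 ** attempt
        print(f"Attempt {attempt + 1} failed ({result['error']}), retrying in {wait}s...")
        time.sleep(wait)
    return result  # last error if every attempt failed

# Usage:
# result = call_with_retries(safe_call_openai, "Say hello in 5 words.")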

Exercise 2: Add Error Handling for Claude

Task: Write a safe_call_claude function that mirrors safe_call_openai. Handle the same error cases: missing key, auth error, rate limit, and timeout.

Hints

1. Claude uses the `x-api-key` header. A 401 means the key is invalid.
2. Keep the same return structure: `{"content": …, "tokens": …}` on success, `{"error": …}` on failure.
3. The response path is `data["content"][0]["text"]`, and tokens are `data["usage"]["input_tokens"] + data["usage"]["output_tokens"]`.

Solution
def safe_call_claude(prompt, system="You are a helpful assistant."):
    api_key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        return {"error": "ANTHROPIC_API_KEY not set"}
    try:
        resp = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": api_key,
                "anthropic-version": "2023-06-01",
                "Content-Type": "application/json"
            },
            json={
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 200,
                "system": system,
                "messages": [{"role": "user", "content": prompt}]
            },
            timeout=30
        )
        if resp.status_code == 401:
            return {"error": "Invalid Anthropic API key."}
        elif resp.status_code == 429:
            return {"error": "Rate limit hit. Wait and retry."}
        elif resp.status_code != 200:
            return {"error": f"HTTP {resp.status_code}: {resp.text[:200]}"}
        data = resp.json()
        return {
            "content": data["content"][0]["text"],
            "tokens": data["usage"]["input_tokens"] + data["usage"]["output_tokens"]
        }
    except requests.exceptions.Timeout:
        return {"error": "Request timed out after 30s."}
    except requests.exceptions.ConnectionError:
        return {"error": "No internet connection."}
    except Exception as e:
        return {"error": f"Unexpected: {str(e)}"}

The Gemini version follows the same pattern. One twist: Gemini puts the key in the URL, so a 400 with `"API_KEY_INVALID"` in the body means a bad key (instead of a 401).
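
Here's a minimal sketch of that Gemini version. It mirrors the structure above; the API_KEY_INVALID check is a simple substring match on the error body, which works but isn't the only way Google reports a bad key.

def safe_call_gemini(prompt, system="You are a helpful assistant."):
    api_key = os.environ.get("GOOGLE_API_KEY", "")
    if not api_key:
        return {"error": "GOOGLE_API_KEY not set"}
    try:
        resp = requests.post(
            f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={api_key}",
            headers={"Content-Type": "application/json"},
            json={
                "system_instruction": {"parts": [{"text": system}]},
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
                "generationConfig": {"maxOutputTokens": 200}
            },
            timeout=30
        )
        # Gemini reports a bad key as a 400 with API_KEY_INVALID in the body.
        if resp.status_code == 400 and "API_KEY_INVALID" in resp.text:
            return {"error": "Invalid Google API key."}
        elif resp.status_code == 429:
            return {"error": "Rate limit hit. Wait and retry."}
        elif resp.status_code != 200:
            return {"error": f"HTTP {resp.status_code}: {resp.text[:200]}"}
        data = resp.json()
        return {
            "content": data["candidates"][0]["content"]["parts"][0]["text"],
            "tokens": data.get("usageMetadata", {}).get("totalTokenCount", 0)
        }
    except requests.exceptions.Timeout:
        return {"error": "Request timed out after 30s."}
    except requests.exceptions.ConnectionError:
        return {"error": "No internet connection."}
    except Exception as e:
        return {"error": f"Unexpected: {str(e)}"}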


Build a Unified LLM API Wrapper

Remembering three different API formats gets old fast. What if you could call any provider with one function?

That’s exactly what we’ll build. Pass the provider name as a string, and the function picks the right endpoint, headers, and response path for you. No more copy-pasting boilerplate.

def call_llm(prompt, provider="openai", system="You are a helpful assistant.", max_tokens=200):
    """Call any LLM provider with one consistent interface."""

    if provider == "openai":
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
                "Content-Type": "application/json"
            },
            json={
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt}
                ],
                "max_tokens": max_tokens
            },
            timeout=30
        )
        return resp.json()["choices"][0]["message"]["content"]

    elif provider == "claude":
        resp = requests.post(
            "https://api.anthropic.com/v1/messages",
            headers={
                "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
                "anthropic-version": "2023-06-01",
                "Content-Type": "application/json"
            },
            json={
                "model": "claude-sonnet-4-20250514",
                "max_tokens": max_tokens,
                "system": system,
                "messages": [{"role": "user", "content": prompt}]
            },
            timeout=30
        )
        return resp.json()["content"][0]["text"]

    elif provider == "gemini":
        key = os.environ.get("GOOGLE_API_KEY", "")
        resp = requests.post(
            f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={key}",
            headers={"Content-Type": "application/json"},
            json={
                "system_instruction": {"parts": [{"text": system}]},
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
                "generationConfig": {"maxOutputTokens": max_tokens}
            },
            timeout=30
        )
        return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

    else:
        raise ValueError(f"Unknown provider: {provider}")

Three providers, one function call. Here it is in action:

for provider in ["openai", "claude", "gemini"]:
    answer = call_llm("What is Python's GIL in one sentence?", provider=provider)
    print(f"{provider:>8}: {answer}")
Output:
  openai: The GIL (Global Interpreter Lock) is a mutex that allows only one thread to execute Python bytecode at a time, limiting true parallelism in multi-threaded programs.
  claude: Python's GIL (Global Interpreter Lock) is a mutex that prevents multiple native threads from executing Python bytecodes simultaneously, effectively making CPU-bound multi-threaded programs run on a single core.
  gemini: The GIL is a mutex in CPython that allows only one thread to hold control of the Python interpreter at a time, limiting multi-threaded CPU-bound performance.

You could extend this to support Ollama (local models), Groq (fast inference), or any other provider. For production use, libraries like LiteLLM take this further with 100+ providers under one interface.
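
As an illustration, here's roughly what an Ollama branch could look like as a standalone helper. It assumes a local Ollama server on its default port exposing /api/chat; the endpoint and field names match Ollama's chat API as I understand it, but verify against the Ollama docs before wiring it into call_llm.

def call_ollama(prompt, system="You are a helpful assistant.", model="llama3.1"):
    """Sketch of a local-model branch you could fold into call_llm().
    Assumes Ollama is running locally (default port 11434) and the model is pulled."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt}
            ],
            "stream": False
        },
        timeout=60
    )
    return resp.json()["message"]["content"]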

Multi-Turn Conversations

So far we’ve sent single messages. But real chatbots need memory — they need to know what was said before.

Here’s the trick: you maintain the conversation yourself. Every time the model responds, you append its reply to the messages list. Then you send the whole list with your next message. The model reads the full history and responds in context.

This works identically across all three providers. Here’s how it looks with OpenAI:

conversation = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What does enumerate() do?"}
]

# First turn
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json"
    },
    json={"model": "gpt-4o-mini", "messages": conversation, "max_tokens": 150}
)
assistant_reply = resp.json()["choices"][0]["message"]["content"]
print("Assistant:", assistant_reply)

# Append the reply, then ask a follow-up
conversation.append({"role": "assistant", "content": assistant_reply})
conversation.append({"role": "user", "content": "Show me an example with a list of fruits."})

# Second turn -- model remembers the context
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json"
    },
    json={"model": "gpt-4o-mini", "messages": conversation, "max_tokens": 200}
)
print("Assistant:", resp.json()["choices"][0]["message"]["content"])

The model sees the full conversation history and responds accordingly. It knows you were talking about enumerate() and gives a fruit-based example.

Watch your token costs. Every turn sends the ENTIRE conversation history. A 20-message chat sends all 20 messages every time. Long conversations get expensive. In production, you’d trim older messages or summarize them.
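
One simple way to do that trimming: keep the system message and only the most recent turns. A minimal sketch (the cutoff of 10 is arbitrary; tune it to your token budget):

def trim_history(conversation, keep_last=10):
    """Keep the system message(s) plus the last `keep_last` non-system messages."""
    system_msgs = [m for m in conversation if m["role"] == "system"]
    other_msgs = [m for m in conversation if m["role"] != "system"]
    return system_msgs + other_msgs[-keep_last:]

# Before each request:
# conversation = trim_history(conversation, keep_last=10)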

For Claude, the only difference is that the system prompt goes in the top-level system field instead of the messages array. For Gemini, swap messages for contents and use the parts structure. The conversation pattern stays the same.

Streaming Responses

By default, you wait for the model to finish its entire response before seeing anything. Streaming changes that — you get tokens as they’re generated, one chunk at a time. It’s what makes ChatGPT feel responsive.

Here’s streaming with OpenAI. The key change: add "stream": True and read the response line by line instead of calling .json():

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Write a haiku about Python."}],
        "stream": True
    },
    stream=True
)

for line in resp.iter_lines():
    if line:
        text = line.decode("utf-8")
        if text.startswith("data: ") and text != "data: [DONE]":
            chunk = json.loads(text[6:])
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                print(delta["content"], end="", flush=True)
print()  # newline at the end

For Claude, streaming uses Server-Sent Events with different event types. Here’s the pattern:

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json"
    },
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 150,
        "messages": [{"role": "user", "content": "Write a haiku about Python."}],
        "stream": True
    },
    stream=True
)

for line in resp.iter_lines():
    if line:
        text = line.decode("utf-8")
        if text.startswith("data: "):
            data = json.loads(text[6:])
            if data["type"] == "content_block_delta":
                print(data["delta"]["text"], end="", flush=True)
print()

When should you stream? Always stream in user-facing apps. Nobody likes staring at a blank screen for 3 seconds. For batch processing (no human waiting), skip streaming — it adds code complexity for no benefit.

When to Use Which Provider

After working with all three, here’s how I think about choosing:

| Use Case | Best Pick | Why |
|---|---|---|
| General tasks, wide compatibility | OpenAI GPT-4o-mini | Largest ecosystem, most tutorials, cheapest capable model |
| Strict instruction-following | Claude Sonnet 4 | Follows complex prompts most reliably |
| Budget-sensitive projects | Gemini 2.0 Flash | Cheapest per token, generous free tier |
| Long documents (100K+ tokens) | Claude or Gemini | Both handle very long contexts well |
| Image + text input | Gemini 2.0 Flash | Built multimodal from the start |
| Code generation | Claude Sonnet 4 | Strong at writing and debugging code |

Don’t overthink this choice. Start with the cheapest model that handles your task. Test with 10-20 real examples. Switch only if quality falls short.
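
A quick way to run that test is a small loop over prompts pulled from your actual task. The prompts below are placeholders, so swap in real examples:

# Hypothetical test prompts -- replace with 10-20 real examples from your task.
test_prompts = [
    "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
    "Extract the year from: 'The library was founded in 1987.'",
    "Classify the sentiment of: 'The checkout process was painless.'",
]

for p in test_prompts:
    answer = call_llm(p, provider="gemini")   # start with the cheapest model
    print(f"PROMPT:   {p}")
    print(f"RESPONSE: {answer}\n")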

API Pricing — What Does Each Call Cost?

Understanding pricing prevents bill shock. Here’s what these models cost as of early 2026:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | Best value for most tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Strong at instructions |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest, generous free tier |
| GPT-4o | $2.50 | $10.00 | Most capable OpenAI model |
| Claude Haiku 3.5 | $0.80 | $4.00 | Fast and affordable |

To put this in perspective: 1 million tokens is roughly 750,000 words. That’s about 10 novels. Our tutorial prompts used fewer than 100 tokens each. You’d need thousands of calls before spending a dollar.

Start with the cheapest model that works. GPT-4o-mini and Gemini Flash handle most tasks well. Only upgrade to bigger models when you’ve confirmed the cheaper option isn’t good enough for your specific task.

Exercise 3: Add Cost Estimation

Task: Write a function call_llm_with_cost that wraps call_llm and returns a dictionary with response, latency, estimated_tokens, and estimated_cost.

Use this pricing (averaged input+output per million tokens): GPT-4o-mini: $0.375, Claude Sonnet 4: $9.00, Gemini Flash: $0.25.

Hints

1. Wrap `call_llm` with `time.time()` before and after for latency.
2. Estimate tokens from character count: 1 token is roughly 4 characters.
3. Cost formula: `estimated_tokens / 1_000_000 * price_per_million`.

Solution
PRICING = {
    "openai": 0.375,
    "claude": 9.00,
    "gemini": 0.25
}

def call_llm_with_cost(prompt, provider="openai", system="You are a helpful assistant."):
    start = time.time()
    response_text = call_llm(prompt, provider=provider, system=system)
    latency = round(time.time() - start, 2)

    estimated_tokens = (len(prompt) + len(response_text)) // 4
    cost = estimated_tokens / 1_000_000 * PRICING.get(provider, 0)

    return {
        "provider": provider,
        "response": response_text,
        "latency": latency,
        "estimated_tokens": estimated_tokens,
        "estimated_cost": f"${cost:.6f}"
    }

for p in ["openai", "claude", "gemini"]:
    r = call_llm_with_cost("Explain decorators in Python in 2 sentences.", provider=p)
    print(f"{r['provider']:>8} | {r['latency']}s | ~{r['estimated_tokens']} tokens | {r['estimated_cost']}")

Your output will look something like this:

  openai | 1.1s | ~52 tokens | $0.000020
  claude | 1.3s | ~58 tokens | $0.000522
  gemini | 0.8s | ~48 tokens | $0.000012

The character-based estimate isn’t precise. In production, you’d parse actual token counts from each provider’s response. But it’s close enough to build cost awareness early.
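
For reference, here's where the real counts live in each raw response; the paths match the earlier examples and the cheat sheet below:

def extract_tokens(provider, data):
    """Pull the actual token count out of a raw response dict (not the wrapper's text)."""
    if provider == "openai":
        return data["usage"]["total_tokens"]
    elif provider == "claude":
        return data["usage"]["input_tokens"] + data["usage"]["output_tokens"]
    elif provider == "gemini":
        return data.get("usageMetadata", {}).get("totalTokenCount", 0)
    raise ValueError(f"Unknown provider: {provider}")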


Quick Reference — LLM API Cheat Sheet

| Feature | OpenAI | Claude | Gemini |
|---|---|---|---|
| Endpoint | /v1/chat/completions | /v1/messages | /v1beta/models/{model}:generateContent |
| Auth | Authorization: Bearer KEY | x-api-key: KEY | ?key=KEY in URL |
| System prompt | In messages array | Top-level system field | system_instruction object |
| User message | {"role": "user", "content": "..."} | Same | {"role": "user", "parts": [{"text": "..."}]} |
| Response text | choices[0].message.content | content[0].text | candidates[0].content.parts[0].text |
| Token usage | usage.total_tokens | usage.input_tokens + usage.output_tokens | usageMetadata.totalTokenCount |
| Streaming | "stream": true | "stream": true | :streamGenerateContent endpoint |
| Temperature | 0–2 | 0–1 | 0–2 |
| Max tokens | max_tokens | max_tokens | maxOutputTokens |

Bookmark this table. You’ll come back to it every time you switch between providers.

Complete Code

Full script (copy-paste and run):
# Complete code: LLM API in Python -- Call OpenAI, Claude, and Gemini
# Requires: pip install requests python-dotenv
# Python 3.10+

import os
import time
import json
import requests
from dotenv import load_dotenv

load_dotenv()

# --- Setup: API Keys ---
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

# --- Call OpenAI ---
response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful Python tutor."},
            {"role": "user", "content": "Explain list comprehensions in one sentence."}
        ],
        "temperature": 0.7,
        "max_tokens": 150
    }
)
data = response.json()
print("OpenAI:", data["choices"][0]["message"]["content"])

# --- Call Claude ---
response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json"
    },
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 150,
        "system": "You are a helpful Python tutor.",
        "messages": [
            {"role": "user", "content": "Explain list comprehensions in one sentence."}
        ]
    }
)
data = response.json()
print("Claude:", data["content"][0]["text"])

# --- Call Gemini ---
response = requests.post(
    f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={GOOGLE_API_KEY}",
    headers={"Content-Type": "application/json"},
    json={
        "system_instruction": {"parts": [{"text": "You are a helpful Python tutor."}]},
        "contents": [{"role": "user", "parts": [{"text": "Explain list comprehensions in one sentence."}]}],
        "generationConfig": {"temperature": 0.7, "maxOutputTokens": 150}
    }
)
data = response.json()
print("Gemini:", data["candidates"][0]["content"]["parts"][0]["text"])

print("\nAll three providers called successfully.")

FAQ

Can I use these APIs without paying?

Google Gemini has a generous free tier for testing. OpenAI and Anthropic give small signup credits. This whole tutorial costs under one cent.

# Check your OpenAI usage at any time:
# https://platform.openai.com/usage
# Check Anthropic: https://console.anthropic.com/settings/billing
# Check Google: https://aistudio.google.com/apikey

Which provider should I pick for my project?

Start with the cheapest model that meets your quality bar. For most tasks, GPT-4o-mini or Gemini Flash work great. Here’s a quick way to test all three on YOUR specific task:

my_task = "Summarize this paragraph in 2 sentences: [your text here]"
for p in ["openai", "claude", "gemini"]:
    print(f"{p}: {call_llm(my_task, provider=p)}\n")

Can I switch providers without rewriting my code?

That’s exactly what the call_llm wrapper does. For production, LiteLLM supports 100+ providers under one interface.

What’s the difference between temperature and max_tokens?

temperature controls randomness (0 = deterministic, 1+ = creative). max_tokens caps response length. They’re independent — you can have a short creative response or a long deterministic one.

Do I need the official SDKs?

Not for basic calls. We used raw requests here, which works everywhere. The official SDKs (openai, anthropic, google-generativeai) add convenience features: automatic retries, type hints, streaming helpers, and async support. They’re worth it for production code.
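
For comparison, the same first call through the official openai package looks roughly like this (assuming the v1-style client interface; the anthropic and google-generativeai packages offer similar clients):

# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful Python tutor."},
        {"role": "user", "content": "Explain list comprehensions in one sentence."}
    ],
    max_tokens=150,
)
print(completion.choices[0].message.content)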

What’s Next?

You’ve called three LLM APIs from Python, compared their responses, and built a unified wrapper. That’s a solid foundation for any AI-powered application.

From here, four directions are worth exploring:

  • Function calling — let the LLM trigger your Python functions based on the conversation
  • Structured output — force the model to return JSON matching a specific schema
  • RAG (Retrieval-Augmented Generation) — feed the model your own documents for grounded answers
  • Building agents — combine LLM calls with tools to automate multi-step workflows

Each builds directly on the message format and roles you learned today.

References

  1. OpenAI API Reference — platform.openai.com/docs/api-reference
  2. Anthropic Claude API Reference — docs.anthropic.com/en/api
  3. Google Gemini API Reference — ai.google.dev/gemini-api/docs
  4. OpenAI Pricing — openai.com/api/pricing
  5. Anthropic Pricing — anthropic.com/pricing
  6. Google AI Studio — aistudio.google.com
  7. LiteLLM — Unified LLM API — github.com/BerriAI/litellm
  8. Python requests library docs — docs.python-requests.org

Last reviewed: March 2026 | Python 3.10+ | OpenAI API v1 | Claude API 2023-06-01 | Gemini API v1beta
