OpenAI, Claude & Gemini API Tutorial in Python (2026)
Learn to call OpenAI, Claude, and Gemini APIs from Python in 15 minutes. Includes code examples, error handling, streaming, and a unified wrapper.
You’ve heard about GPT, Claude, and Gemini. But have you actually called one from your own Python script? It’s way easier than you think — and you’ll have all three running before your coffee gets cold.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
What Is an LLM API?
Every time you type into ChatGPT, your browser sends an HTTP request to an API behind the scenes. The chat window? Just a wrapper. The real power sits behind it.
An API (Application Programming Interface) lets you talk to these models from code. You send a message, the model sends a reply. That’s the whole idea.
Why should you care? Because the API gives you control the chat window never will.
You pick the model. You set how creative or deterministic the output is. You define the system prompt. You parse the response however you want. You can process a thousand documents while you sleep.
Three providers dominate right now:
| Provider | Top Models | Known For |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | Largest ecosystem, widest adoption |
| Anthropic | Claude Sonnet 4, Claude Haiku 3.5 | Precise instruction-following, long context |
| Google | Gemini 2.0 Flash, Gemini 1.5 Pro | Multimodal tasks, generous free tier |
By the end of this tutorial, you’ll call all three from Python and compare their responses side-by-side.
Prerequisites
- Python version: 3.10+
- Required library: requests (comes with most Python setups; pip install requests if needed)
- API keys: One from each provider (setup below)
- Time to complete: ~15 minutes
- Cost: Under $0.01 total
Get Your API Keys
Before any code runs, you need API keys. Each provider gives you one. It’s how they know who’s making the request and who to bill.
OpenAI:
1. Go to platform.openai.com/api-keys
2. Click “Create new secret key”
3. Copy it immediately — you won’t see it again
Anthropic (Claude):
1. Go to console.anthropic.com/settings/keys
2. Click “Create Key”
3. Copy and save it
Google (Gemini):
1. Go to aistudio.google.com/apikey
2. Click “Create API Key”
3. Select a project and copy the key
Free tier alert: Google gives a generous free tier for Gemini. OpenAI and Anthropic give small signup credits. This whole tutorial costs under $0.01.
Store your keys as environment variables. Never hardcode them in scripts you’ll share or commit.
The cleanest approach is a .env file with python-dotenv. Here’s the full setup:
# First: pip install python-dotenv
# Create a file called .env in your project folder:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# GOOGLE_API_KEY=AIza...
from dotenv import load_dotenv
import os
load_dotenv() # reads .env into environment variables
# Now these work:
print(os.environ.get("OPENAI_API_KEY", "")[:8] + "...")
Output:
sk-proj-...
If you’d rather skip the .env file for now, you can set keys directly in Python. It’s fine for learning — just don’t do it in production code.
import os

os.environ["OPENAI_API_KEY"] = "your-key-here"
os.environ["ANTHROPIC_API_KEY"] = "your-key-here"
os.environ["GOOGLE_API_KEY"] = "your-key-here"
Never commit API keys to git. Add .env to your .gitignore file. Leaked keys get abused within minutes.
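A small guard at startup saves confusing failures later: check that the keys exist before any request goes out. Here's a minimal sketch (the require_keys helper is our own, not part of any library; the demo value is set only so the check passes):

```python
import os

def require_keys(*names):
    """Raise early with a clear message if any API key is missing."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

os.environ["OPENAI_API_KEY"] = "sk-test"  # demo value for illustration only
require_keys("OPENAI_API_KEY")  # passes silently when the key is set
```

Calling this once at the top of a script turns a cryptic 401 three functions deep into an immediate, readable error.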
Your First LLM API Call — OpenAI
Here’s the core idea behind every LLM API call in Python. You send a list of messages with roles. The model reads them and responds.
There are three roles:
- system — tells the model HOW to behave (personality, rules, constraints)
- user — that's you, or your app's user
- assistant — the model's previous responses (used for multi-turn chats)
Think of it like a script for a play. Each message has a speaker and their line. The model reads the whole script and writes the next line.
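In Python, that script is just a list of dictionaries. A minimal sketch of what a history looks like mid-conversation (the content strings are made up for illustration):

```python
# A conversation is a plain list of role/content dictionaries.
# The model reads all of it and writes the next "assistant" line.
messages = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What does enumerate() do?"},
    {"role": "assistant", "content": "It pairs each item with its index."},
    {"role": "user", "content": "Show me an example."},  # the model answers this next
]
```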
We’ll use the requests library for all API calls. No SDKs needed. This approach runs anywhere — even in the browser with Pyodide.
The code below sends one user message with a system prompt to GPT-4o-mini. Pay attention to two things: the messages array structure and the Authorization header format.
import requests
import os
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
response = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {OPENAI_API_KEY}",
"Content-Type": "application/json"
},
json={
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": "You are a helpful Python tutor."},
{"role": "user", "content": "Explain list comprehensions in one sentence."}
],
"temperature": 0.7,
"max_tokens": 150
}
)
data = response.json()
print(data["choices"][0]["message"]["content"])
Output:
A list comprehension is a concise way to create a new list by applying an expression to each item in an iterable, optionally filtering items with a condition, all in a single readable line.
That’s it. You just called GPT-4o-mini from Python. Your first LLM API request is done.
Two parameters to know right away:
- temperature controls randomness. Set it to 0 for deterministic output, 1.0 for creative output. I usually start at 0.7.
- max_tokens caps the response length. Always set this — otherwise the model might generate thousands of tokens, and you pay for every one.
What the Response JSON Looks Like
The response has useful metadata beyond just the text. Let’s look at the full structure so you know what’s available.
import json

print(json.dumps(data, indent=2))
Output:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "A list comprehension is a concise way to..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 28,
"completion_tokens": 42,
"total_tokens": 70
}
}
Three fields worth bookmarking:
- choices[0].message.content — the actual response text
- choices[0].finish_reason — "stop" means it finished naturally; "length" means it hit your max_tokens limit
- usage — token counts for billing (input + output)
You pay for every token, input AND output. A token is roughly 4 characters in English. The usage field tells you exactly how many tokens each request consumed.
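You can sanity-check a prompt's size before sending it using that 4-characters-per-token rule. This helper is a rough heuristic only (real tokenizers like tiktoken give exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token in English text."""
    return max(1, len(text) // 4)

prompt = "Explain list comprehensions in one sentence."
print(estimate_tokens(prompt))  # 11 -- the API reported 28 total, which includes the system prompt and message framing
```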
Call the Claude API
Claude’s API follows the same concept, but Anthropic made a few design choices that’ll trip you up if you’re not ready for them. The biggest one: the system prompt lives outside the messages array.
Here are the four key differences from OpenAI you should watch for: the x-api-key header, the anthropic-version header, the top-level system field, and the response path (content[0].text instead of choices[0].message.content).
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
response = requests.post(
"https://api.anthropic.com/v1/messages",
headers={
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
},
json={
"model": "claude-sonnet-4-20250514",
"max_tokens": 150,
"system": "You are a helpful Python tutor.",
"messages": [
{"role": "user", "content": "Explain list comprehensions in one sentence."}
]
}
)
data = response.json()
print(data["content"][0]["text"])
Output:
A list comprehension lets you build a new list by writing a for-loop and an optional if-filter inside square brackets, replacing what would otherwise take three or four lines of code with a single expressive line.
Here’s a quick summary of the differences:
| Feature | OpenAI | Claude |
|---|---|---|
| Auth header | Authorization: Bearer KEY | x-api-key: KEY |
| Version header | Not required | anthropic-version (required) |
| System prompt | Inside messages array | Top-level system field |
| Response path | choices[0].message.content | content[0].text |
| Token fields | total_tokens | input_tokens + output_tokens |
These differences are small, but they’ll bite you if you switch providers without checking the docs first.
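One way to tame them is a tiny extraction helper that knows both response shapes. A sketch, with sample payloads shaped like the responses above:

```python
def extract_text(provider: str, data: dict) -> str:
    """Pull the response text out of an OpenAI- or Claude-shaped JSON payload."""
    if provider == "openai":
        return data["choices"][0]["message"]["content"]
    elif provider == "claude":
        return data["content"][0]["text"]
    raise ValueError(f"Unknown provider: {provider}")

# Sample payloads mimicking each provider's response structure:
openai_data = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
claude_data = {"content": [{"type": "text", "text": "Hello!"}]}
print(extract_text("openai", openai_data))  # Hello!
print(extract_text("claude", claude_data))  # Hello!
```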
Call the Gemini API
Google’s Gemini API has the most different structure of the three. The model name goes in the URL itself. Messages use parts instead of content.
Why parts? Because Gemini was built multimodal from the start. You can mix text, images, and audio in the same message. That parts array supports all of them. For plain text it feels verbose, but it shines when you start sending images.
Watch for these differences: API key in the URL (not headers), contents instead of messages, candidates instead of choices, and camelCase parameter names.
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")
response = requests.post(
f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={GOOGLE_API_KEY}",
headers={"Content-Type": "application/json"},
json={
"system_instruction": {
"parts": [{"text": "You are a helpful Python tutor."}]
},
"contents": [
{
"role": "user",
"parts": [{"text": "Explain list comprehensions in one sentence."}]
}
],
"generationConfig": {
"temperature": 0.7,
"maxOutputTokens": 150
}
}
)
data = response.json()
print(data["candidates"][0]["content"]["parts"][0]["text"])
Output:
List comprehensions provide a compact syntax for creating lists by applying an expression to each element of an iterable, with an optional filtering condition, all written within square brackets on a single line.
You’ve now called all three providers. Same question, three slightly different answers. Each has its own response format, but the core idea is identical: send messages, get text back.
Compare All Three Side-by-Side
This is where things get interesting. Let’s send the same prompt to all three and compare responses, speed, and token usage in one shot.
I’ll build a helper function for each provider. Each wraps the API call, times it, and returns a consistent dictionary. That way you can loop through providers without juggling format differences.
Here’s the OpenAI helper. It grabs the API key, posts the request, and packages the result into a clean dictionary with provider, response, tokens, and latency:
import time
def call_openai(prompt, system="You are a helpful assistant."):
start = time.time()
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
"Content-Type": "application/json"
},
json={
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": prompt}
],
"temperature": 0.7, "max_tokens": 200
}
)
elapsed = time.time() - start
data = resp.json()
return {
"provider": "OpenAI (gpt-4o-mini)",
"response": data["choices"][0]["message"]["content"],
"tokens": data["usage"]["total_tokens"],
"latency": round(elapsed, 2)
}
The Claude helper follows the same pattern but swaps in the Anthropic header format and response path:
def call_claude(prompt, system="You are a helpful assistant."):
start = time.time()
resp = requests.post(
"https://api.anthropic.com/v1/messages",
headers={
"x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
},
json={
"model": "claude-sonnet-4-20250514",
"max_tokens": 200,
"system": system,
"messages": [{"role": "user", "content": prompt}]
}
)
elapsed = time.time() - start
data = resp.json()
return {
"provider": "Claude (claude-sonnet-4)",
"response": data["content"][0]["text"],
"tokens": data["usage"]["input_tokens"] + data["usage"]["output_tokens"],
"latency": round(elapsed, 2)
}
And the Gemini helper — note the API key in the URL and the nested parts structure:
def call_gemini(prompt, system="You are a helpful assistant."):
start = time.time()
resp = requests.post(
f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={os.environ.get('GOOGLE_API_KEY', '')}",
headers={"Content-Type": "application/json"},
json={
"system_instruction": {"parts": [{"text": system}]},
"contents": [{"role": "user", "parts": [{"text": prompt}]}],
"generationConfig": {"temperature": 0.7, "maxOutputTokens": 200}
}
)
elapsed = time.time() - start
data = resp.json()
return {
"provider": "Gemini (gemini-2.0-flash)",
"response": data["candidates"][0]["content"]["parts"][0]["text"],
"tokens": data.get("usageMetadata", {}).get("totalTokenCount", 0),
"latency": round(elapsed, 2)
}
With all three helpers ready, here’s the comparison. Same prompt, three providers, printed together:
prompt = "What are the top 3 tips for writing clean Python code? Be concise."
results = [call_openai(prompt), call_claude(prompt), call_gemini(prompt)]
for r in results:
print(f"\n{'='*60}")
print(f"Provider: {r['provider']}")
print(f"Latency: {r['latency']}s | Tokens: {r['tokens']}")
print(f"{'='*60}")
print(r["response"])
Your output will look something like this (exact responses and timings vary with each run):
============================================================
Provider: OpenAI (gpt-4o-mini)
Latency: 1.2s | Tokens: 90
============================================================
1. Use meaningful variable names that describe what they hold.
2. Follow PEP 8 style guidelines for consistent formatting.
3. Write small, focused functions that do one thing well.
============================================================
Provider: Claude (claude-sonnet-4)
Latency: 1.4s | Tokens: 95
============================================================
1. Use descriptive names -- variables, functions, and classes
should reveal their purpose without needing comments.
2. Keep functions short and focused -- each function should
do exactly one thing.
3. Follow PEP 8 -- consistent style makes code readable for
everyone, including future you.
============================================================
Provider: Gemini (gemini-2.0-flash)
Latency: 0.9s | Tokens: 80
============================================================
1. Use descriptive variable and function names.
2. Follow PEP 8 for consistent code style.
3. Keep functions small and focused on a single task.
Same prompt, three different styles. Claude tends to give more detail. Gemini Flash is often the fastest. OpenAI lands in the middle. Your exact latencies will vary depending on network conditions and server load.
No single provider is “best” for everything. OpenAI has the largest ecosystem. Claude excels at instruction-following and long documents. Gemini offers the best price-to-performance with native multimodal support. I usually start with the cheapest model and only upgrade when it falls short.
Exercise 1: Call All Three Providers
Task: Change the prompt to ask: “What is the difference between a list and a tuple in Python? Answer in exactly 2 sentences.”
Compare the three responses. Which provider followed “exactly 2 sentences” most precisely?
Handle Errors Like a Professional
API calls fail. Networks drop. Rate limits hit. Keys expire. If your code doesn’t handle these, it’ll crash at the worst possible time.
Here’s what production-quality error handling looks like for OpenAI. The key additions: check the API key before calling, set a timeout, and handle specific HTTP status codes with clear messages.
def safe_call_openai(prompt, system="You are a helpful assistant."):
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
return {"error": "OPENAI_API_KEY not set"}
try:
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
},
json={
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": prompt}
],
"max_tokens": 200
},
timeout=30
)
if resp.status_code == 401:
return {"error": "Invalid API key."}
elif resp.status_code == 429:
return {"error": "Rate limit hit. Wait and retry."}
elif resp.status_code != 200:
return {"error": f"HTTP {resp.status_code}: {resp.text[:200]}"}
data = resp.json()
return {
"content": data["choices"][0]["message"]["content"],
"tokens": data["usage"]["total_tokens"]
}
except requests.exceptions.Timeout:
return {"error": "Request timed out after 30s."}
except requests.exceptions.ConnectionError:
return {"error": "No internet connection."}
except Exception as e:
return {"error": f"Unexpected: {str(e)}"}
Test it like this:
result = safe_call_openai("Say hello in 5 words.")
if "error" in result:
print(f"Failed: {result['error']}")
else:
print(f"Response: {result['content']}")
print(f"Tokens used: {result['tokens']}")
Output:
Response: Hello there, how are you?
Tokens used: 35
The three most common errors you’ll hit:
- 401 Unauthorized — wrong or expired API key
- 429 Rate Limited — too many requests too fast
- 500/503 Server Error — provider is having issues (retry after a short wait)
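For 429 and 5xx errors, the standard fix is retry with exponential backoff. Here's a minimal sketch around any callable that returns a status code; the flaky_call below is a stand-in for a real request so you can see the retry loop work:

```python
import time

def with_retries(call, max_attempts=4, base_delay=1.0):
    """Retry `call` on 429/5xx, doubling the wait between attempts."""
    for attempt in range(max_attempts):
        status, payload = call()
        if status == 200:
            return payload
        if status in (429, 500, 503) and attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            continue
        raise RuntimeError(f"Giving up after HTTP {status}")

# Fake call: fails twice with 429, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    return (429, None) if attempts["n"] < 3 else (200, "ok")

print(with_retries(flaky_call, base_delay=0.01))  # ok
```

In real code, `call` would wrap the requests.post and return (resp.status_code, resp).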
Always set a timeout. Without it, a hung connection blocks your code forever. I've seen scripts freeze for hours because someone forgot this one parameter.
Exercise 2: Add Error Handling for Claude
Task: Write a safe_call_claude function that mirrors safe_call_openai. Handle the same error cases: missing key, auth error, rate limit, and timeout.
Build a Unified LLM API Wrapper
Remembering three different API formats gets old fast. What if you could call any provider with one function?
That’s exactly what we’ll build. Pass the provider name as a string, and the function picks the right endpoint, headers, and response path for you. No more copy-pasting boilerplate.
def call_llm(prompt, provider="openai", system="You are a helpful assistant.", max_tokens=200):
"""Call any LLM provider with one consistent interface."""
if provider == "openai":
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
"Content-Type": "application/json"
},
json={
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": prompt}
],
"max_tokens": max_tokens
},
timeout=30
)
return resp.json()["choices"][0]["message"]["content"]
elif provider == "claude":
resp = requests.post(
"https://api.anthropic.com/v1/messages",
headers={
"x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
},
json={
"model": "claude-sonnet-4-20250514",
"max_tokens": max_tokens,
"system": system,
"messages": [{"role": "user", "content": prompt}]
},
timeout=30
)
return resp.json()["content"][0]["text"]
elif provider == "gemini":
key = os.environ.get("GOOGLE_API_KEY", "")
resp = requests.post(
f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={key}",
headers={"Content-Type": "application/json"},
json={
"system_instruction": {"parts": [{"text": system}]},
"contents": [{"role": "user", "parts": [{"text": prompt}]}],
"generationConfig": {"maxOutputTokens": max_tokens}
},
timeout=30
)
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
else:
raise ValueError(f"Unknown provider: {provider}")
Three providers, one function call. Here it is in action:
for provider in ["openai", "claude", "gemini"]:
answer = call_llm("What is Python's GIL in one sentence?", provider=provider)
print(f"{provider:>8}: {answer}")
Output:
openai: The GIL (Global Interpreter Lock) is a mutex that allows only one thread to execute Python bytecode at a time, limiting true parallelism in multi-threaded programs.
claude: Python's GIL (Global Interpreter Lock) is a mutex that prevents multiple native threads from executing Python bytecodes simultaneously, effectively making CPU-bound multi-threaded programs run on a single core.
gemini: The GIL is a mutex in CPython that allows only one thread to hold control of the Python interpreter at a time, limiting multi-threaded CPU-bound performance.
You could extend this to support Ollama (local models), Groq (fast inference), or any other provider. For production use, libraries like LiteLLM take this further with 100+ providers under one interface.
Multi-Turn Conversations
So far we’ve sent single messages. But real chatbots need memory — they need to know what was said before.
Here’s the trick: you maintain the conversation yourself. Every time the model responds, you append its reply to the messages list. Then you send the whole list with your next message. The model reads the full history and responds in context.
This works identically across all three providers. Here’s how it looks with OpenAI:
conversation = [
{"role": "system", "content": "You are a helpful Python tutor."},
{"role": "user", "content": "What does enumerate() do?"}
]
# First turn
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
"Content-Type": "application/json"
},
json={"model": "gpt-4o-mini", "messages": conversation, "max_tokens": 150}
)
assistant_reply = resp.json()["choices"][0]["message"]["content"]
print("Assistant:", assistant_reply)
# Append the reply, then ask a follow-up
conversation.append({"role": "assistant", "content": assistant_reply})
conversation.append({"role": "user", "content": "Show me an example with a list of fruits."})
# Second turn -- model remembers the context
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
"Content-Type": "application/json"
},
json={"model": "gpt-4o-mini", "messages": conversation, "max_tokens": 200}
)
print("Assistant:", resp.json()["choices"][0]["message"]["content"])
The model sees the full conversation history and responds accordingly. It knows you were talking about enumerate() and gives a fruit-based example.
Watch your token costs. Every turn sends the ENTIRE conversation history. A 20-message chat sends all 20 messages every time. Long conversations get expensive. In production, you’d trim older messages or summarize them.
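A simple trimming strategy: always keep the system message, then only the most recent messages. A sketch:

```python
def trim_history(messages, keep_last=6):
    """Keep the system message plus the most recent `keep_last` messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

# 1 system message + 20 chat messages -> trimmed to 1 + 6
history = [{"role": "system", "content": "Be helpful."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # 7
```

Summarizing the dropped messages into a single system note is the next step up, at the cost of an extra LLM call.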
For Claude, the only difference is that the system prompt goes in the top-level system field instead of the messages array. For Gemini, swap messages for contents and use the parts structure. The conversation pattern stays the same.
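If you keep one OpenAI-style history as your source of truth, the Gemini shape can be derived from it. A sketch (note that Gemini names the assistant role "model"; the helper and its return convention are our own):

```python
def to_gemini_contents(messages):
    """Convert OpenAI-style messages to Gemini's contents/parts structure.
    Returns (system_instruction, contents)."""
    system_text = " ".join(m["content"] for m in messages if m["role"] == "system")
    contents = [
        {"role": "model" if m["role"] == "assistant" else "user",
         "parts": [{"text": m["content"]}]}
        for m in messages if m["role"] != "system"
    ]
    return {"parts": [{"text": system_text}]}, contents

conversation = [
    {"role": "system", "content": "You are a helpful Python tutor."},
    {"role": "user", "content": "What does enumerate() do?"},
    {"role": "assistant", "content": "It pairs each item with its index."},
]
system_instruction, contents = to_gemini_contents(conversation)
print(contents[1]["role"])  # model
```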
Streaming Responses
By default, you wait for the model to finish its entire response before seeing anything. Streaming changes that — you get tokens as they’re generated, one chunk at a time. It’s what makes ChatGPT feel responsive.
Here’s streaming with OpenAI. The key change: add "stream": True and read the response line by line instead of calling .json():
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
"Content-Type": "application/json"
},
json={
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "Write a haiku about Python."}],
"stream": True
},
stream=True
)
for line in resp.iter_lines():
if line:
text = line.decode("utf-8")
if text.startswith("data: ") and text != "data: [DONE]":
chunk = json.loads(text[6:])
delta = chunk["choices"][0]["delta"]
if "content" in delta:
print(delta["content"], end="", flush=True)
print() # newline at the end
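The line-handling logic above is easy to get wrong, so it helps to pull it into a small function you can test without touching the network. A sketch matching OpenAI's "data: ..." chunk format:

```python
import json

def parse_openai_chunk(raw_line: bytes):
    """Return the text delta from one SSE line, or None if there is none."""
    text = raw_line.decode("utf-8")
    if not text.startswith("data: ") or text == "data: [DONE]":
        return None
    chunk = json.loads(text[6:])
    return chunk["choices"][0]["delta"].get("content")

# Simulated stream lines, shaped like real OpenAI SSE chunks:
lines = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b'data: [DONE]',
]
pieces = [p for p in (parse_openai_chunk(l) for l in lines) if p]
print("".join(pieces))  # Hello
```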
For Claude, streaming uses Server-Sent Events with different event types. Here’s the pattern:
resp = requests.post(
"https://api.anthropic.com/v1/messages",
headers={
"x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
},
json={
"model": "claude-sonnet-4-20250514",
"max_tokens": 150,
"messages": [{"role": "user", "content": "Write a haiku about Python."}],
"stream": True
},
stream=True
)
for line in resp.iter_lines():
if line:
text = line.decode("utf-8")
if text.startswith("data: "):
data = json.loads(text[6:])
if data["type"] == "content_block_delta":
print(data["delta"]["text"], end="", flush=True)
print()
When should you stream? Always stream in user-facing apps. Nobody likes staring at a blank screen for 3 seconds. For batch processing (no human waiting), skip streaming — it adds code complexity for no benefit.
When to Use Which Provider
After working with all three, here’s how I think about choosing:
| Use Case | Best Pick | Why |
|---|---|---|
| General tasks, wide compatibility | OpenAI GPT-4o-mini | Largest ecosystem, most tutorials, cheapest capable model |
| Strict instruction-following | Claude Sonnet 4 | Follows complex prompts most reliably |
| Budget-sensitive projects | Gemini 2.0 Flash | Cheapest per token, generous free tier |
| Long documents (100K+ tokens) | Claude or Gemini | Both handle very long contexts well |
| Image + text input | Gemini 2.0 Flash | Built multimodal from the start |
| Code generation | Claude Sonnet 4 | Strong at writing and debugging code |
Don’t overthink this choice. Start with the cheapest model that handles your task. Test with 10-20 real examples. Switch only if quality falls short.
API Pricing — What Does Each Call Cost?
Understanding pricing prevents bill shock. Here’s what these models cost as of early 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | Best value for most tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Strong at instructions |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest, generous free tier |
| GPT-4o | $2.50 | $10.00 | Most capable OpenAI model |
| Claude Haiku 3.5 | $0.80 | $4.00 | Fast and affordable |
To put this in perspective: 1 million tokens is roughly 750,000 words. That’s about 10 novels. Our tutorial prompts used fewer than 100 tokens each. You’d need thousands of calls before spending a dollar.
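As a concrete check, here's the cost of the first request in this tutorial (28 input + 42 output tokens), assuming the GPT-4o-mini rates from the table above:

```python
# GPT-4o-mini rates from the table: $0.15 per 1M input tokens, $0.60 per 1M output tokens
input_tokens, output_tokens = 28, 42
cost = input_tokens / 1e6 * 0.15 + output_tokens / 1e6 * 0.60
print(f"${cost:.8f}")  # $0.00002940 -- about three thousandths of a cent
```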
Start with the cheapest model that works. GPT-4o-mini and Gemini Flash handle most tasks well. Only upgrade to bigger models when you’ve confirmed the cheaper option isn’t good enough for your specific task.
Exercise 3: Add Cost Estimation
Task: Write a function call_llm_with_cost that wraps call_llm and returns a dictionary with response, latency, estimated_tokens, and estimated_cost.
Use this pricing (averaged input+output per million tokens): GPT-4o-mini: $0.375, Claude Sonnet 4: $9.00, Gemini Flash: $0.25.
Quick Reference — LLM API Cheat Sheet
| Feature | OpenAI | Claude | Gemini |
|---|---|---|---|
| Endpoint | /v1/chat/completions | /v1/messages | /v1beta/models/{model}:generateContent |
| Auth | Authorization: Bearer KEY | x-api-key: KEY | ?key=KEY in URL |
| System prompt | In messages array | Top-level system field | system_instruction object |
| User message | {"role": "user", "content": "..."} | Same | {"role": "user", "parts": [{"text": "..."}]} |
| Response text | choices[0].message.content | content[0].text | candidates[0].content.parts[0].text |
| Token usage | usage.total_tokens | usage.input_tokens + usage.output_tokens | usageMetadata.totalTokenCount |
| Streaming | "stream": true | "stream": true | :streamGenerateContent endpoint |
| Temperature | 0–2 | 0–1 | 0–2 |
| Max tokens | max_tokens | max_tokens | maxOutputTokens |
Bookmark this table. You’ll come back to it every time you switch between providers.
FAQ
Can I use these APIs without paying?
Google Gemini has a generous free tier for testing. OpenAI and Anthropic give small signup credits. This whole tutorial costs under one cent.
# Check your OpenAI usage at any time:
# https://platform.openai.com/usage
# Check Anthropic: https://console.anthropic.com/settings/billing
# Check Google: https://aistudio.google.com/apikey
Which provider should I pick for my project?
Start with the cheapest model that meets your quality bar. For most tasks, GPT-4o-mini or Gemini Flash work great. Here’s a quick way to test all three on YOUR specific task:
my_task = "Summarize this paragraph in 2 sentences: [your text here]"
for p in ["openai", "claude", "gemini"]:
print(f"{p}: {call_llm(my_task, provider=p)}\n")
Can I switch providers without rewriting my code?
That’s exactly what the call_llm wrapper does. For production, LiteLLM supports 100+ providers under one interface.
What’s the difference between temperature and max_tokens?
temperature controls randomness (0 = deterministic, 1+ = creative). max_tokens caps response length. They’re independent — you can have a short creative response or a long deterministic one.
Do I need the official SDKs?
Not for basic calls. We used raw requests here, which works everywhere. The official SDKs (openai, anthropic, google-generativeai) add convenience features: automatic retries, type hints, streaming helpers, and async support. They’re worth it for production code.
What’s Next?
You’ve called three LLM APIs from Python, compared their responses, and built a unified wrapper. That’s a solid foundation for any AI-powered application.
From here, four directions are worth exploring:
- Function calling — let the LLM trigger your Python functions based on the conversation
- Structured output — force the model to return JSON matching a specific schema
- RAG (Retrieval-Augmented Generation) — feed the model your own documents for grounded answers
- Building agents — combine LLM calls with tools to automate multi-step workflows
Each builds directly on the message format and roles you learned today.
References
- OpenAI API Reference — platform.openai.com/docs/api-reference
- Anthropic Claude API Reference — docs.anthropic.com/en/api
- Google Gemini API Reference — ai.google.dev/gemini-api/docs
- OpenAI Pricing — openai.com/api/pricing
- Anthropic Pricing — anthropic.com/pricing
- Google AI Studio — aistudio.google.com
- LiteLLM — Unified LLM API — github.com/BerriAI/litellm
- Python requests library docs — docs.python-requests.org
Last reviewed: March 2026 | Python 3.10+ | OpenAI API v1 | Claude API 2023-06-01 | Gemini API v1beta