Claude API Tutorial: Messages, Tools & Streaming
Master the Claude API with raw HTTP — messages, streaming, tool use, extended thinking, and prompt caching with runnable Python code examples.
Send messages, stream responses, call tools, and enable extended thinking — all with raw HTTP requests to the Claude API.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
You want to add Claude to your Python project. You open the docs and find a Python SDK. But what’s actually happening under the hood? What headers does the request need? What does the JSON look like?
This tutorial skips the SDK. You’ll build raw HTTP request bodies so you see exactly what goes over the wire. By the end, you’ll know how to send messages, stream responses, use tools, enable extended thinking, and cache prompts — all with code that runs in your browser.
What Is the Claude Messages API?
The Messages API is Claude’s single endpoint. Text, tool calls, thinking, streaming — everything goes through one URL:
https://api.anthropic.com/v1/messages
Every request needs three headers. Here’s what they look like.
import micropip
await micropip.install('requests')
import json
headers = {
"x-api-key": "sk-ant-your-key-here",
"anthropic-version": "2023-06-01",
"content-type": "application/json"
}
print("Required headers for every Claude API call:")
for key, value in headers.items():
print(f" {key}: {value}")
Required headers for every Claude API call:
x-api-key: sk-ant-your-key-here
anthropic-version: 2023-06-01
content-type: application/json
The x-api-key authenticates you. anthropic-version pins behavior to a stable release — 2023-06-01 is current as of 2026. content-type is always JSON.
Getting Your API Key
Go to console.anthropic.com and create a key under Settings. Copy it right away — Anthropic won’t show it again.
Store it as an environment variable. Never put keys in code.
# macOS / Linux
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"
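In Python, read the key from the environment at startup. A minimal sketch; `build_headers` is our own helper, not part of any SDK:

```python
import os

def build_headers(env=os.environ):
    """Assemble the three required headers, reading the key from the environment."""
    api_key = env.get("ANTHROPIC_API_KEY")
    if api_key is None:
        raise RuntimeError("Set ANTHROPIC_API_KEY before running this script.")
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
```

Failing fast with a clear message beats a cryptic 401 from the API later.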
Which Model Should You Pick?
Claude comes in several sizes. Here’s a quick guide.
| Model | Best For | Speed | Cost |
|---|---|---|---|
| Claude Opus 4 | Complex reasoning, research | Slowest | Highest |
| Claude Sonnet 4.5 | Balanced quality and speed | Medium | Medium |
| Claude Haiku 4.5 | Fast, simple tasks | Fastest | Lowest |
We use claude-sonnet-4-5-20250514 in this tutorial. It handles chat, tools, and thinking well.
```python
# Real HTTP call (not runnable in Pyodide)
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers=headers,
    json=request_body
)
data = resp.json()
```
<div class="callout callout-key-insight"><strong>Key Insight:</strong> <strong>One endpoint handles everything.</strong> Messages, streaming, tools, thinking, caching — the URL and headers stay the same. Only the request body changes.</div>
---
## How Does the Message Format Work? {#message-format}
Here's the biggest difference from OpenAI: the system prompt sits outside the messages array. It's a top-level field.
Why? The system prompt guides every response. It's not a turn — it's context. Keeping it separate makes that role clear.
```python
request_body = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 1024,
"system": "You are a data science tutor. Keep answers under 3 sentences.",
"messages": [
{"role": "user", "content": "What is gradient descent in one sentence?"}
]
}
print(json.dumps(request_body, indent=2))
```
{
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 1024,
"system": "You are a data science tutor. Keep answers under 3 sentences.",
"messages": [
{
"role": "user",
"content": "What is gradient descent in one sentence?"
}
]
}
See how system is at the same level as model? In OpenAI, you’d put {"role": "system", "content": "..."} inside the messages list. Claude’s design is cleaner.
Quick check: What happens if you add {"role": "system"} to the messages array? A 400 error. Claude only allows "user" and "assistant" as message roles.
Multi-Turn Conversations
Conversations alternate between user and assistant. You send the full history each time. Claude doesn’t remember past calls.
multi_turn = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 1024,
"system": "You are a data science tutor.",
"messages": [
{"role": "user", "content": "What is overfitting?"},
{"role": "assistant", "content": "Overfitting is when a model learns noise instead of the real pattern."},
{"role": "user", "content": "How do I detect it?"}
]
}
print(f"Turns: {len(multi_turn['messages'])}")
print(f"Follow-up: {multi_turn['messages'][-1]['content']}")
Turns: 3
Follow-up: How do I detect it?
Drop a message and Claude loses context. Costs grow each turn because the full history travels every time.
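Because the full history travels every time, a small helper can enforce the user/assistant alternation before you send. A sketch; `add_turn` is a hypothetical helper, not an API call:

```python
def add_turn(messages, role, content):
    """Append a turn, enforcing user/assistant alternation."""
    if messages and messages[-1]["role"] == role:
        raise ValueError(f"Two consecutive '{role}' turns — roles must alternate.")
    messages.append({"role": role, "content": content})
    return messages

history = []
add_turn(history, "user", "What is overfitting?")
add_turn(history, "assistant", "Overfitting is when a model learns noise.")
add_turn(history, "user", "How do I detect it?")
```

Sending `history` as the `messages` field then guarantees the API never sees two consecutive turns from the same role.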
How Do You Read the Response?
Claude returns JSON. The key field is content — an array of blocks. For plain text, you get one block with type: "text".
response = {
"id": "msg_01XFDUDYJgAACzvnptvVoYEL",
"type": "message",
"role": "assistant",
"model": "claude-sonnet-4-5-20250514",
"content": [
{
"type": "text",
"text": "Gradient descent adjusts parameters by moving toward lower loss."
}
],
"stop_reason": "end_turn",
"usage": {"input_tokens": 42, "output_tokens": 14}
}
answer = response["content"][0]["text"]
print(f"Answer: {answer}")
print(f"Tokens: {response['usage']['input_tokens']} in, {response['usage']['output_tokens']} out")
print(f"Stopped: {response['stop_reason']}")
Answer: Gradient descent adjusts parameters by moving toward lower loss.
Tokens: 42 in, 14 out
Stopped: end_turn
Why is content an array? Because Claude can return multiple blocks. Text, tool calls, and thinking all come back as separate blocks in the same array.
Three stop reasons to know:
| stop_reason | Meaning | What to do |
|---|---|---|
| end_turn | Claude finished | Show the answer |
| max_tokens | Hit your limit | Response cut off — raise max_tokens |
| tool_use | Wants to call a tool | Run it and feed the result back |
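The three stop reasons can be turned into a small dispatcher. A sketch; the action labels are our own naming:

```python
def handle_stop(response):
    """Route a parsed response dict based on its stop_reason."""
    reason = response["stop_reason"]
    if reason == "end_turn":
        # Finished normally: show the first text block
        return ("show", response["content"][0]["text"])
    if reason == "max_tokens":
        # Truncated: the caller should raise max_tokens and retry
        return ("retry_with_higher_limit", None)
    if reason == "tool_use":
        # Collect the tool_use blocks for the caller to execute
        return ("run_tools", [b for b in response["content"] if b["type"] == "tool_use"])
    return ("unknown", reason)
```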
How Does Streaming Work?
Without streaming, you wait for the full answer. The user sees nothing. With streaming, text arrives chunk by chunk.
Set "stream": true in the request. Claude sends Server-Sent Events (SSE) instead of one JSON blob.
stream_req = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 1024,
"stream": True,
"messages": [
{"role": "user", "content": "Explain backpropagation in 2 sentences."}
]
}
print(f"Streaming: {stream_req['stream']}")
print("Response: SSE events instead of a single JSON object")
Streaming: True
Response: SSE events instead of a single JSON object
What Do SSE Events Look Like?
Events arrive in order. message_start first. Then content_block_start. content_block_delta events carry text, one piece at a time. Join them to build the answer.
events = [
{"type": "message_start", "message": {"id": "msg_01...", "model": "claude-sonnet-4-5-20250514"}},
{"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Back"}},
{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "propagation"}},
{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " computes"}},
{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " the gradient"}},
{"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " of the loss."}},
{"type": "content_block_stop", "index": 0},
{"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 12}},
{"type": "message_stop"}
]
full_text = ""
for event in events:
if event["type"] == "content_block_delta":
chunk = event["delta"]["text"]
full_text += chunk
print(f" chunk: '{chunk}'")
print(f"\nFull: {full_text}")
chunk: 'Back'
chunk: 'propagation'
chunk: ' computes'
chunk: ' the gradient'
chunk: ' of the loss.'
Full: Backpropagation computes the gradient of the loss.
In a chat app, push each chunk to the screen. The user sees words appear — same feel as ChatGPT.
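Outside the browser, the same join happens over a live HTTP connection: the raw stream is lines of text, where `data:` lines carry JSON events. A sketch of the line-level parsing; pair it with `requests.post(..., stream=True)` and `resp.iter_lines()` as in the earlier HTTP example:

```python
import json

def sse_to_text(lines):
    """Join text_delta chunks from an iterable of raw SSE lines (bytes)."""
    pieces = []
    for line in lines:
        if not line.startswith(b"data: "):
            continue  # skip blank lines and "event: ..." framing lines
        event = json.loads(line[len(b"data: "):])
        if event.get("type") == "content_block_delta" and event["delta"].get("type") == "text_delta":
            pieces.append(event["delta"]["text"])
    return "".join(pieces)

# With requests: sse_to_text(resp.iter_lines())
# where resp = requests.post(url, headers=headers, json=body, stream=True)
```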
Exercise 1: Parse a Streaming Response
Write a function that takes SSE events and returns the full text, stop reason, and output token count.
What Is Tool Use?
Claude can’t browse the web or run code on its own. But you can give it tools — functions it asks you to call. Claude picks the right tool, tells you the arguments, and you send back the result.
Three steps every time:
1. Define tools in the request
2. Handle tool_use blocks in the response
3. Feed results back as tool_result messages
Step 1: Define Your Tools
Each tool has a name, a description, and an input_schema. Claude reads the description to decide when to use the tool.
weather_tool = {
"name": "get_weather",
"description": "Get current weather for a city. Use when the user asks about weather.",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature units. Defaults to celsius."
}
},
"required": ["city"]
}
}
print(f"Tool: {weather_tool['name']}")
required = weather_tool["input_schema"]["required"]
optional = [k for k in weather_tool["input_schema"]["properties"] if k not in required]
print(f"Required: {required}")
print(f"Optional: {optional}")
Tool: get_weather
Required: ['city']
Optional: ['units']
Write clear descriptions. “Get current weather for a city” works well. Something vague like “do weather stuff” might confuse Claude into skipping the tool.
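With the schema in place, the request body just gains a `tools` array. A sketch, with the `get_weather` schema abbreviated here:

```python
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Same endpoint, same headers — only the body gains a "tools" field
tool_request = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "tools": [weather_tool],
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
}
print(f"Tools in request: {[t['name'] for t in tool_request['tools']]}")
```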
Step 2: Handle the tool_use Response
When Claude wants a tool, the response changes. stop_reason is "tool_use". A tool_use block appears in content with the tool name, unique ID, and arguments.
tool_resp = {
"role": "assistant",
"content": [
{"type": "text", "text": "I'll check the weather in Paris."},
{
"type": "tool_use",
"id": "toolu_01A09q90qw90lq917835lhm",
"name": "get_weather",
"input": {"city": "Paris", "units": "celsius"}
}
],
"stop_reason": "tool_use"
}
for block in tool_resp["content"]:
if block["type"] == "text":
print(f"Claude: {block['text']}")
elif block["type"] == "tool_use":
print(f"Tool: {block['name']} | ID: {block['id']}")
print(f"Args: {json.dumps(block['input'])}")
Claude: I'll check the weather in Paris.
Tool: get_weather | ID: toolu_01A09q90qw90lq917835lhm
Args: {"city": "Paris", "units": "celsius"}
Step 3: Feed the Result Back
You run the function yourself. Then wrap the output in a tool_result message. The tool_use_id must match the id Claude gave you.
def get_weather(city, units="celsius"):
data = {
"Paris": {"temp": 22, "condition": "Partly cloudy", "humidity": 65},
"London": {"temp": 15, "condition": "Rainy", "humidity": 80},
}
return json.dumps(data.get(city, {"temp": 20, "condition": "Unknown"}))
tool_block = tool_resp["content"][1]
result = get_weather(**tool_block["input"])
print(f"Result: {result}")
msg = {
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_block["id"],
"content": result
}]
}
print(f"IDs match: {msg['content'][0]['tool_use_id'] == tool_block['id']}")
Result: {"temp": 22, "condition": "Partly cloudy", "humidity": 65}
IDs match: True
Claude gets this result and writes something like: “It’s 22 degrees and partly cloudy in Paris right now.”
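Concretely, the second call's `messages` array carries three turns: your original question, Claude's assistant turn echoed verbatim, and your `tool_result`. A sketch with the values from above:

```python
followup = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {   # Claude's previous turn, echoed back unchanged
            "role": "assistant",
            "content": [
                {"type": "text", "text": "I'll check the weather in Paris."},
                {"type": "tool_use", "id": "toolu_01A09q90qw90lq917835lhm",
                 "name": "get_weather", "input": {"city": "Paris", "units": "celsius"}},
            ],
        },
        {   # Your turn: the tool result, keyed by the same ID
            "role": "user",
            "content": [{"type": "tool_result",
                         "tool_use_id": "toolu_01A09q90qw90lq917835lhm",
                         "content": '{"temp": 22, "condition": "Partly cloudy"}'}],
        },
    ],
}
```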
Exercise 2: Build a Multi-Tool Request
Create a request body with two tools: a temperature converter and a BMI calculator. Use JSON Schema for both.
How Does Extended Thinking Work?
Some questions need more reasoning. Math, multi-step logic, code debugging — Claude does better when it thinks before answering.
Add a thinking object to the request. budget_tokens sets the max reasoning tokens. This is separate from max_tokens.
thinking_req = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 8000,
"thinking": {
"type": "enabled",
"budget_tokens": 5000
},
"messages": [
{"role": "user", "content": "What is 847 * 293? Show your work."}
]
}
print(f"Answer limit: {thinking_req['max_tokens']} tokens")
print(f"Thinking budget: {thinking_req['thinking']['budget_tokens']} tokens")
Answer limit: 8000 tokens
Thinking budget: 5000 tokens
What Comes Back?
The response has a thinking block, then a text block. The thinking block shows Claude’s reasoning. The text block is the clean answer.
thinking_resp = {
"content": [
{
"type": "thinking",
"thinking": "847 * 293\n= 847 * 300 - 847 * 7\n= 254100 - 5929\n= 248171",
"signature": "WaUjzkypQ2mUEVM36O..."
},
{
"type": "text",
"text": "847 x 293 = 248,171\n\n- 847 x 300 = 254,100\n- 847 x 7 = 5,929\n- 254,100 - 5,929 = 248,171"
}
]
}
for block in thinking_resp["content"]:
if block["type"] == "thinking":
print("--- THINKING ---")
print(block["thinking"])
elif block["type"] == "text":
print("\n--- ANSWER ---")
print(block["text"])
--- THINKING ---
847 * 293
= 847 * 300 - 847 * 7
= 254100 - 5929
= 248171
--- ANSWER ---
847 x 293 = 248,171
- 847 x 300 = 254,100
- 847 x 7 = 5,929
- 254,100 - 5,929 = 248,171
In a chat app, show the answer. Offer a “Show reasoning” toggle for the thinking block.
The signature field matters. If you continue the conversation, pass the thinking block back as-is. Claude verifies it hasn’t been changed.
Thinking Display Modes
You can control what the thinking block contains.
Summarized (default) — a short version of the reasoning. Still billed for the full thinking tokens.
Omitted — empty thinking field, just the signature. Faster, since no thinking text streams. Still billed.
omitted = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 8000,
"thinking": {
"type": "enabled",
"budget_tokens": 5000,
"display": "omitted"
},
"messages": [{"role": "user", "content": "What is 27 * 453?"}]
}
print(f"Display: {omitted['thinking']['display']}")
print("Result: empty thinking field, faster time-to-first-text")
Display: omitted
Result: empty thinking field, faster time-to-first-text
Thinking + Tool Use Together
Claude can think and then call tools. The key rule: pass thinking blocks back unchanged when you send tool results.
think_tool = {
"content": [
{"type": "thinking", "thinking": "User wants weather. I'll use get_weather.", "signature": "abc..."},
{"type": "tool_use", "id": "toolu_t01", "name": "get_weather", "input": {"city": "Tokyo"}}
],
"stop_reason": "tool_use"
}
print("Blocks to include in your assistant message:")
for b in think_tool["content"]:
print(f" {b['type']}")
print("Keep both — don't strip the thinking block!")
Blocks to include in your assistant message:
thinking
tool_use
Keep both — don't strip the thinking block!
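Concretely, the follow-up messages echo Claude's turn verbatim and answer with a `tool_result` keyed to the same ID. A sketch reusing the simulated blocks; the weather value is made up:

```python
# Claude's turn: thinking block plus tool_use block, kept together
think_tool_content = [
    {"type": "thinking", "thinking": "User wants weather. I'll use get_weather.",
     "signature": "abc..."},
    {"type": "tool_use", "id": "toolu_t01", "name": "get_weather",
     "input": {"city": "Tokyo"}},
]

next_messages = [
    {"role": "assistant", "content": think_tool_content},  # echoed unchanged
    {"role": "user", "content": [{
        "type": "tool_result",
        "tool_use_id": "toolu_t01",
        "content": '{"temp": 18, "condition": "Clear"}',
    }]},
]
```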
How Does Prompt Caching Cut Costs?
Every call re-reads your system prompt. If you send 2,000 tokens each time, you pay each time.
Caching stores that content on the server. Cached tokens cost 90% less. First call pays a small extra to create the cache. Every call after that saves.
Add cache_control to any content block you want cached.
cached = {
"model": "claude-sonnet-4-5-20250514",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": "You are a Python tutor. Use f-strings. Follow PEP 8.",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [{"role": "user", "content": "How do I reverse a list?"}]
}
print(f"System: {len(cached['system'][0]['text'])} chars")
print(f"Cache: {cached['system'][0]['cache_control']}")
print("Note: system is now an ARRAY, not a string")
System: 52 chars
Cache: {'type': 'ephemeral'}
Note: system is now an ARRAY, not a string
The system field changed from a string to an array. You need the block structure to attach cache_control.
How to Read Cache Metrics
The usage object shows what happened with the cache.
call_1 = {"cache_creation_input_tokens": 62, "cache_read_input_tokens": 0}
call_2 = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 62}
print("Call 1: cache created —", call_1["cache_creation_input_tokens"], "tokens stored")
print("Call 2: cache hit —", call_2["cache_read_input_tokens"], "tokens read (90% cheaper)")
Call 1: cache created — 62 tokens stored
Call 2: cache hit — 62 tokens read (90% cheaper)
For apps with big system prompts and many calls, the savings stack up fast.
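A rough back-of-the-envelope shows why. The sketch below assumes the 90% read discount from above plus an illustrative 25% surcharge on the cache-creating call; check Anthropic's pricing page for exact rates:

```python
def cached_cost(prompt_tokens, calls, write_premium=1.25, read_discount=0.10):
    """Relative input-token cost with caching vs. without (1.0 = uncached baseline)."""
    without = prompt_tokens * calls
    # First call writes the cache at a premium; the rest read it at a discount
    with_cache = prompt_tokens * write_premium + prompt_tokens * read_discount * (calls - 1)
    return with_cache / without

# A 2,000-token system prompt across 100 calls
ratio = cached_cost(2000, 100)
print(f"Caching costs {ratio:.1%} of the uncached input price")
```

A single call actually costs slightly more with caching; the break-even comes quickly after that.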
How Do You Handle Errors?
Things break. Rate limits, bad payloads, server hiccups. Here’s the error table.
| Status | Meaning | Action |
|---|---|---|
| 400 | Bad request | Fix your JSON body |
| 401 | Bad API key | Check your key |
| 429 | Rate limited | Wait and retry |
| 500 | Server error | Retry shortly |
| 529 | Overloaded | Wait longer |
The 429 is most common. Use exponential backoff: wait 1s, 2s, 4s, 8s.
def show_backoff(retries=4):
for attempt in range(retries):
wait = 2 ** attempt
print(f" Attempt {attempt + 1}: wait {wait}s")
print("Exponential backoff for 429:")
show_backoff()
Exponential backoff for 429:
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s
The anthropic SDK handles retries for you. With raw HTTP, you build it yourself.
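With raw HTTP, a small wrapper is enough. A sketch; `with_backoff` is a hypothetical helper, and the retry statuses and count are tuning choices:

```python
import time

def with_backoff(call, max_retries=4, retry_statuses=(429, 500, 529)):
    """Invoke call() (returning an object with .status_code), retrying transient errors."""
    for attempt in range(max_retries):
        resp = call()
        if resp.status_code not in retry_statuses:
            return resp
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
    return resp  # give up after the last attempt

# Usage: with_backoff(lambda: requests.post(url, headers=headers, json=body))
```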
How Do You Put It All Together?
Let’s build an assistant with two tools: a calculator and a concept lookup.
calculator_tool = {
"name": "calculate",
"description": "Evaluate a math expression.",
"input_schema": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Python math, e.g. '2**10'"}
},
"required": ["expression"]
}
}
lookup_tool = {
"name": "lookup_concept",
"description": "Look up a data science concept.",
"input_schema": {
"type": "object",
"properties": {
"concept": {"type": "string", "description": "Concept name"}
},
"required": ["concept"]
}
}
tools = [calculator_tool, lookup_tool]
print(f"Tools: {[t['name'] for t in tools]}")
Tools: ['calculate', 'lookup_concept']
The tool runner does the work and returns JSON.
def run_tool(name, args):
if name == "calculate":
try:
result = eval(args["expression"], {"__builtins__": {}})
return json.dumps({"result": result})
except Exception as e:
return json.dumps({"error": str(e)})
elif name == "lookup_concept":
db = {
"random forest": "Ensemble of decision trees on random subsets.",
"gradient descent": "Move parameters toward lower loss.",
"overfitting": "Model memorizes noise instead of patterns."
}
return json.dumps({"definition": db.get(args["concept"].lower(), "Not found.")})
return json.dumps({"error": f"Unknown tool: {name}"})
print(run_tool("calculate", {"expression": "2**10 + 42"}))
print(run_tool("lookup_concept", {"concept": "random forest"}))
{"result": 1066}
{"definition": "Ensemble of decision trees on random subsets."}
The Full Tool Loop
Check for tool_use blocks, run each tool, and build result messages.
def process_response(content):
texts, results = [], []
for block in content:
if block["type"] == "text":
texts.append(block["text"])
elif block["type"] == "tool_use":
r = run_tool(block["name"], block["input"])
results.append({
"type": "tool_result",
"tool_use_id": block["id"],
"content": r
})
return texts, results
sim = [
{"type": "text", "text": "Let me calculate that."},
{"type": "tool_use", "id": "toolu_001", "name": "calculate", "input": {"expression": "847 * 293"}}
]
texts, results = process_response(sim)
print(f"Claude: {texts[0]}")
print(f"Result: {results[0]['content']}")
Claude: Let me calculate that.
Result: {"result": 248171}
Keep sending results until stop_reason is "end_turn". Claude may call several tools before the final answer.
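That loop can be sketched end to end. Here `send` stands in for the real POST to `/v1/messages`, and `run_tool` is the runner defined above; both are passed in so the sketch stays testable:

```python
def tool_loop(messages, send, run_tool, max_rounds=3):
    """Alternate model calls and tool runs until stop_reason != "tool_use".

    send(messages) stands in for the POST to /v1/messages and returns the
    parsed response dict; run_tool(name, args) returns a JSON string.
    """
    for _ in range(max_rounds):
        resp = send(messages)
        if resp["stop_reason"] != "tool_use":
            return resp  # end_turn: final answer
        results = [
            {"type": "tool_result", "tool_use_id": b["id"],
             "content": run_tool(b["name"], b["input"])}
            for b in resp["content"] if b["type"] == "tool_use"
        ]
        # Echo Claude's turn verbatim, then answer with the tool results
        messages.append({"role": "assistant", "content": resp["content"]})
        messages.append({"role": "user", "content": results})
    return resp
```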
Exercise 3: Handle a Tool-Use Response
Given a response with a tool_use block, extract the tool name, run it, and build a tool_result with the correct ID.
Common Mistakes and Fixes
Mistake 1: Missing anthropic-version
bad = {"x-api-key": "key", "content-type": "application/json"}
print(f"Headers: {list(bad.keys())}")
print("API returns 400 — 'anthropic-version' required on every call")
Headers: ['x-api-key', 'content-type']
API returns 400 — 'anthropic-version' required on every call
Fix: Send all three headers, always.
Mistake 2: System Prompt in Messages
wrong = [{"role": "system", "content": "Be helpful."}, {"role": "user", "content": "Hi!"}]
print(f"Role: '{wrong[0]['role']}' — not valid in messages")
print("Fix: use the top-level 'system' field")
Role: 'system' — not valid in messages
Fix: use the top-level 'system' field
Mistake 3: Wrong tool_use_id
print("Claude's ID: toolu_abc123")
print("Your ID: toolu_WRONG")
print("Match: False — API returns 400")
print("Fix: copy the id from Claude's tool_use block")
Claude's ID: toolu_abc123
Your ID: toolu_WRONG
Match: False — API returns 400
Fix: copy the id from Claude's tool_use block
When Should You Skip Raw HTTP?
Raw HTTP gives full control. But it adds work.
Use the anthropic SDK for retries, typed objects, and SSE parsing.
Use LangChain when you chain calls, swap providers, or need memory.
Stick with raw HTTP for learning, debugging, minimal deps, or Pyodide.
Summary
Everything you learned:
- Messages — system prompt separate, history in the `messages` array
- Streaming — `"stream": true`, join `text_delta` chunks from SSE events
- Tool use — define tools, handle `tool_use` blocks, feed results with matching IDs
- Extended thinking — `budget_tokens` for reasoning, pass blocks back
- Prompt caching — `cache_control` saves 90% on repeated content
- Errors — exponential backoff for 429, check status codes
One endpoint. Same headers. Different bodies. That’s the pattern.
Practice Exercise
Build a function that handles up to 3 tool rounds with error handling for unknown tools.
Frequently Asked Questions
What’s the difference between max_tokens and budget_tokens?
max_tokens caps the visible answer. budget_tokens caps thinking. Think of it as: max_tokens = what the user sees, budget_tokens = Claude’s scratch paper. Budget must be smaller.
Can Claude call multiple tools at once?
Yes. Several tool_use blocks can appear in one response. Run them all, send results back together. Each needs its own tool_use_id.
How long does the prompt cache last?
About 5 minutes by default. Up to 1 hour for long tasks. Any change to cached content breaks the cache.
Does streaming work with tools and thinking?
Yes. Tools come as SSE tool_use blocks. Thinking comes as thinking_delta events. Same structure you learned.