Claude API Tutorial: Messages, Tools & Streaming

Master the Claude API with raw HTTP — messages, streaming, tool use, extended thinking, and prompt caching with runnable Python code examples.

Written by Selva Prabhakaran | 24 min read

Send messages, stream responses, call tools, and enable extended thinking — all with raw HTTP requests to the Claude API.

This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

You want to add Claude to your Python project. You open the docs and find a Python SDK. But what’s actually happening under the hood? What headers does the request need? What does the JSON look like?

This tutorial skips the SDK. You’ll build raw HTTP request bodies so you see exactly what goes over the wire. By the end, you’ll know how to send messages, stream responses, use tools, enable extended thinking, and cache prompts — all with code that runs in your browser.

What Is the Claude Messages API?

The Messages API is Claude’s single endpoint. Text, tool calls, thinking, streaming — everything goes through one URL:

https://api.anthropic.com/v1/messages

Every request needs three headers. Here’s what they look like.

import micropip
await micropip.install('requests')

import json

headers = {
    "x-api-key": "sk-ant-your-key-here",
    "anthropic-version": "2023-06-01",
    "content-type": "application/json"
}

print("Required headers for every Claude API call:")
for key, value in headers.items():
    print(f"  {key}: {value}")
Output:
Required headers for every Claude API call:
  x-api-key: sk-ant-your-key-here
  anthropic-version: 2023-06-01
  content-type: application/json

The x-api-key authenticates you. anthropic-version pins behavior to a stable release — 2023-06-01 is current as of 2026. content-type is always JSON.

Getting Your API Key

Go to console.anthropic.com and create a key under Settings. Copy it right away — Anthropic won’t show it again.

Store it as an environment variable. Never put keys in code.

bash
# macOS / Linux
export ANTHROPIC_API_KEY="sk-ant-your-key-here"

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"
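In Python, read the key from the environment at request time rather than pasting it into the script. A minimal sketch — the placeholder fallback is only there so the snippet runs without setup:

```python
import os

# Read the key from the environment; never hardcode it.
# The fallback placeholder exists only so this sketch runs without setup.
api_key = os.environ.get("ANTHROPIC_API_KEY", "sk-ant-your-key-here")

headers = {
    "x-api-key": api_key,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
print(headers["x-api-key"][:7] + "...")  # never log the full key
```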

Which Model Should You Pick?

Claude comes in several sizes. Here’s a quick guide.

| Model | Best For | Speed | Cost |
|-------|----------|-------|------|
| Claude Opus 4 | Complex reasoning, research | Slowest | Highest |
| Claude Sonnet 4.5 | Balanced quality and speed | Medium | Medium |
| Claude Haiku 4.5 | Fast, simple tasks | Fastest | Lowest |

We use claude-sonnet-4-5-20250514 in this tutorial. It handles chat, tools, and thinking well.

Note: This tutorial runs in Pyodide, so it can’t call api.anthropic.com directly. We build the exact JSON you’d send with `requests.post()` or `curl`. Here’s what a real call looks like:

```python
# Real HTTP call (not runnable in Pyodide)
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers=headers,
    json=request_body,
)
data = resp.json()
```

<div class="callout callout-key-insight"><strong>Key Insight:</strong> <strong>One endpoint handles everything.</strong> Messages, streaming, tools, thinking, caching — the URL and headers stay the same. Only the request body changes.</div>


---

## How Does the Message Format Work? {#message-format}

Here's the biggest difference from OpenAI: the system prompt sits outside the messages array. It's a top-level field.

Why? The system prompt guides every response. It's not a turn — it's context. Keeping it separate makes that role clear.

```python
request_body = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor. Keep answers under 3 sentences.",
    "messages": [
        {"role": "user", "content": "What is gradient descent in one sentence?"}
    ]
}

print(json.dumps(request_body, indent=2))
```

Output:
{
  "model": "claude-sonnet-4-5-20250514",
  "max_tokens": 1024,
  "system": "You are a data science tutor. Keep answers under 3 sentences.",
  "messages": [
    {
      "role": "user",
      "content": "What is gradient descent in one sentence?"
    }
  ]
}

See how system is at the same level as model? In OpenAI, you’d put {"role": "system", "content": "..."} inside the messages list. Claude’s design is cleaner.

Quick check: What happens if you add {"role": "system"} to the messages array? A 400 error. Claude only allows "user" and "assistant" as message roles.
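You can catch that mistake client-side before it costs you a round trip. A small illustrative validator — this helper is not part of the API, just a local sanity check:

```python
VALID_ROLES = {"user", "assistant"}

def validate_messages(messages):
    # Raise locally instead of waiting for the API's 400
    for i, m in enumerate(messages):
        if m["role"] not in VALID_ROLES:
            raise ValueError(
                f"message {i}: role must be 'user' or 'assistant', got {m['role']!r}"
            )

try:
    validate_messages([{"role": "system", "content": "Be helpful."}])
except ValueError as e:
    print(e)  # message 0: role must be 'user' or 'assistant', got 'system'
```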

Multi-Turn Conversations

Conversations alternate between user and assistant. You send the full history each time. Claude doesn’t remember past calls.

multi_turn = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor.",
    "messages": [
        {"role": "user", "content": "What is overfitting?"},
        {"role": "assistant", "content": "Overfitting is when a model learns noise instead of the real pattern."},
        {"role": "user", "content": "How do I detect it?"}
    ]
}

print(f"Turns: {len(multi_turn['messages'])}")
print(f"Follow-up: {multi_turn['messages'][-1]['content']}")
Output:
Turns: 3
Follow-up: How do I detect it?

Drop a message and Claude loses context. Costs grow each turn because the full history travels every time.
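A tiny helper makes the pattern explicit — append both the assistant reply and the next question before the next call (an illustrative convention, not an SDK function):

```python
def add_turn(messages, assistant_text, next_user_text):
    # Claude is stateless: you carry the full history forward yourself
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]

history = [{"role": "user", "content": "What is overfitting?"}]
history = add_turn(history, "Memorizing noise instead of patterns.", "How do I detect it?")
print(len(history))  # 3
```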

Tip: Keep the system prompt short. Claude reads it on every call. “You are a Python tutor. Answer with code. Max 3 sentences.” — that’s 14 tokens. A full paragraph could be 80 tokens on every request.

How Do You Read the Response?

Claude returns JSON. The key field is content — an array of blocks. For plain text, you get one block with type: "text".

response = {
    "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
    "type": "message",
    "role": "assistant",
    "model": "claude-sonnet-4-5-20250514",
    "content": [
        {
            "type": "text",
            "text": "Gradient descent adjusts parameters by moving toward lower loss."
        }
    ],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 42, "output_tokens": 14}
}

answer = response["content"][0]["text"]
print(f"Answer: {answer}")
print(f"Tokens: {response['usage']['input_tokens']} in, {response['usage']['output_tokens']} out")
print(f"Stopped: {response['stop_reason']}")
Output:
Answer: Gradient descent adjusts parameters by moving toward lower loss.
Tokens: 42 in, 14 out
Stopped: end_turn

Why is content an array? Because Claude can return multiple blocks. Text, tool calls, and thinking all come back as separate blocks in the same array.

Three stop reasons to know:

| stop_reason | Meaning | What to do |
|-------------|---------|------------|
| end_turn | Claude finished | Show the answer |
| max_tokens | Hit your limit | Response cut off — raise max_tokens |
| tool_use | Wants to call a tool | Run it and feed the result back |
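Those three cases map directly to a dispatch function — a sketch of how a client might route on stop_reason:

```python
def handle_stop(response):
    reason = response["stop_reason"]
    if reason == "end_turn":
        return ("done", response["content"][0]["text"])
    if reason == "max_tokens":
        # Truncated: retry with a larger max_tokens
        return ("truncated", response["content"][0]["text"])
    if reason == "tool_use":
        # Run the requested tool, then continue the conversation
        return ("tool", None)
    raise ValueError(f"unexpected stop_reason: {reason}")

sample = {"stop_reason": "end_turn", "content": [{"type": "text", "text": "Done."}]}
print(handle_stop(sample))  # ('done', 'Done.')
```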

How Does Streaming Work?

Without streaming, you wait for the full answer. The user sees nothing. With streaming, text arrives chunk by chunk.

Set "stream": true in the request. Claude sends Server-Sent Events (SSE) instead of one JSON blob.

stream_req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "stream": True,
    "messages": [
        {"role": "user", "content": "Explain backpropagation in 2 sentences."}
    ]
}

print(f"Streaming: {stream_req['stream']}")
print("Response: SSE events instead of a single JSON object")
Output:
Streaming: True
Response: SSE events instead of a single JSON object

What Do SSE Events Look Like?

Events arrive in order. message_start first. Then content_block_start. content_block_delta events carry text, one piece at a time. Join them to build the answer.

events = [
    {"type": "message_start", "message": {"id": "msg_01...", "model": "claude-sonnet-4-5-20250514"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Back"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "propagation"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " computes"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " the gradient"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " of the loss."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 12}},
    {"type": "message_stop"}
]

full_text = ""
for event in events:
    if event["type"] == "content_block_delta":
        chunk = event["delta"]["text"]
        full_text += chunk
        print(f"  chunk: '{chunk}'")

print(f"\nFull: {full_text}")
Output:
  chunk: 'Back'
  chunk: 'propagation'
  chunk: ' computes'
  chunk: ' the gradient'
  chunk: ' of the loss.'

Full: Backpropagation computes the gradient of the loss.

In a chat app, push each chunk to the screen. The user sees words appear — same feel as ChatGPT.

Warning: Don’t skip `message_delta`. It holds the `stop_reason` and final token count. Without it, you can’t tell if Claude finished or got cut off.

Exercise 1: Parse a Streaming Response

Write a function that takes SSE events and returns the full text, stop reason, and output token count.

Hints

1. Text chunks: `content_block_delta` where `delta.type == "text_delta"`.
2. Stop reason: `message_delta` at `delta.stop_reason`.

Solution
def parse_stream(events):
    text = ""
    stop_reason = None
    output_tokens = 0
    for event in events:
        if event["type"] == "content_block_delta":
            if event["delta"]["type"] == "text_delta":
                text += event["delta"]["text"]
        elif event["type"] == "message_delta":
            stop_reason = event["delta"]["stop_reason"]
            output_tokens = event["usage"]["output_tokens"]
    return text, stop_reason, output_tokens

test = [
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello "}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "world!"}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 3}},
]
text, reason, tokens = parse_stream(test)
print(f"Text: '{text}' | Stop: {reason} | Tokens: {tokens}")
Output:
Text: 'Hello world!' | Stop: end_turn | Tokens: 3

What Is Tool Use?

Claude can’t browse the web or run code on its own. But you can give it tools — functions it asks you to call. Claude picks the right tool, tells you the arguments, and you send back the result.

Three steps every time:
1. Define tools in the request
2. Handle tool_use blocks in the response
3. Feed results back as tool_result messages

Step 1: Define Your Tools

Each tool has a name, a description, and an input_schema. Claude reads the description to decide when to use the tool.

weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city. Use when the user asks about weather.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units. Defaults to celsius."
            }
        },
        "required": ["city"]
    }
}

print(f"Tool: {weather_tool['name']}")
print(f"Required: {weather_tool['input_schema']['required']}")
print(f"Optional: ['units']")
Output:
Tool: get_weather
Required: ['city']
Optional: ['units']

Write clear descriptions. “Get current weather for a city” works well. Something vague like “do weather stuff” might confuse Claude into skipping the tool.

Step 2: Handle the tool_use Response

When Claude wants a tool, the response changes. stop_reason is "tool_use". A tool_use block appears in content with the tool name, unique ID, and arguments.

tool_resp = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "I'll check the weather in Paris."},
        {
            "type": "tool_use",
            "id": "toolu_01A09q90qw90lq917835lhm",
            "name": "get_weather",
            "input": {"city": "Paris", "units": "celsius"}
        }
    ],
    "stop_reason": "tool_use"
}

for block in tool_resp["content"]:
    if block["type"] == "text":
        print(f"Claude: {block['text']}")
    elif block["type"] == "tool_use":
        print(f"Tool: {block['name']} | ID: {block['id']}")
        print(f"Args: {json.dumps(block['input'])}")
Output:
Claude: I'll check the weather in Paris.
Tool: get_weather | ID: toolu_01A09q90qw90lq917835lhm
Args: {"city": "Paris", "units": "celsius"}

Step 3: Feed the Result Back

You run the function yourself. Then wrap the output in a tool_result message. The tool_use_id must match the id Claude gave you.

def get_weather(city, units="celsius"):
    data = {
        "Paris": {"temp": 22, "condition": "Partly cloudy", "humidity": 65},
        "London": {"temp": 15, "condition": "Rainy", "humidity": 80},
    }
    return json.dumps(data.get(city, {"temp": 20, "condition": "Unknown"}))

tool_block = tool_resp["content"][1]
result = get_weather(**tool_block["input"])
print(f"Result: {result}")

msg = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_block["id"],
        "content": result
    }]
}
print(f"IDs match: {msg['content'][0]['tool_use_id'] == tool_block['id']}")
Output:
Result: {"temp": 22, "condition": "Partly cloudy", "humidity": 65}
IDs match: True

Claude gets this result and writes something like: “It’s 22 degrees and partly cloudy in Paris right now.”
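Concretely, the follow-up request replays the whole exchange: the original question, Claude’s tool_use turn, and your tool_result. A self-contained sketch with the values from the example above:

```python
# Follow-up request after a tool call — values copied from the example above
followup = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "toolu_01A09q90qw90lq917835lhm",
             "name": "get_weather", "input": {"city": "Paris", "units": "celsius"}},
        ]},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "toolu_01A09q90qw90lq917835lhm",
             "content": '{"temp": 22, "condition": "Partly cloudy", "humidity": 65}'},
        ]},
    ],
}
print(followup["messages"][2]["content"][0]["tool_use_id"])
```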

Key Insight: Tool use is a conversation, not a function call. Claude says “please run this.” You run it and report back. You control what executes.

Exercise 2: Build a Multi-Tool Request

Create a request body with two tools: a temperature converter and a BMI calculator. Use JSON Schema for both.

Hints

1. Each tool needs `name`, `description`, and `input_schema`.
2. Use `enum` on the direction: `["c_to_f", "f_to_c"]`.

Solution
temp_tool = {
    "name": "convert_temp",
    "description": "Convert between Celsius and Fahrenheit.",
    "input_schema": {
        "type": "object",
        "properties": {
            "value": {"type": "number", "description": "Temperature value"},
            "direction": {"type": "string", "enum": ["c_to_f", "f_to_c"]}
        },
        "required": ["value", "direction"]
    }
}

bmi_tool = {
    "name": "calc_bmi",
    "description": "Calculate BMI from height (m) and weight (kg).",
    "input_schema": {
        "type": "object",
        "properties": {
            "height_m": {"type": "number"},
            "weight_kg": {"type": "number"}
        },
        "required": ["height_m", "weight_kg"]
    }
}

req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "tools": [temp_tool, bmi_tool],
    "messages": [{"role": "user", "content": "Convert 98.6F to Celsius"}]
}
print(f"Tools: {[t['name'] for t in req['tools']]}")
Output:
Tools: ['convert_temp', 'calc_bmi']

How Does Extended Thinking Work?

Some questions need more reasoning. Math, multi-step logic, code debugging — Claude does better when it thinks before answering.

Add a thinking object to the request. budget_tokens sets the max reasoning tokens. This is separate from max_tokens.

thinking_req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 8000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 5000
    },
    "messages": [
        {"role": "user", "content": "What is 847 * 293? Show your work."}
    ]
}

print(f"Answer limit: {thinking_req['max_tokens']} tokens")
print(f"Thinking budget: {thinking_req['thinking']['budget_tokens']} tokens")
Output:
Answer limit: 8000 tokens
Thinking budget: 5000 tokens

What Comes Back?

The response has a thinking block, then a text block. The thinking block shows Claude’s reasoning. The text block is the clean answer.

thinking_resp = {
    "content": [
        {
            "type": "thinking",
            "thinking": "847 * 293\n= 847 * 300 - 847 * 7\n= 254100 - 5929\n= 248171",
            "signature": "WaUjzkypQ2mUEVM36O..."
        },
        {
            "type": "text",
            "text": "847 x 293 = 248,171\n\n- 847 x 300 = 254,100\n- 847 x 7 = 5,929\n- 254,100 - 5,929 = 248,171"
        }
    ]
}

for block in thinking_resp["content"]:
    if block["type"] == "thinking":
        print("--- THINKING ---")
        print(block["thinking"])
    elif block["type"] == "text":
        print("\n--- ANSWER ---")
        print(block["text"])
Output:
--- THINKING ---
847 * 293
= 847 * 300 - 847 * 7
= 254100 - 5929
= 248171

--- ANSWER ---
847 x 293 = 248,171

- 847 x 300 = 254,100
- 847 x 7 = 5,929
- 254,100 - 5,929 = 248,171

In a chat app, show the answer. Offer a “Show reasoning” toggle for the thinking block.

The signature field matters. If you continue the conversation, pass the thinking block back as-is. Claude verifies it hasn’t been changed.
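When you continue the conversation, echo the whole content array back as the assistant turn — thinking block, signature and all. A sketch using a stub of a previous response:

```python
# Stub of a previous thinking response (signature truncated for the sketch)
thinking_resp = {
    "content": [
        {"type": "thinking", "thinking": "847 * 293 = ...", "signature": "WaUjzkypQ2mUEVM36O..."},
        {"type": "text", "text": "847 x 293 = 248,171"},
    ]
}

# Pass the blocks back unchanged — do not edit or strip the thinking block
assistant_turn = {"role": "assistant", "content": thinking_resp["content"]}
print(assistant_turn["content"][0]["type"])
```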

Thinking Display Modes

You can control what the thinking block contains.

Summarized (default) — a short version of the reasoning. Still billed for the full thinking tokens.

Omitted — empty thinking field, just the signature. Faster, since no thinking text streams. Still billed.

omitted = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 8000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 5000,
        "display": "omitted"
    },
    "messages": [{"role": "user", "content": "What is 27 * 453?"}]
}

print(f"Display: {omitted['thinking']['display']}")
print("Result: empty thinking field, faster time-to-first-text")
Output:
Display: omitted
Result: empty thinking field, faster time-to-first-text

Thinking + Tool Use Together

Claude can think and then call tools. The key rule: pass thinking blocks back unchanged when you send tool results.

think_tool = {
    "content": [
        {"type": "thinking", "thinking": "User wants weather. I'll use get_weather.", "signature": "abc..."},
        {"type": "tool_use", "id": "toolu_t01", "name": "get_weather", "input": {"city": "Tokyo"}}
    ],
    "stop_reason": "tool_use"
}

print("Blocks to include in your assistant message:")
for b in think_tool["content"]:
    print(f"  {b['type']}")
print("Keep both — don't strip the thinking block!")
Output:
Blocks to include in your assistant message:
  thinking
  tool_use
Keep both — don't strip the thinking block!
Warning: `budget_tokens` must be less than `max_tokens`. Otherwise the API returns an error. Set max_tokens = thinking budget + expected answer.

How Does Prompt Caching Cut Costs?

Every call re-reads your system prompt. If you send 2,000 tokens each time, you pay each time.

Caching stores that content on the server. Cached tokens cost 90% less. First call pays a small extra to create the cache. Every call after that saves.

Add cache_control to any content block you want cached.

cached = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a Python tutor. Use f-strings. Follow PEP 8.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    "messages": [{"role": "user", "content": "How do I reverse a list?"}]
}

print(f"System: {len(cached['system'][0]['text'])} chars")
print(f"Cache: {cached['system'][0]['cache_control']}")
print("Note: system is now an ARRAY, not a string")
Output:
System: 52 chars
Cache: {'type': 'ephemeral'}
Note: system is now an ARRAY, not a string

The system field changed from a string to an array. You need the block structure to attach cache_control.

How to Read Cache Metrics

The usage object shows what happened with the cache.

call_1 = {"cache_creation_input_tokens": 62, "cache_read_input_tokens": 0}
call_2 = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 62}

print("Call 1: cache created —", call_1["cache_creation_input_tokens"], "tokens stored")
print("Call 2: cache hit —", call_2["cache_read_input_tokens"], "tokens read (90% cheaper)")
Output:
Call 1: cache created — 62 tokens stored
Call 2: cache hit — 62 tokens read (90% cheaper)

For apps with big system prompts and many calls, the savings stack up fast.

Tip: Cache tool definitions too. Same tools every call? Mark them cacheable. Schemas can be hundreds of tokens.
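To do that, attach cache_control to the last tool in the array — the prefix up to that breakpoint gets cached. A sketch (tool definition abbreviated):

```python
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

tools = [weather_tool]
# Mark the last tool; everything up to this breakpoint is cached
tools[-1]["cache_control"] = {"type": "ephemeral"}
print(tools[-1]["cache_control"])
```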

How Do You Handle Errors?

Things break. Rate limits, bad payloads, server hiccups. Here’s the error table.

| Status | Meaning | Action |
|--------|---------|--------|
| 400 | Bad request | Fix your JSON body |
| 401 | Bad API key | Check your key |
| 429 | Rate limited | Wait and retry |
| 500 | Server error | Retry shortly |
| 529 | Overloaded | Wait longer |

The 429 is most common. Use exponential backoff: wait 1s, 2s, 4s, 8s.

def show_backoff(retries=4):
    for attempt in range(retries):
        wait = 2 ** attempt
        print(f"  Attempt {attempt + 1}: wait {wait}s")

print("Exponential backoff for 429:")
show_backoff()
Output:
Exponential backoff for 429:
  Attempt 1: wait 1s
  Attempt 2: wait 2s
  Attempt 3: wait 4s
  Attempt 4: wait 8s

The anthropic SDK handles retries for you. With raw HTTP, you build it yourself.
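A minimal retry wrapper might look like this — `send` is a hypothetical zero-argument callable that performs the HTTP call and returns an object with a `status_code` attribute:

```python
import time

RETRYABLE = {429, 500, 529}

def post_with_retry(send, max_retries=4, sleep=time.sleep):
    # sleep is injectable so tests don't actually wait
    resp = send()
    for attempt in range(max_retries):
        if resp.status_code not in RETRYABLE:
            return resp
        sleep(2 ** attempt)  # 1s, 2s, 4s, ...
        resp = send()
    return resp
```

In a real client, `send` would be something like `lambda: requests.post(url, headers=headers, json=body)`.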


How Do You Put It All Together?

Let’s build an assistant with two tools: a calculator and a concept lookup.

calculator_tool = {
    "name": "calculate",
    "description": "Evaluate a math expression.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "Python math, e.g. '2**10'"}
        },
        "required": ["expression"]
    }
}

lookup_tool = {
    "name": "lookup_concept",
    "description": "Look up a data science concept.",
    "input_schema": {
        "type": "object",
        "properties": {
            "concept": {"type": "string", "description": "Concept name"}
        },
        "required": ["concept"]
    }
}

tools = [calculator_tool, lookup_tool]
print(f"Tools: {[t['name'] for t in tools]}")
Output:
Tools: ['calculate', 'lookup_concept']

The tool runner does the work and returns JSON.

def run_tool(name, args):
    if name == "calculate":
        try:
            result = eval(args["expression"], {"__builtins__": {}})
            return json.dumps({"result": result})
        except Exception as e:
            return json.dumps({"error": str(e)})
    elif name == "lookup_concept":
        db = {
            "random forest": "Ensemble of decision trees on random subsets.",
            "gradient descent": "Move parameters toward lower loss.",
            "overfitting": "Model memorizes noise instead of patterns."
        }
        return json.dumps({"definition": db.get(args["concept"].lower(), "Not found.")})
    return json.dumps({"error": f"Unknown tool: {name}"})

print(run_tool("calculate", {"expression": "2**10 + 42"}))
print(run_tool("lookup_concept", {"concept": "random forest"}))
Output:
{"result": 1066}
{"definition": "Ensemble of decision trees on random subsets."}

The Full Tool Loop

Check for tool_use blocks, run each tool, and build result messages.

def process_response(content):
    texts, results = [], []
    for block in content:
        if block["type"] == "text":
            texts.append(block["text"])
        elif block["type"] == "tool_use":
            r = run_tool(block["name"], block["input"])
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": r
            })
    return texts, results

sim = [
    {"type": "text", "text": "Let me calculate that."},
    {"type": "tool_use", "id": "toolu_001", "name": "calculate", "input": {"expression": "847 * 293"}}
]

texts, results = process_response(sim)
print(f"Claude: {texts[0]}")
print(f"Result: {results[0]['content']}")
Output:
Claude: Let me calculate that.
Result: {"result": 248171}

Keep sending results until stop_reason is "end_turn". Claude may call several tools before the final answer.

Exercise 3: Handle a Tool-Use Response

Given a response with a tool_use block, extract the tool name, run it, and build a tool_result with the correct ID.

Hints

1. Find the block where `type == "tool_use"`.
2. Copy its `id` into `tool_use_id`.

Solution
resp = [
    {"type": "text", "text": "Looking that up."},
    {"type": "tool_use", "id": "toolu_ex3", "name": "lookup_concept", "input": {"concept": "overfitting"}}
]

tb = next(b for b in resp if b["type"] == "tool_use")
result = run_tool(tb["name"], tb["input"])
print(f"Tool: {tb['name']} | Result: {result}")

msg = {"role": "user", "content": [{"type": "tool_result", "tool_use_id": tb["id"], "content": result}]}
print(f"ID match: {msg['content'][0]['tool_use_id'] == tb['id']}")
Output:
Tool: lookup_concept | Result: {"definition": "Model memorizes noise instead of patterns."}
ID match: True

Common Mistakes and Fixes

Mistake 1: Missing anthropic-version

bad = {"x-api-key": "key", "content-type": "application/json"}
print(f"Headers: {list(bad.keys())}")
print("API returns 400 — 'anthropic-version' required on every call")
Output:
Headers: ['x-api-key', 'content-type']
API returns 400 — 'anthropic-version' required on every call

Fix: Send all three headers, always.

Mistake 2: System Prompt in Messages

wrong = [{"role": "system", "content": "Be helpful."}, {"role": "user", "content": "Hi!"}]
print(f"Role: '{wrong[0]['role']}' — not valid in messages")
print("Fix: use the top-level 'system' field")
Output:
Role: 'system' — not valid in messages
Fix: use the top-level 'system' field

Mistake 3: Wrong tool_use_id

print(f"Claude's ID: toolu_abc123")
print(f"Your ID:     toolu_WRONG")
print(f"Match: False — API returns 400")
print("Fix: copy the id from Claude's tool_use block")
Output:
Claude's ID: toolu_abc123
Your ID:     toolu_WRONG
Match: False — API returns 400
Fix: copy the id from Claude's tool_use block

When Should You Skip Raw HTTP?

Raw HTTP gives full control. But it adds work.

Use the anthropic SDK for retries, typed objects, and SSE parsing.

Use LangChain when you chain calls, swap providers, or need memory.

Stick with raw HTTP for learning, debugging, minimal deps, or Pyodide.


Summary

Everything you learned:

  • Messages — system prompt separate, history in the messages array
  • Streaming — `"stream": true`, join text_delta chunks from SSE events
  • Tool use — define, handle tool_use blocks, feed results with matching IDs
  • Extended thinking — `budget_tokens` for reasoning, pass blocks back unchanged
  • Prompt caching — `cache_control` saves 90% on repeated content
  • Errors — exponential backoff for 429, check status codes

One endpoint. Same headers. Different bodies. That’s the pattern.

Practice Exercise

Build a function that handles up to 3 tool rounds with error handling for unknown tools.

Solution
def assistant_loop(question, tool_defs, max_rounds=3):
    messages = [{"role": "user", "content": question}]
    known = {t["name"] for t in tool_defs}

    for rnd in range(max_rounds):
        if rnd == 0:
            resp = [{"type": "tool_use", "id": f"t_{rnd}", "name": "calculate", "input": {"expression": "42*7"}}]
            stop = "tool_use"
        else:
            resp = [{"type": "text", "text": "42 * 7 = 294."}]
            stop = "end_turn"

        if stop == "end_turn":
            return next(b["text"] for b in resp if b["type"] == "text")

        messages.append({"role": "assistant", "content": resp})
        results = []
        for b in resp:
            if b["type"] == "tool_use":
                r = run_tool(b["name"], b["input"]) if b["name"] in known else json.dumps({"error": "Unknown"})
                results.append({"type": "tool_result", "tool_use_id": b["id"], "content": r})
        messages.append({"role": "user", "content": results})
    return "Max rounds."

print(assistant_loop("42 * 7?", tools))
Output:
42 * 7 = 294.

Frequently Asked Questions

What’s the difference between max_tokens and budget_tokens?

max_tokens caps the visible answer. budget_tokens caps thinking. Think of it as: max_tokens = what the user sees, budget_tokens = Claude’s scratch paper. Budget must be smaller.
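A quick client-side guard for that constraint (an illustrative helper, not part of the API):

```python
def check_thinking_budget(req):
    # The API rejects requests where budget_tokens >= max_tokens
    budget = req.get("thinking", {}).get("budget_tokens", 0)
    if budget >= req["max_tokens"]:
        raise ValueError(
            f"budget_tokens ({budget}) must be < max_tokens ({req['max_tokens']})"
        )

check_thinking_budget({"max_tokens": 8000, "thinking": {"type": "enabled", "budget_tokens": 5000}})  # OK
```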

Can Claude call multiple tools at once?

Yes. Several tool_use blocks can appear in one response. Run them all, send results back together. Each needs its own tool_use_id.

How long does the prompt cache last?

About 5 minutes by default. Up to 1 hour for long tasks. Any change to cached content breaks the cache.

Does streaming work with tools and thinking?

Yes. Tools come as SSE tool_use blocks. Thinking comes as thinking_delta events. Same structure you learned.
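Streamed tool arguments arrive as input_json_delta events whose partial_json fragments you join and parse once the block stops — a sketch with hand-written events:

```python
import json

# Hand-written SSE events for a streamed tool call's arguments
events = [
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '{"city": '}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '"Paris"}'}},
    {"type": "content_block_stop"},
]

# Join the fragments, then parse the complete JSON at block stop
buf = "".join(
    e["delta"]["partial_json"]
    for e in events
    if e["type"] == "content_block_delta" and e["delta"]["type"] == "input_json_delta"
)
args = json.loads(buf)
print(args)  # {'city': 'Paris'}
```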


Complete Code

Click to expand the full script
# Anthropic Claude API — Complete Code
# Python 3.9+
import json

headers = {
    "x-api-key": "sk-ant-your-key-here",
    "anthropic-version": "2023-06-01",
    "content-type": "application/json"
}

request = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor.",
    "messages": [{"role": "user", "content": "What is gradient descent?"}]
}

def parse_stream(events):
    text, stop, tokens = "", None, 0
    for e in events:
        if e["type"] == "content_block_delta" and e["delta"]["type"] == "text_delta":
            text += e["delta"]["text"]
        elif e["type"] == "message_delta":
            stop = e["delta"]["stop_reason"]
            tokens = e["usage"]["output_tokens"]
    return text, stop, tokens

def run_tool(name, args):
    if name == "calculate":
        try:
            return json.dumps({"result": eval(args["expression"], {"__builtins__": {}})})
        except Exception as e:
            return json.dumps({"error": str(e)})
    elif name == "lookup_concept":
        db = {"random forest": "Ensemble of trees.", "overfitting": "Memorizing noise."}
        return json.dumps({"definition": db.get(args["concept"].lower(), "Not found.")})
    return json.dumps({"error": f"Unknown: {name}"})

def process_response(content):
    texts, results = [], []
    for block in content:
        if block["type"] == "text":
            texts.append(block["text"])
        elif block["type"] == "tool_use":
            r = run_tool(block["name"], block["input"])
            results.append({"type": "tool_result", "tool_use_id": block["id"], "content": r})
    return texts, results

print("Loaded. Test: " + run_tool("calculate", {"expression": "2**10"}))
