Claude API Tutorial: Messages, Tools & Streaming

Master the Claude API with raw HTTP — messages, streaming, tool use, extended thinking, and prompt caching with runnable Python code examples.

Written by Selva Prabhakaran | 24 min read

Send messages, stream responses, call tools, and enable extended thinking — all with raw HTTP requests to the Claude API.

This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

You want to add Claude to your Python project. You open the docs and find a Python SDK. But what’s actually happening under the hood? What headers does the request need? What does the JSON look like?

This tutorial skips the SDK. You’ll build raw HTTP request bodies so you see exactly what goes over the wire. By the end, you’ll know how to send messages, stream responses, use tools, enable extended thinking, and cache prompts — all with code that runs in your browser.

What Is the Claude Messages API?

The Messages API is Claude’s single endpoint. Text, tool calls, thinking, streaming — everything goes through one URL:

https://api.anthropic.com/v1/messages

Every request needs three headers. Here’s what they look like.

import micropip
await micropip.install('requests')

import json

headers = {
    "x-api-key": "sk-ant-your-key-here",
    "anthropic-version": "2023-06-01",
    "content-type": "application/json"
}

print("Required headers for every Claude API call:")
for key, value in headers.items():
    print(f"  {key}: {value}")
Output:
Required headers for every Claude API call:
  x-api-key: sk-ant-your-key-here
  anthropic-version: 2023-06-01
  content-type: application/json

The x-api-key authenticates you. anthropic-version pins behavior to a stable release — 2023-06-01 is current as of 2026. content-type is always JSON.

Getting Your API Key

Go to console.anthropic.com and create a key under Settings. Copy it right away — Anthropic won’t show it again.

Store it as an environment variable. Never put keys in code.

bash
# macOS / Linux
export ANTHROPIC_API_KEY="sk-ant-your-key-here"

# Windows PowerShell
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"
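In Python, read the key from the environment at request time rather than pasting it into the script. A minimal sketch — the placeholder fallback is only there so the snippet runs without setup:

```python
import os

# Read the key from the environment; never hardcode it.
# The fallback placeholder exists only so this sketch runs without setup.
api_key = os.environ.get("ANTHROPIC_API_KEY", "sk-ant-your-key-here")

headers = {
    "x-api-key": api_key,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
print(headers["x-api-key"][:7] + "...")  # never log the full key
```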

Which Model Should You Pick?

Claude comes in several sizes. Here’s a quick guide.

| Model | Best For | Speed | Cost |
|-------|----------|-------|------|
| Claude Opus 4 | Complex reasoning, research | Slowest | Highest |
| Claude Sonnet 4.5 | Balanced quality and speed | Medium | Medium |
| Claude Haiku 4.5 | Fast, simple tasks | Fastest | Lowest |

We use claude-sonnet-4-5-20250514 in this tutorial. It handles chat, tools, and thinking well.

Note: This tutorial runs in Pyodide, so it can’t call api.anthropic.com directly. We build the exact JSON you’d send with `requests.post()` or `curl`. Here’s what a real call looks like:

```python
# Real HTTP call (not runnable in Pyodide)
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers=headers,
    json=request_body,
)
data = resp.json()
```

<div class="callout callout-key-insight"><strong>Key Insight:</strong> <strong>One endpoint handles everything.</strong> Messages, streaming, tools, thinking, caching — the URL and headers stay the same. Only the request body changes.</div>


---

## How Does the Message Format Work? {#message-format}

Here's the biggest difference from OpenAI: the system prompt sits outside the messages array. It's a top-level field.

Why? The system prompt guides every response. It's not a turn — it's context. Keeping it separate makes that role clear.

```python
request_body = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor. Keep answers under 3 sentences.",
    "messages": [
        {"role": "user", "content": "What is gradient descent in one sentence?"}
    ]
}

print(json.dumps(request_body, indent=2))
```

Output:
{
  "model": "claude-sonnet-4-5-20250514",
  "max_tokens": 1024,
  "system": "You are a data science tutor. Keep answers under 3 sentences.",
  "messages": [
    {
      "role": "user",
      "content": "What is gradient descent in one sentence?"
    }
  ]
}

See how system is at the same level as model? In OpenAI, you’d put {"role": "system", "content": "..."} inside the messages list. Claude’s design is cleaner.

Quick check: What happens if you add {"role": "system"} to the messages array? A 400 error. Claude only allows "user" and "assistant" as message roles.
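You can catch that mistake client-side before it costs you a round trip. A small illustrative validator — this helper is not part of the API, just a local sanity check:

```python
VALID_ROLES = {"user", "assistant"}

def validate_messages(messages):
    # Raise locally instead of waiting for the API's 400
    for i, m in enumerate(messages):
        if m["role"] not in VALID_ROLES:
            raise ValueError(
                f"message {i}: role must be 'user' or 'assistant', got {m['role']!r}"
            )

try:
    validate_messages([{"role": "system", "content": "Be helpful."}])
except ValueError as e:
    print(e)  # message 0: role must be 'user' or 'assistant', got 'system'
```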

Multi-Turn Conversations

Conversations alternate between user and assistant. You send the full history each time. Claude doesn’t remember past calls.

multi_turn = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor.",
    "messages": [
        {"role": "user", "content": "What is overfitting?"},
        {"role": "assistant", "content": "Overfitting is when a model learns noise instead of the real pattern."},
        {"role": "user", "content": "How do I detect it?"}
    ]
}

print(f"Turns: {len(multi_turn['messages'])}")
print(f"Follow-up: {multi_turn['messages'][-1]['content']}")
Output:
Turns: 3
Follow-up: How do I detect it?

Drop a message and Claude loses context. Costs grow each turn because the full history travels every time.
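A tiny helper makes the pattern explicit — append both the assistant reply and the next question before the next call (an illustrative convention, not an SDK function):

```python
def add_turn(messages, assistant_text, next_user_text):
    # Claude is stateless: you carry the full history forward yourself
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]

history = [{"role": "user", "content": "What is overfitting?"}]
history = add_turn(history, "Memorizing noise instead of patterns.", "How do I detect it?")
print(len(history))  # 3
```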

Tip: Keep the system prompt short. Claude reads it on every call. “You are a Python tutor. Answer with code. Max 3 sentences.” — that’s 14 tokens. A full paragraph could be 80 tokens on every request.

How Do You Read the Response?

Claude returns JSON. The key field is content — an array of blocks. For plain text, you get one block with type: "text".

response = {
    "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
    "type": "message",
    "role": "assistant",
    "model": "claude-sonnet-4-5-20250514",
    "content": [
        {
            "type": "text",
            "text": "Gradient descent adjusts parameters by moving toward lower loss."
        }
    ],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 42, "output_tokens": 14}
}

answer = response["content"][0]["text"]
print(f"Answer: {answer}")
print(f"Tokens: {response['usage']['input_tokens']} in, {response['usage']['output_tokens']} out")
print(f"Stopped: {response['stop_reason']}")
Output:
Answer: Gradient descent adjusts parameters by moving toward lower loss.
Tokens: 42 in, 14 out
Stopped: end_turn

Why is content an array? Because Claude can return multiple blocks. Text, tool calls, and thinking all come back as separate blocks in the same array.

Three stop reasons to know:

| stop_reason | Meaning | What to do |
|-------------|---------|------------|
| end_turn | Claude finished | Show the answer |
| max_tokens | Hit your limit | Response cut off — raise max_tokens |
| tool_use | Wants to call a tool | Run it and feed the result back |
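Those three cases map directly to a dispatch function — a sketch of how a client might route on stop_reason:

```python
def handle_stop(response):
    reason = response["stop_reason"]
    if reason == "end_turn":
        return ("done", response["content"][0]["text"])
    if reason == "max_tokens":
        # Truncated: retry with a larger max_tokens
        return ("truncated", response["content"][0]["text"])
    if reason == "tool_use":
        # Run the requested tool, then continue the conversation
        return ("tool", None)
    raise ValueError(f"unexpected stop_reason: {reason}")

sample = {"stop_reason": "end_turn", "content": [{"type": "text", "text": "Done."}]}
print(handle_stop(sample))  # ('done', 'Done.')
```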

How Does Streaming Work?

Without streaming, you wait for the full answer. The user sees nothing. With streaming, text arrives chunk by chunk.

Set "stream": true in the request. Claude sends Server-Sent Events (SSE) instead of one JSON blob.

stream_req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "stream": True,
    "messages": [
        {"role": "user", "content": "Explain backpropagation in 2 sentences."}
    ]
}

print(f"Streaming: {stream_req['stream']}")
print("Response: SSE events instead of a single JSON object")
Output:
Streaming: True
Response: SSE events instead of a single JSON object

What Do SSE Events Look Like?

Events arrive in order. message_start first. Then content_block_start. content_block_delta events carry text, one piece at a time. Join them to build the answer.

events = [
    {"type": "message_start", "message": {"id": "msg_01...", "model": "claude-sonnet-4-5-20250514"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Back"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "propagation"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " computes"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " the gradient"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " of the loss."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 12}},
    {"type": "message_stop"}
]

full_text = ""
for event in events:
    if event["type"] == "content_block_delta":
        chunk = event["delta"]["text"]
        full_text += chunk
        print(f"  chunk: '{chunk}'")

print(f"\nFull: {full_text}")
Output:
  chunk: 'Back'
  chunk: 'propagation'
  chunk: ' computes'
  chunk: ' the gradient'
  chunk: ' of the loss.'

Full: Backpropagation computes the gradient of the loss.

In a chat app, push each chunk to the screen. The user sees words appear — same feel as ChatGPT.

Warning: Don’t skip `message_delta`. It holds the `stop_reason` and final token count. Without it, you can’t tell if Claude finished or got cut off.

Exercise 1: Parse a Streaming Response

Write a function that takes SSE events and returns the full text, stop reason, and output token count.

Hints

1. Text chunks: `content_block_delta` where `delta.type == "text_delta"`.
2. Stop reason: `message_delta` at `delta.stop_reason`.

Solution
def parse_stream(events):
    text = ""
    stop_reason = None
    output_tokens = 0
    for event in events:
        if event["type"] == "content_block_delta":
            if event["delta"]["type"] == "text_delta":
                text += event["delta"]["text"]
        elif event["type"] == "message_delta":
            stop_reason = event["delta"]["stop_reason"]
            output_tokens = event["usage"]["output_tokens"]
    return text, stop_reason, output_tokens

test = [
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello "}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "world!"}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 3}},
]
text, reason, tokens = parse_stream(test)
print(f"Text: '{text}' | Stop: {reason} | Tokens: {tokens}")
Output:
Text: 'Hello world!' | Stop: end_turn | Tokens: 3

What Is Tool Use?

Claude can’t browse the web or run code on its own. But you can give it tools — functions it asks you to call. Claude picks the right tool, tells you the arguments, and you send back the result.

Three steps every time:
1. Define tools in the request
2. Handle tool_use blocks in the response
3. Feed results back as tool_result messages

Step 1: Define Your Tools

Each tool has a name, a description, and an input_schema. Claude reads the description to decide when to use the tool.

weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city. Use when the user asks about weather.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units. Defaults to celsius."
            }
        },
        "required": ["city"]
    }
}

print(f"Tool: {weather_tool['name']}")
print(f"Required: {weather_tool['input_schema']['required']}")
print(f"Optional: ['units']")
Output:
Tool: get_weather
Required: ['city']
Optional: ['units']

Write clear descriptions. “Get current weather for a city” works well. Something vague like “do weather stuff” might confuse Claude into skipping the tool.

Step 2: Handle the tool_use Response

When Claude wants a tool, the response changes. stop_reason is "tool_use". A tool_use block appears in content with the tool name, unique ID, and arguments.

tool_resp = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "I'll check the weather in Paris."},
        {
            "type": "tool_use",
            "id": "toolu_01A09q90qw90lq917835lhm",
            "name": "get_weather",
            "input": {"city": "Paris", "units": "celsius"}
        }
    ],
    "stop_reason": "tool_use"
}

for block in tool_resp["content"]:
    if block["type"] == "text":
        print(f"Claude: {block['text']}")
    elif block["type"] == "tool_use":
        print(f"Tool: {block['name']} | ID: {block['id']}")
        print(f"Args: {json.dumps(block['input'])}")
Output:
Claude: I'll check the weather in Paris.
Tool: get_weather | ID: toolu_01A09q90qw90lq917835lhm
Args: {"city": "Paris", "units": "celsius"}

Step 3: Feed the Result Back

You run the function yourself. Then wrap the output in a tool_result message. The tool_use_id must match the id Claude gave you.

def get_weather(city, units="celsius"):
    data = {
        "Paris": {"temp": 22, "condition": "Partly cloudy", "humidity": 65},
        "London": {"temp": 15, "condition": "Rainy", "humidity": 80},
    }
    return json.dumps(data.get(city, {"temp": 20, "condition": "Unknown"}))

tool_block = tool_resp["content"][1]
result = get_weather(**tool_block["input"])
print(f"Result: {result}")

msg = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_block["id"],
        "content": result
    }]
}
print(f"IDs match: {msg['content'][0]['tool_use_id'] == tool_block['id']}")
Output:
Result: {"temp": 22, "condition": "Partly cloudy", "humidity": 65}
IDs match: True

Claude gets this result and writes something like: “It’s 22 degrees and partly cloudy in Paris right now.”
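Concretely, the follow-up request replays the whole exchange: the original question, Claude’s tool_use turn, and your tool_result. A self-contained sketch with the values from the example above:

```python
# Follow-up request after a tool call — values copied from the example above
followup = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "toolu_01A09q90qw90lq917835lhm",
             "name": "get_weather", "input": {"city": "Paris", "units": "celsius"}},
        ]},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": "toolu_01A09q90qw90lq917835lhm",
             "content": '{"temp": 22, "condition": "Partly cloudy", "humidity": 65}'},
        ]},
    ],
}
print(followup["messages"][2]["content"][0]["tool_use_id"])
```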

Key Insight: Tool use is a conversation, not a function call. Claude says “please run this.” You run it and report back. You control what executes.

Exercise 2: Build a Multi-Tool Request

Create a request body with two tools: a temperature converter and a BMI calculator. Use JSON Schema for both.

Hints

1. Each tool needs `name`, `description`, and `input_schema`.
2. Use `enum` on the direction: `["c_to_f", "f_to_c"]`.

Solution
temp_tool = {
    "name": "convert_temp",
    "description": "Convert between Celsius and Fahrenheit.",
    "input_schema": {
        "type": "object",
        "properties": {
            "value": {"type": "number", "description": "Temperature value"},
            "direction": {"type": "string", "enum": ["c_to_f", "f_to_c"]}
        },
        "required": ["value", "direction"]
    }
}

bmi_tool = {
    "name": "calc_bmi",
    "description": "Calculate BMI from height (m) and weight (kg).",
    "input_schema": {
        "type": "object",
        "properties": {
            "height_m": {"type": "number"},
            "weight_kg": {"type": "number"}
        },
        "required": ["height_m", "weight_kg"]
    }
}

req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "tools": [temp_tool, bmi_tool],
    "messages": [{"role": "user", "content": "Convert 98.6F to Celsius"}]
}
print(f"Tools: {[t['name'] for t in req['tools']]}")
Output:
Tools: ['convert_temp', 'calc_bmi']

How Does Extended Thinking Work?

Some questions need more reasoning. Math, multi-step logic, code debugging — Claude does better when it thinks before answering.

Add a thinking object to the request. budget_tokens sets the max reasoning tokens. This is separate from max_tokens.

thinking_req = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 8000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 5000
    },
    "messages": [
        {"role": "user", "content": "What is 847 * 293? Show your work."}
    ]
}

print(f"Answer limit: {thinking_req['max_tokens']} tokens")
print(f"Thinking budget: {thinking_req['thinking']['budget_tokens']} tokens")
Output:
Answer limit: 8000 tokens
Thinking budget: 5000 tokens

What Comes Back?

The response has a thinking block, then a text block. The thinking block shows Claude’s reasoning. The text block is the clean answer.

thinking_resp = {
    "content": [
        {
            "type": "thinking",
            "thinking": "847 * 293\n= 847 * 300 - 847 * 7\n= 254100 - 5929\n= 248171",
            "signature": "WaUjzkypQ2mUEVM36O..."
        },
        {
            "type": "text",
            "text": "847 x 293 = 248,171\n\n- 847 x 300 = 254,100\n- 847 x 7 = 5,929\n- 254,100 - 5,929 = 248,171"
        }
    ]
}

for block in thinking_resp["content"]:
    if block["type"] == "thinking":
        print("--- THINKING ---")
        print(block["thinking"])
    elif block["type"] == "text":
        print("\n--- ANSWER ---")
        print(block["text"])
Output:
--- THINKING ---
847 * 293
= 847 * 300 - 847 * 7
= 254100 - 5929
= 248171

--- ANSWER ---
847 x 293 = 248,171

- 847 x 300 = 254,100
- 847 x 7 = 5,929
- 254,100 - 5,929 = 248,171

In a chat app, show the answer. Offer a “Show reasoning” toggle for the thinking block.

The signature field matters. If you continue the conversation, pass the thinking block back as-is. Claude verifies it hasn’t been changed.
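When you continue the conversation, echo the whole content array back as the assistant turn — thinking block, signature and all. A sketch using a stub of a previous response:

```python
# Stub of a previous thinking response (signature truncated for the sketch)
thinking_resp = {
    "content": [
        {"type": "thinking", "thinking": "847 * 293 = ...", "signature": "WaUjzkypQ2mUEVM36O..."},
        {"type": "text", "text": "847 x 293 = 248,171"},
    ]
}

# Pass the blocks back unchanged — do not edit or strip the thinking block
assistant_turn = {"role": "assistant", "content": thinking_resp["content"]}
print(assistant_turn["content"][0]["type"])
```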

Thinking Display Modes

You can control what the thinking block contains.

Summarized (default) — a short version of the reasoning. Still billed for the full thinking tokens.

Omitted — empty thinking field, just the signature. Faster, since no thinking text streams. Still billed.

omitted = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 8000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 5000,
        "display": "omitted"
    },
    "messages": [{"role": "user", "content": "What is 27 * 453?"}]
}

print(f"Display: {omitted['thinking']['display']}")
print("Result: empty thinking field, faster time-to-first-text")
Output:
Display: omitted
Result: empty thinking field, faster time-to-first-text

Thinking + Tool Use Together

Claude can think and then call tools. The key rule: pass thinking blocks back unchanged when you send tool results.

think_tool = {
    "content": [
        {"type": "thinking", "thinking": "User wants weather. I'll use get_weather.", "signature": "abc..."},
        {"type": "tool_use", "id": "toolu_t01", "name": "get_weather", "input": {"city": "Tokyo"}}
    ],
    "stop_reason": "tool_use"
}

print("Blocks to include in your assistant message:")
for b in think_tool["content"]:
    print(f"  {b['type']}")
print("Keep both — don't strip the thinking block!")
Output:
Blocks to include in your assistant message:
  thinking
  tool_use
Keep both — don't strip the thinking block!
Warning: `budget_tokens` must be less than `max_tokens`. Otherwise the API returns an error. Set max_tokens = thinking budget + expected answer.

How Does Prompt Caching Cut Costs?

Every call re-reads your system prompt. If you send 2,000 tokens each time, you pay each time.

Caching stores that content on the server. Cached tokens cost 90% less. First call pays a small extra to create the cache. Every call after that saves.

Add cache_control to any content block you want cached.

cached = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a Python tutor. Use f-strings. Follow PEP 8.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    "messages": [{"role": "user", "content": "How do I reverse a list?"}]
}

print(f"System: {len(cached['system'][0]['text'])} chars")
print(f"Cache: {cached['system'][0]['cache_control']}")
print("Note: system is now an ARRAY, not a string")
Output:
System: 52 chars
Cache: {'type': 'ephemeral'}
Note: system is now an ARRAY, not a string

The system field changed from a string to an array. You need the block structure to attach cache_control.

How to Read Cache Metrics

The usage object shows what happened with the cache.

call_1 = {"cache_creation_input_tokens": 62, "cache_read_input_tokens": 0}
call_2 = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 62}

print("Call 1: cache created —", call_1["cache_creation_input_tokens"], "tokens stored")
print("Call 2: cache hit —", call_2["cache_read_input_tokens"], "tokens read (90% cheaper)")
Output:
Call 1: cache created — 62 tokens stored
Call 2: cache hit — 62 tokens read (90% cheaper)

For apps with big system prompts and many calls, the savings stack up fast.

Tip: Cache tool definitions too. Same tools every call? Mark them cacheable. Schemas can be hundreds of tokens.
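To do that, attach cache_control to the last tool in the array — the prefix up to that breakpoint gets cached. A sketch (tool definition abbreviated):

```python
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

tools = [weather_tool]
# Mark the last tool; everything up to this breakpoint is cached
tools[-1]["cache_control"] = {"type": "ephemeral"}
print(tools[-1]["cache_control"])
```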

How Do You Handle Errors?

Things break. Rate limits, bad payloads, server hiccups. Here’s the error table.

| Status | Meaning | Action |
|--------|---------|--------|
| 400 | Bad request | Fix your JSON body |
| 401 | Bad API key | Check your key |
| 429 | Rate limited | Wait and retry |
| 500 | Server error | Retry shortly |
| 529 | Overloaded | Wait longer |

The 429 is most common. Use exponential backoff: wait 1s, 2s, 4s, 8s.

def show_backoff(retries=4):
    for attempt in range(retries):
        wait = 2 ** attempt
        print(f"  Attempt {attempt + 1}: wait {wait}s")

print("Exponential backoff for 429:")
show_backoff()
Output:
Exponential backoff for 429:
  Attempt 1: wait 1s
  Attempt 2: wait 2s
  Attempt 3: wait 4s
  Attempt 4: wait 8s

The anthropic SDK handles retries for you. With raw HTTP, you build it yourself.
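A minimal retry wrapper might look like this — `send` is a hypothetical zero-argument callable that performs the HTTP call and returns an object with a `status_code` attribute:

```python
import time

RETRYABLE = {429, 500, 529}

def post_with_retry(send, max_retries=4, sleep=time.sleep):
    # sleep is injectable so tests don't actually wait
    resp = send()
    for attempt in range(max_retries):
        if resp.status_code not in RETRYABLE:
            return resp
        sleep(2 ** attempt)  # 1s, 2s, 4s, ...
        resp = send()
    return resp
```

In a real client, `send` would be something like `lambda: requests.post(url, headers=headers, json=body)`.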


How Do You Put It All Together?

Let’s build an assistant with two tools: a calculator and a concept lookup.

calculator_tool = {
    "name": "calculate",
    "description": "Evaluate a math expression.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "Python math, e.g. '2**10'"}
        },
        "required": ["expression"]
    }
}

lookup_tool = {
    "name": "lookup_concept",
    "description": "Look up a data science concept.",
    "input_schema": {
        "type": "object",
        "properties": {
            "concept": {"type": "string", "description": "Concept name"}
        },
        "required": ["concept"]
    }
}

tools = [calculator_tool, lookup_tool]
print(f"Tools: {[t['name'] for t in tools]}")
Output:
Tools: ['calculate', 'lookup_concept']

The tool runner does the work and returns JSON.

def run_tool(name, args):
    if name == "calculate":
        try:
            result = eval(args["expression"], {"__builtins__": {}})
            return json.dumps({"result": result})
        except Exception as e:
            return json.dumps({"error": str(e)})
    elif name == "lookup_concept":
        db = {
            "random forest": "Ensemble of decision trees on random subsets.",
            "gradient descent": "Move parameters toward lower loss.",
            "overfitting": "Model memorizes noise instead of patterns."
        }
        return json.dumps({"definition": db.get(args["concept"].lower(), "Not found.")})
    return json.dumps({"error": f"Unknown tool: {name}"})

print(run_tool("calculate", {"expression": "2**10 + 42"}))
print(run_tool("lookup_concept", {"concept": "random forest"}))
Output:
{"result": 1066}
{"definition": "Ensemble of decision trees on random subsets."}

The Full Tool Loop

Check for tool_use blocks, run each tool, and build result messages.

def process_response(content):
    texts, results = [], []
    for block in content:
        if block["type"] == "text":
            texts.append(block["text"])
        elif block["type"] == "tool_use":
            r = run_tool(block["name"], block["input"])
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": r
            })
    return texts, results

sim = [
    {"type": "text", "text": "Let me calculate that."},
    {"type": "tool_use", "id": "toolu_001", "name": "calculate", "input": {"expression": "847 * 293"}}
]

texts, results = process_response(sim)
print(f"Claude: {texts[0]}")
print(f"Result: {results[0]['content']}")
Output:
Claude: Let me calculate that.
Result: {"result": 248171}

Keep sending results until stop_reason is "end_turn". Claude may call several tools before the final answer.

Exercise 3: Handle a Tool-Use Response

Given a response with a tool_use block, extract the tool name, run it, and build a tool_result with the correct ID.

Hints

1. Find the block where `type == "tool_use"`.
2. Copy its `id` into `tool_use_id`.

Solution
resp = [
    {"type": "text", "text": "Looking that up."},
    {"type": "tool_use", "id": "toolu_ex3", "name": "lookup_concept", "input": {"concept": "overfitting"}}
]

tb = next(b for b in resp if b["type"] == "tool_use")
result = run_tool(tb["name"], tb["input"])
print(f"Tool: {tb['name']} | Result: {result}")

msg = {"role": "user", "content": [{"type": "tool_result", "tool_use_id": tb["id"], "content": result}]}
print(f"ID match: {msg['content'][0]['tool_use_id'] == tb['id']}")
Output:
Tool: lookup_concept | Result: {"definition": "Model memorizes noise instead of patterns."}
ID match: True

Common Mistakes and Fixes

Mistake 1: Missing anthropic-version

bad = {"x-api-key": "key", "content-type": "application/json"}
print(f"Headers: {list(bad.keys())}")
print("API returns 400 — 'anthropic-version' required on every call")
Output:
Headers: ['x-api-key', 'content-type']
API returns 400 — 'anthropic-version' required on every call

Fix: Send all three headers, always.

Mistake 2: System Prompt in Messages

wrong = [{"role": "system", "content": "Be helpful."}, {"role": "user", "content": "Hi!"}]
print(f"Role: '{wrong[0]['role']}' — not valid in messages")
print("Fix: use the top-level 'system' field")
Output:
Role: 'system' — not valid in messages
Fix: use the top-level 'system' field

Mistake 3: Wrong tool_use_id

print(f"Claude's ID: toolu_abc123")
print(f"Your ID:     toolu_WRONG")
print(f"Match: False — API returns 400")
print("Fix: copy the id from Claude's tool_use block")
Output:
Claude's ID: toolu_abc123
Your ID:     toolu_WRONG
Match: False — API returns 400
Fix: copy the id from Claude's tool_use block

When Should You Skip Raw HTTP?

Raw HTTP gives full control. But it adds work.

Use the anthropic SDK for retries, typed objects, and SSE parsing.

Use LangChain when you chain calls, swap providers, or need memory.

Stick with raw HTTP for learning, debugging, minimal deps, or Pyodide.


Summary

Everything you learned:

  • Messages — system prompt separate, history in the messages array
  • Streaming — `"stream": true`, join text_delta chunks from SSE events
  • Tool use — define, handle tool_use blocks, feed results with matching IDs
  • Extended thinking — `budget_tokens` for reasoning, pass blocks back unchanged
  • Prompt caching — `cache_control` saves 90% on repeated content
  • Errors — exponential backoff for 429, check status codes

One endpoint. Same headers. Different bodies. That’s the pattern.

Practice Exercise

Build a function that handles up to 3 tool rounds with error handling for unknown tools.

Solution
def assistant_loop(question, tool_defs, max_rounds=3):
    messages = [{"role": "user", "content": question}]
    known = {t["name"] for t in tool_defs}

    for rnd in range(max_rounds):
        if rnd == 0:
            resp = [{"type": "tool_use", "id": f"t_{rnd}", "name": "calculate", "input": {"expression": "42*7"}}]
            stop = "tool_use"
        else:
            resp = [{"type": "text", "text": "42 * 7 = 294."}]
            stop = "end_turn"

        if stop == "end_turn":
            return next(b["text"] for b in resp if b["type"] == "text")

        messages.append({"role": "assistant", "content": resp})
        results = []
        for b in resp:
            if b["type"] == "tool_use":
                r = run_tool(b["name"], b["input"]) if b["name"] in known else json.dumps({"error": "Unknown"})
                results.append({"type": "tool_result", "tool_use_id": b["id"], "content": r})
        messages.append({"role": "user", "content": results})
    return "Max rounds."

print(assistant_loop("42 * 7?", tools))
Output:
42 * 7 = 294.

Frequently Asked Questions

What’s the difference between max_tokens and budget_tokens?

max_tokens caps the visible answer. budget_tokens caps thinking. Think of it as: max_tokens = what the user sees, budget_tokens = Claude’s scratch paper. Budget must be smaller.
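A quick client-side guard for that constraint (an illustrative helper, not part of the API):

```python
def check_thinking_budget(req):
    # The API rejects requests where budget_tokens >= max_tokens
    budget = req.get("thinking", {}).get("budget_tokens", 0)
    if budget >= req["max_tokens"]:
        raise ValueError(
            f"budget_tokens ({budget}) must be < max_tokens ({req['max_tokens']})"
        )

check_thinking_budget({"max_tokens": 8000, "thinking": {"type": "enabled", "budget_tokens": 5000}})  # OK
```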

Can Claude call multiple tools at once?

Yes. Several tool_use blocks can appear in one response. Run them all, send results back together. Each needs its own tool_use_id.

How long does the prompt cache last?

About 5 minutes by default. Up to 1 hour for long tasks. Any change to cached content breaks the cache.

Does streaming work with tools and thinking?

Yes. Tools come as SSE tool_use blocks. Thinking comes as thinking_delta events. Same structure you learned.
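Streamed tool arguments arrive as input_json_delta events whose partial_json fragments you join and parse once the block stops — a sketch with hand-written events:

```python
import json

# Hand-written SSE events for a streamed tool call's arguments
events = [
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '{"city": '}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '"Paris"}'}},
    {"type": "content_block_stop"},
]

# Join the fragments, then parse the complete JSON at block stop
buf = "".join(
    e["delta"]["partial_json"]
    for e in events
    if e["type"] == "content_block_delta" and e["delta"]["type"] == "input_json_delta"
)
args = json.loads(buf)
print(args)  # {'city': 'Paris'}
```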


Complete Code

Click to expand the full script
# Anthropic Claude API — Complete Code
# Python 3.9+
import json

headers = {
    "x-api-key": "sk-ant-your-key-here",
    "anthropic-version": "2023-06-01",
    "content-type": "application/json"
}

request = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 1024,
    "system": "You are a data science tutor.",
    "messages": [{"role": "user", "content": "What is gradient descent?"}]
}

def parse_stream(events):
    text, stop, tokens = "", None, 0
    for e in events:
        if e["type"] == "content_block_delta" and e["delta"]["type"] == "text_delta":
            text += e["delta"]["text"]
        elif e["type"] == "message_delta":
            stop = e["delta"]["stop_reason"]
            tokens = e["usage"]["output_tokens"]
    return text, stop, tokens

def run_tool(name, args):
    if name == "calculate":
        try:
            return json.dumps({"result": eval(args["expression"], {"__builtins__": {}})})
        except Exception as e:
            return json.dumps({"error": str(e)})
    elif name == "lookup_concept":
        db = {"random forest": "Ensemble of trees.", "overfitting": "Memorizing noise."}
        return json.dumps({"definition": db.get(args["concept"].lower(), "Not found.")})
    return json.dumps({"error": f"Unknown: {name}"})

def process_response(content):
    texts, results = [], []
    for block in content:
        if block["type"] == "text":
            texts.append(block["text"])
        elif block["type"] == "tool_use":
            r = run_tool(block["name"], block["input"])
            results.append({"type": "tool_result", "tool_use_id": block["id"], "content": r})
    return texts, results

print("Loaded. Test: " + run_tool("calculate", {"expression": "2**10"}))
