Multimodal AI Tutorial: GPT-4o Vision & Audio API
Send images, charts, receipts, and audio to GPT-4o, Claude, and Gemini — using raw HTTP calls you control.
You’ve been calling LLMs with text. But these models don’t just read — they see and hear. GPT-4o describes photos. Claude reads handwritten receipts. Gemini transcribes audio. You trigger all of it from Python with one HTTP request.
The catch? Every provider structures the JSON differently. The image field name changes. The auth method differs. Get one field wrong and you’ll stare at a 400 error for an hour.
This article fixes that. You’ll build a working multimodal pipeline across all three providers using raw HTTP. No SDKs needed.
Here’s how the pieces connect.
You start with a raw file — a JPEG photo, a PNG chart, or a WAV audio clip. Python reads it and converts the bytes to a base64-encoded string.
That base64 string goes into a JSON payload alongside your text prompt. You POST the JSON to the provider’s API endpoint.
The model processes the image (or audio) and your text together, then returns a text response grounded in everything you sent. That’s multimodal AI: mixed inputs in, text out.
We’ll cover three providers, four real tasks (photo classification, chart analysis, receipt OCR, audio transcription), and a unified function that routes to any provider.
What Is Multimodal AI?
A multimodal model accepts more than one input type in a single request. Text plus image. Text plus audio. Text plus a PDF page. It processes all inputs together and returns text.
Why does this matter? Because most real-world data isn’t text.
Think about a doctor reading an X-ray. An analyst interpreting a bar chart. An accountant scanning a receipt. Before multimodal AI, each task needed a separate model — OCR for text extraction, a classifier for images, speech-to-text for audio. Each had its own pipeline.
Multimodal LLMs collapse that into one API call. You send the image and a prompt like “What does this X-ray show?” The model returns a detailed explanation. No preprocessing. No separate models.
KEY INSIGHT: Multimodal AI doesn’t mean the model generates images or audio. It means the model accepts images, audio, or documents as input alongside text — and reasons about them together. The output is still text.
Here’s what each provider supports right now:
| Capability | GPT-4o (OpenAI) | Claude 4 Sonnet (Anthropic) | Gemini 2.5 Flash (Google) |
|---|---|---|---|
| Image input | Yes | Yes | Yes |
| Audio input | No (use Whisper) | No | Yes (native) |
| PDF input | Yes (as images) | Yes (native) | Yes (native) |
| Video input | No | No | Yes (native) |
| Image formats | JPEG, PNG, WEBP, GIF | JPEG, PNG, GIF, WEBP | JPEG, PNG, WEBP, GIF |
| Max image size | 20 MB | 20 MB | 20 MB |
Quick check: If you wanted to send an audio file to an LLM for transcription, which provider supports it natively? (Answer: Gemini. The other two need separate audio APIs.)
Setting Up: Prerequisites and Base64 Encoding
Prerequisites
- Python version: 3.9+
- Required library: requests
- Install: pip install requests
- API keys: Get them at platform.openai.com/api-keys, console.anthropic.com/settings/keys, or aistudio.google.com/apikey. Store each in an environment variable.
- Test image: Any JPEG or PNG on your machine.
- Time to complete: 30-35 minutes
Every multimodal API call starts the same way. You read a file from disk and convert it to base64. Base64 turns binary data (raw image bytes) into a text string that fits inside JSON.
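Before building the full helper, the round trip is worth seeing in isolation. A two-line standard-library demo: three raw bytes become four ASCII characters, and decoding recovers the exact bytes.

```python
import base64

raw = b"\x00\xff\x10"                     # three arbitrary binary bytes
encoded = base64.b64encode(raw).decode()  # four ASCII characters, safe inside JSON
assert base64.b64decode(encoded) == raw   # decoding recovers the original bytes exactly
```

This is why base64 is the transport of choice here: JSON cannot carry raw bytes, but it can carry this string.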
Here’s a helper that handles any file type. It reads in binary mode, encodes to base64, and detects the MIME type from the file extension.
import requests
import base64
import os
import json
def encode_file_to_base64(file_path):
"""Read a file and return its base64 string plus MIME type."""
extension = file_path.rsplit(".", 1)[-1].lower()
mime_map = {
"jpg": "image/jpeg", "jpeg": "image/jpeg",
"png": "image/png", "gif": "image/gif",
"webp": "image/webp", "wav": "audio/wav",
"mp3": "audio/mp3", "pdf": "application/pdf"
}
mime_type = mime_map.get(extension, "application/octet-stream")
with open(file_path, "rb") as f:
encoded = base64.b64encode(f.read()).decode("utf-8")
return encoded, mime_type
Test it on any image file:
image_b64, mime = encode_file_to_base64("test_photo.jpg")
print(f"MIME type: {mime}")
print(f"Base64 length: {len(image_b64)} characters")
print(f"First 80 chars: {image_b64[:80]}...")
Your output will look something like this (exact values depend on the image):
MIME type: image/jpeg
Base64 length: 154832 characters
First 80 chars: /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQ...
That /9j/ prefix is the signature of a JPEG in base64. PNG files start with iVBOR. You’ll recognize these patterns when debugging.
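Those prefixes come straight from each format's magic bytes, so they're stable enough to use for quick debugging. Here's a tiny helper sketch (sniff_base64_type is our own name, not part of any API) that guesses the file type from the prefix:

```python
import base64

def sniff_base64_type(b64_string):
    """Guess the original file type from its base64 prefix."""
    if b64_string.startswith("/9j/"):   # JPEG magic bytes FF D8 FF encode to "/9j/"
        return "jpeg"
    if b64_string.startswith("iVBOR"):  # PNG magic bytes 89 50 4E 47 encode to "iVBOR"
        return "png"
    return "unknown"

# Verify against the actual magic bytes:
print(sniff_base64_type(base64.b64encode(b"\xff\xd8\xff\xe0").decode()))  # jpeg
```

Handy when a provider rejects an image and you want to confirm you encoded the file you think you did.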
Sending Images to GPT-4o (OpenAI Vision)
OpenAI’s Chat Completions endpoint handles vision natively. Same URL as text calls: https://api.openai.com/v1/chat/completions. Same messages array.
The only difference: instead of a plain text content string, you pass an array of content blocks. One block for text, one for the image. The image block uses type: "image_url" with a data URL that embeds the base64 string: data:<mime>;base64,<data>.
You also set a detail parameter: "low" (512px, fast, cheap), "high" (full resolution, more tokens), or "auto".
OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
def ask_gpt4o_vision(image_path, prompt, detail="auto"):
"""Send an image + text to GPT-4o. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
payload = {
"model": "gpt-4o",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime};base64,{img_b64}",
"detail": detail
}}
]
}]
}
headers = {
"Authorization": f"Bearer {OPENAI_KEY}",
"Content-Type": "application/json"
}
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers=headers, json=payload
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
Classify any image in one call:
result = ask_gpt4o_vision(
"test_photo.jpg",
"Describe this image in one sentence. Then classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
That’s image classification with zero training data. You described what you wanted in plain English and the model delivered.
NOTE: GPT-4o vision uses the same chat/completions endpoint as text calls. No separate vision URL needed. The model auto-detects image blocks in the messages array. Any text-only code you’ve written works with vision once you swap the content format.
Sending Images to Claude (Anthropic Vision)
Claude’s JSON structure differs in three ways. First, the image goes in an image content block with a source object — not an image_url with a data URL. Second, auth uses x-api-key header (not Bearer). Third, you must include an anthropic-version header.
The endpoint is https://api.anthropic.com/v1/messages. Notice the source block with type: "base64", separate media_type, and raw data (no data URL prefix).
ANTHROPIC_KEY = os.environ.get("ANTHROPIC_API_KEY")
def ask_claude_vision(image_path, prompt):
"""Send an image + text to Claude. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
payload = {
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": [
{"type": "image", "source": {
"type": "base64",
"media_type": mime,
"data": img_b64
}},
{"type": "text", "text": prompt}
]
}]
}
headers = {
"x-api-key": ANTHROPIC_KEY,
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
}
resp = requests.post(
"https://api.anthropic.com/v1/messages",
headers=headers, json=payload
)
resp.raise_for_status()
return resp.json()["content"][0]["text"]
The response path differs too: content[0].text instead of OpenAI’s choices[0].message.content. This trips people up when switching providers.
result = ask_claude_vision(
"test_photo.jpg",
"Describe this image in one sentence. Classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
Sending Images to Gemini (Google Vision)
Gemini uses a third format. The endpoint URL includes the model name: https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent. The API key goes as a query parameter (?key=...), not in the header.
Image data sits in an inline_data block with mime_type and data fields. The overall structure uses contents (not messages) with parts (not content). The nesting is the most common source of 400 errors with Gemini.
GEMINI_KEY = os.environ.get("GEMINI_API_KEY")
def ask_gemini_vision(image_path, prompt):
"""Send an image + text to Gemini. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
url = (
"https://generativelanguage.googleapis.com/v1beta/"
f"models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}"
)
payload = {
"contents": [{
"parts": [
{"inline_data": {"mime_type": mime, "data": img_b64}},
{"text": prompt}
]
}]
}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
result = ask_gemini_vision(
"test_photo.jpg",
"Describe this image in one sentence. Classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
Three providers, three JSON formats, three response paths. But the core idea is identical every time.
Comparing Vision Capabilities Across Providers
Before we dive into real tasks, here’s a quick-reference table. The payload formats are different, but the pattern is always the same: base64 image + text prompt in, text response out.
| Element | OpenAI (GPT-4o) | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| Endpoint | /v1/chat/completions | /v1/messages | /v1beta/models/{model}:generateContent |
| Auth | Bearer <key> header | x-api-key header + version | ?key=<key> query param |
| Image block | image_url.url: "data:...;base64,..." | image.source: {type: "base64", data: ...} | inline_data: {mime_type: ..., data: ...} |
| Response path | choices[0].message.content | content[0].text | candidates[0].content.parts[0].text |
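The response-path row is the one that bites hardest when switching providers. A small sketch that centralizes all three paths in one place (extract_text is our own helper name, not a library function):

```python
def extract_text(provider, response_json):
    """Return the model's text from each provider's response shape."""
    if provider == "openai":
        return response_json["choices"][0]["message"]["content"]
    if provider == "anthropic":
        return response_json["content"][0]["text"]
    if provider == "google":
        return response_json["candidates"][0]["content"]["parts"][0]["text"]
    raise ValueError(f"Unknown provider: {provider}")

# Minimal fake responses illustrating the three shapes:
openai_resp = {"choices": [{"message": {"content": "a photo"}}]}
claude_resp = {"content": [{"text": "a photo"}]}
print(extract_text("openai", openai_resp))  # a photo
print(extract_text("anthropic", claude_resp))  # a photo
```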
And here’s how they compare on real tasks:
| Task | Best Provider | Why |
|---|---|---|
| General image description | GPT-4o | Most detailed, natural phrasing |
| Chart/graph data extraction | Claude | Most accurate number reading |
| Receipt/document OCR | Claude | Best structured JSON from documents |
| Audio transcription | Gemini | Only provider with native audio |
| Speed | Gemini Flash | Consistently fastest |
| Cost per image call | Gemini Flash | ~$0.002 per image |
KEY INSIGHT: All three vision APIs use the same pattern: base64-encoded image bytes inside a JSON message alongside text. The difference is just field names and nesting. Once you understand one, the others take 10 minutes to adapt.
Predict the output: If you send a chart image to Claude using image_url format (OpenAI’s format) instead of image + source format, what happens? (Answer: a 400 error. Claude doesn’t recognize the image_url content type.)
Building a Unified Multimodal Function
Remembering three JSON formats gets tedious. Let’s build one function that handles all providers. You pass the provider name, file path, and prompt. The function picks the right format, sends the request, and extracts the response.
We’ll build it in two parts. First, the routing logic. Then we’ll add the actual HTTP calls.
def ask_multimodal(provider, file_path, prompt, **kwargs):
"""Unified multimodal call to any provider."""
file_b64, mime = encode_file_to_base64(file_path)
max_tokens = kwargs.get("max_tokens", 1024)
if provider == "openai":
return _call_openai(file_b64, mime, prompt, max_tokens, kwargs)
elif provider == "anthropic":
return _call_anthropic(file_b64, mime, prompt, max_tokens, kwargs)
elif provider == "google":
return _call_google(file_b64, mime, prompt, kwargs)
else:
raise ValueError(f"Unknown provider: {provider}")
Each internal function builds the provider-specific payload. Here’s OpenAI and Anthropic:
def _call_openai(b64, mime, prompt, max_tokens, kwargs):
detail = kwargs.get("detail", "auto")
payload = {
"model": kwargs.get("model", "gpt-4o"),
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime};base64,{b64}", "detail": detail
}}
]}]
}
headers = {"Authorization": f"Bearer {OPENAI_KEY}", "Content-Type": "application/json"}
resp = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
def _call_anthropic(b64, mime, prompt, max_tokens, kwargs):
payload = {
"model": kwargs.get("model", "claude-sonnet-4-20250514"),
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": mime, "data": b64}},
{"type": "text", "text": prompt}
]}]
}
headers = {"x-api-key": ANTHROPIC_KEY, "anthropic-version": "2023-06-01", "Content-Type": "application/json"}
resp = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["content"][0]["text"]
And Google:
def _call_google(b64, mime, prompt, kwargs):
model = kwargs.get("model", "gemini-2.5-flash")
url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={GEMINI_KEY}"
payload = {"contents": [{"parts": [
{"inline_data": {"mime_type": mime, "data": b64}},
{"text": prompt}
]}]}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
Now every call looks the same:
result_gpt = ask_multimodal("openai", "chart.png", "Analyze this chart")
result_claude = ask_multimodal("anthropic", "receipt.jpg", "Extract receipt data as JSON")
result_gemini = ask_multimodal("google", "audio.wav", "Transcribe this audio")
Exercise 1: Classify an Image with Two Providers (intermediate)
Use the ask_multimodal() function to classify an image with both OpenAI and Anthropic, and print both results. The prompt should ask: "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."
Starter code:
# Classify test_photo.jpg with two providers
prompt = "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."

# Call OpenAI
result_openai = ask_multimodal(  # finish this line

# Call Anthropic
result_anthropic = ask_multimodal(  # finish this line

print(f"OpenAI: {result_openai}")
print(f"Claude: {result_anthropic}")
Hints:
- The first argument is the provider name: "openai" or "anthropic".
- The full call looks like: ask_multimodal("openai", "test_photo.jpg", prompt)
Solution:
prompt = "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."
result_openai = ask_multimodal("openai", "test_photo.jpg", prompt)
result_anthropic = ask_multimodal("anthropic", "test_photo.jpg", prompt)
print(f"OpenAI: {result_openai}")
print(f"Claude: {result_anthropic}")
We call ask_multimodal() twice with different provider strings. The function handles the JSON differences internally; both calls return plain text.
Real Task 1: Chart and Diagram Analysis
Describing photos is a demo. The real power shows up with data. Let’s analyze a bar chart — the kind you’d export from matplotlib or a dashboard.
We’ll create a chart with matplotlib, save it to disk, then send it to GPT-4o for analysis. This way you can reproduce the exact result.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [420, 510, 480, 620]
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(quarters, revenue, color="steelblue")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue ($K)")
ax.set_title("2025 Quarterly Revenue")
ax.set_ylim(0, 700)
plt.tight_layout()
plt.savefig("quarterly_chart.png", dpi=100)
plt.close()
print("Chart saved to quarterly_chart.png")
Chart saved to quarterly_chart.png
We ask GPT-4o a specific business question. Not just “describe this” but “what’s the trend and which quarter grew the most?”
analysis = ask_gpt4o_vision(
"quarterly_chart.png",
"Analyze this bar chart. Answer:\n"
"1. What is the overall revenue trend?\n"
"2. Which quarter had the highest revenue?\n"
"3. Which quarter-over-quarter change was largest?\n"
"Give specific numbers from the chart."
)
print(analysis)
The model reads axis labels, estimates bar heights, and answers with numbers. It won’t be pixel-perfect — vision models approximate values. But for quick chart summarization, it’s surprisingly accurate.
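Because we generated the chart ourselves, we can sanity-check the model's reading against the true values. A rough hypothetical check (check_extracted_numbers is our own name, and this is a crude heuristic, not a rigorous evaluation):

```python
import re

def check_extracted_numbers(analysis_text, true_values, tolerance=0.05):
    """Fraction of known chart values that appear in the model's answer, within tolerance."""
    found = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", analysis_text)]
    hits = [v for v in true_values
            if any(abs(f - v) / v <= tolerance for f in found)]
    return len(hits) / len(true_values)

# Check a (made-up) answer against the revenue data used to draw the chart:
score = check_extracted_numbers("Revenue rose from 420 in Q1 to 620 in Q4.",
                                [420, 510, 480, 620])
print(score)  # 0.5: the answer mentioned 2 of the 4 values
```

For anything load-bearing, a check like this belongs in the pipeline rather than in your head.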
Real Task 2: Receipt OCR and Data Extraction
Receipt OCR is where multimodal AI truly beats traditional pipelines. Old-school OCR (Tesseract, AWS Textract) extracts raw text but can’t understand it. A multimodal LLM reads the receipt and structures the data in one step.
The key technique: tell the model the exact JSON schema you want. This works across all three providers.
RECEIPT_PROMPT = """Extract data from this receipt as JSON.
Return ONLY valid JSON with this structure:
{
"store_name": "...",
"date": "YYYY-MM-DD",
"items": [{"name": "...", "quantity": 1, "price": 0.00}],
"subtotal": 0.00,
"tax": 0.00,
"total": 0.00,
"payment_method": "..."
}
If any field is not visible, use null."""
receipt_data = ask_gpt4o_vision("sample_receipt.jpg", RECEIPT_PROMPT)
print(receipt_data)
The model returns itemized JSON. No regex. No template matching. Just “here’s an image, give me JSON.”
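One practical wrinkle: models sometimes wrap the JSON in markdown fences even when told not to, which breaks a naive json.loads(). A defensive parser sketch (parse_json_response is our own helper name):

```python
import json
import re

def parse_json_response(raw):
    """Strip optional markdown code fences, then parse the JSON inside."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)

# Works whether or not the model added fences:
print(parse_json_response('```json\n{"total": 42.10}\n```')["total"])  # 42.1
print(parse_json_response('{"total": 42.10}')["total"])  # 42.1
```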
Want to compare all three providers? Loop through them:
for name, func in [("GPT-4o", ask_gpt4o_vision),
("Claude", ask_claude_vision),
("Gemini", ask_gemini_vision)]:
print(f"\n{'='*40}")
print(f" {name} Receipt Extraction")
print(f"{'='*40}")
try:
print(func("sample_receipt.jpg", RECEIPT_PROMPT))
except Exception as e:
print(f"Error: {e}")
Claude tends to be the most precise with receipt details. GPT-4o is solid but sometimes rounds numbers. Gemini is fastest but occasionally adds items that aren’t on the receipt.
KEY INSIGHT: The real power isn’t raw OCR accuracy. It’s that you describe the output format in plain English, and the model both reads the document and structures the data. Traditional OCR gives raw text. Multimodal AI gives parsed, structured data.
Exercise 2: Build a Receipt Parser (intermediate)
Write a function parse_receipt(image_path, provider="openai") that sends a receipt image to the specified provider and returns a Python dictionary with keys store_name, total, and item_count. Call ask_multimodal() with a JSON-requesting prompt, then parse the result with json.loads().
Hints:
- Use ask_multimodal(provider, image_path, prompt) for the API call.
- Parse the response with data = json.loads(raw_response).
Solution:
import json

def parse_receipt(image_path, provider="openai"):
    prompt = """Extract from this receipt. Return ONLY valid JSON:
    {"store_name": "...", "total": 0.00, "item_count": 0}"""
    raw_response = ask_multimodal(provider, image_path, prompt)
    return json.loads(raw_response)

result = parse_receipt("sample_receipt.jpg")
print(f"Store: {result.get('store_name', 'unknown')}")
We pass the provider and image path to ask_multimodal() with a JSON-requesting prompt. The model returns a JSON string, and json.loads() converts it to a dict. The structured prompt is the key technique.
Real Task 3: Audio Transcription with Gemini
Here’s where Gemini pulls ahead. GPT-4o and Claude don’t accept audio in their chat endpoints. You’d need OpenAI’s Whisper API or a third-party service. Gemini handles audio natively — same endpoint, same structure.
Our encode_file_to_base64 function already handles audio files. Just pass a .wav or .mp3 path. Gemini accepts WAV, MP3, AIFF, AAC, OGG, and FLAC.
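The helper's mime_map only covers wav and mp3. If you work with the other formats Gemini accepts, you'd extend it along these lines (MIME strings as we understand them from Google's audio documentation; verify against the current docs):

```python
# Extension of the earlier mime_map for Gemini's supported audio formats.
audio_mime_map = {
    "wav": "audio/wav", "mp3": "audio/mp3",
    "aiff": "audio/aiff", "aac": "audio/aac",
    "ogg": "audio/ogg", "flac": "audio/flac",
}
print(audio_mime_map["flac"])  # audio/flac
```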
The audio goes into the same inline_data block as images — just with an audio MIME type.
def transcribe_audio_gemini(audio_path, prompt=None):
"""Transcribe audio using Gemini's native support."""
if prompt is None:
prompt = (
"Transcribe this audio exactly. "
"Include speaker labels if multiple speakers."
)
audio_b64, mime = encode_file_to_base64(audio_path)
url = (
"https://generativelanguage.googleapis.com/v1beta/"
f"models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}"
)
payload = {"contents": [{"parts": [
{"inline_data": {"mime_type": mime, "data": audio_b64}},
{"text": prompt}
]}]}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
Transcribe any audio clip:
transcript = transcribe_audio_gemini("meeting_clip.wav")
print(transcript)
You can go beyond transcription. Ask Gemini to summarize, extract action items, or detect the language:
summary = transcribe_audio_gemini(
"meeting_clip.wav",
"Listen to this audio. Provide:\n"
"1. A 2-sentence summary\n"
"2. Key action items\n"
"3. Overall tone (formal, casual, urgent)"
)
print(summary)
Sending Multiple Images in One Request
All three providers support multiple images per request. Add more image blocks to the content array. This enables image comparison and batch analysis.
Here’s how to compare two images with GPT-4o. We add two image_url blocks in the same message:
def compare_images_gpt4o(path_1, path_2, prompt):
"""Send two images to GPT-4o for comparison."""
img1_b64, mime1 = encode_file_to_base64(path_1)
img2_b64, mime2 = encode_file_to_base64(path_2)
payload = {
"model": "gpt-4o", "max_tokens": 1024,
"messages": [{"role": "user", "content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime1};base64,{img1_b64}", "detail": "auto"
}},
{"type": "image_url", "image_url": {
"url": f"data:{mime2};base64,{img2_b64}", "detail": "auto"
}}
]}]
}
headers = {"Authorization": f"Bearer {OPENAI_KEY}", "Content-Type": "application/json"}
resp = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
A practical use: comparing dashboard versions or design mockups.
comparison = compare_images_gpt4o(
"dashboard_v1.png", "dashboard_v2.png",
"Compare these dashboards. List specific differences."
)
print(comparison)
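Scaling past two images is just more blocks in the content array. A sketch that builds one OpenAI-style block per already-encoded image (build_image_blocks is our own name; it takes (base64, mime) pairs so it stays independent of file I/O):

```python
def build_image_blocks(encoded_images, detail="auto"):
    """encoded_images: list of (base64_string, mime_type) pairs -> content blocks."""
    return [
        {"type": "image_url",
         "image_url": {"url": f"data:{mime};base64,{b64}", "detail": detail}}
        for b64, mime in encoded_images
    ]

blocks = build_image_blocks([("AAAA", "image/png"), ("BBBB", "image/jpeg")])
print(blocks[0]["image_url"]["url"])  # data:image/png;base64,AAAA
```

Prepend a text block with your prompt and drop the list into the messages payload as before. Keep an eye on total payload size: every image adds its full base64 weight to the request.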
Error Handling for Multimodal Calls
Multimodal requests fail more often than text calls. Payloads are larger, timeout risks are higher, and each provider returns errors differently. Here’s a wrapper with retry logic and provider-specific error handling.
import time
def ask_multimodal_safe(provider, file_path, prompt, max_retries=3):
"""Multimodal call with retries and error handling."""
for attempt in range(max_retries):
try:
return ask_multimodal(provider, file_path, prompt)
except requests.exceptions.HTTPError as e:
status = e.response.status_code
if status == 429:
wait = 2 ** attempt
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
elif status == 413:
print(f"File too large for {provider}.")
return None
else:
print(f"HTTP {status}: {e.response.text[:200]}")
return None
except requests.exceptions.Timeout:
time.sleep(2 ** attempt)
continue
print(f"Failed after {max_retries} attempts.")
return None
Use it the same way:
result = ask_multimodal_safe("openai", "large_image.jpg", "Describe this")
if result:
print(result)
else:
print("Call failed. Check errors above.")
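Many 413 errors are avoidable with a preflight check before you even encode. Base64 inflates size by roughly a third (every 3 raw bytes become 4 characters), so a file near the limit will overshoot it. A sketch, assuming a 20 MB cap:

```python
import os

def within_size_limit(file_path, max_mb=20):
    """Estimate the base64 payload size and compare it to a provider limit."""
    raw = os.path.getsize(file_path)
    b64_size = (raw + 2) // 3 * 4  # base64: 3 raw bytes -> 4 output characters
    return b64_size <= max_mb * 1024 * 1024
```

Call it before ask_multimodal_safe() and resize the image if it returns False, rather than burning a failed API call to find out.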
When NOT to Use Multimodal LLMs
Multimodal AI is powerful but not always the right tool. Here are cases where you should reach for something else.
High-volume OCR (thousands of documents). LLM vision calls cost $0.01-0.05 per image. Processing 10,000 invoices costs $100-500. Traditional OCR services (AWS Textract, Google Document AI) cost a fraction. Use the LLM only when you need understanding, not just extraction.
Real-time video analysis. Vision APIs process one frame at a time. At 30 fps, you’d make 30 calls per second. That’s expensive and too slow. Use dedicated CV models (YOLO, MediaPipe) for real-time work.
Medical or legal decisions. Vision LLMs hallucinate. A chart value of 420 might be read as 410. Use these models as assistants that flag items for human review — never as the sole decision maker.
Pixel-precise measurements. Need exact coordinates or bounding boxes? Use a dedicated CV model. LLMs describe what they see in language. They don’t reliably return coordinates.
Common Mistakes and How to Fix Them
Mistake 1: No max_tokens on vision calls
Wrong:
payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": [...]}]}
Why: OpenAI vision calls sometimes default to very short outputs (as low as 16 tokens). Your response gets cut off.
Fix:
payload = {"model": "gpt-4o", "max_tokens": 1024, "messages": [...]}
Mistake 2: Data URL prefix in Claude’s data field
Wrong:
"data": f"data:{mime};base64,{img_b64}" # WRONG for Claude
Why: Claude expects raw base64 in data. Adding the data URL prefix causes a 400 error. OpenAI uses the prefix. Claude does not.
Fix:
"data": img_b64 # Raw base64 string, no prefix
Mistake 3: Header auth for Gemini
Wrong:
headers = {"Authorization": f"Bearer {GEMINI_KEY}"} # Nope
Why: Gemini uses a query parameter for auth, not a header. OpenAI and Claude use headers. Mixing them up causes 401 or 403 errors that are easy to misdiagnose.
Fix: Put the key in the URL: ?key={GEMINI_KEY}
Summary and Practice Exercise
You built a complete multimodal pipeline from scratch:
- Base64 encoding — converting images and audio to JSON-safe strings
- Three vision APIs — GPT-4o, Claude, Gemini with different JSON but the same core pattern
- Four real tasks — classification, chart analysis, receipt OCR, audio transcription
- A unified wrapper — one function for all providers
- Error handling — retries, rate limits, payload checks
The field moves fast. OpenAI is adding native audio. Anthropic is expanding PDF support. Google keeps adding modalities. But the base64-in-JSON pattern you learned here carries forward.
Frequently Asked Questions
Can I send a PDF directly to these APIs?
Claude and Gemini accept PDF files natively. Base64-encode the PDF and set MIME type to application/pdf. GPT-4o doesn’t support PDF directly — convert each page to an image first using pdf2image, then send the page images.
How much does a vision API call cost?
GPT-4o charges by resolution: ~$0.0075 for 512×512 “low detail,” up to ~$0.03 for high-resolution. Claude charges per image token (~$0.01-0.04 per image). Gemini 2.5 Flash is cheapest at ~$0.002 per image. Audio on Gemini costs ~$0.001 per 15 seconds.
Can these models read handwriting?
Yes, but accuracy depends on legibility. Printed text hits 95%+ across all three. Neat handwriting drops to 85-90%. Messy handwriting sits around 60-75%. Claude tends to perform best on handwritten documents.
URL vs base64 — which should I use?
With a URL, the API fetches the image from the internet. With base64, you embed it in the request. Use URLs for public images (faster, smaller payload). Use base64 for local or private files.
How do I handle images larger than 20 MB?
Resize before encoding. Use Pillow: img = Image.open("huge.jpg"); img.thumbnail((2048, 2048)); img.save("resized.jpg", quality=85). This gets most photos under 1 MB.