Multimodal AI Tutorial: GPT-4o Vision & Audio API
Send images, charts, receipts, and audio to GPT-4o, Claude, and Gemini — using raw HTTP calls you control.
You’ve been calling LLMs with text. But these models don’t just read — they see and hear. GPT-4o describes photos. Claude reads handwritten receipts. Gemini transcribes audio. You trigger all of it from Python with one HTTP request.
The catch? Every provider structures the JSON differently. The image field name changes. The auth method differs. Get one field wrong and you’ll stare at a 400 error for an hour.
This article fixes that. You’ll build a working multimodal pipeline across all three providers using raw HTTP. No SDKs needed.
Here’s how the pieces connect.
You start with a raw file — a JPEG photo, a PNG chart, or a WAV audio clip. Python reads it and converts the bytes to a base64-encoded string.
That base64 string goes into a JSON payload alongside your text prompt. You POST the JSON to the provider’s API endpoint.
The model processes the image (or audio) and your text together, then returns a text response grounded in everything you sent. That’s multimodal AI: mixed inputs in, text out.
We’ll cover three providers, four real tasks (photo classification, chart analysis, receipt OCR, audio transcription), and a unified function that routes to any provider.
What Is Multimodal AI?
A multimodal model accepts more than one input type in a single request. Text plus image. Text plus audio. Text plus a PDF page. It processes all inputs together and returns text.
Why does this matter? Because most real-world data isn’t text.
Think about a doctor reading an X-ray. An analyst interpreting a bar chart. An accountant scanning a receipt. Before multimodal AI, each task needed a separate model — OCR for text extraction, a classifier for images, speech-to-text for audio. Each had its own pipeline.
Multimodal LLMs collapse that into one API call. You send the image and a prompt like “What does this X-ray show?” The model returns a detailed explanation. No preprocessing. No separate models.
KEY INSIGHT: Multimodal AI doesn’t mean the model generates images or audio. It means the model accepts images, audio, or documents as input alongside text — and reasons about them together. The output is still text.
Here’s what each provider supports right now:
| Capability | GPT-4o (OpenAI) | Claude 4 Sonnet (Anthropic) | Gemini 2.5 Flash (Google) |
|---|---|---|---|
| Image input | Yes | Yes | Yes |
| Audio input | No (use Whisper) | No | Yes (native) |
| PDF input | Yes (as images) | Yes (native) | Yes (native) |
| Video input | No | No | Yes (native) |
| Image formats | JPEG, PNG, WEBP, GIF | JPEG, PNG, GIF, WEBP | JPEG, PNG, WEBP, GIF |
| Max image size | 20 MB | 20 MB | 20 MB |
Quick check: If you wanted to send an audio file to an LLM for transcription, which provider supports it natively? (Answer: Gemini. The other two need separate audio APIs.)
Setting Up: Prerequisites and Base64 Encoding
Prerequisites
- Python version: 3.9+
- Required library: requests
- Install: pip install requests
- API keys: Get them at platform.openai.com/api-keys, console.anthropic.com/settings/keys, or aistudio.google.com/apikey. Store each in an environment variable.
- Test image: Any JPEG or PNG on your machine.
- Time to complete: 30-35 minutes
Every multimodal API call starts the same way. You read a file from disk and convert it to base64. Base64 turns binary data (raw image bytes) into a text string that fits inside JSON.
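Before building the full helper, the round trip is worth seeing in isolation. A two-line standard-library demo: three raw bytes become four ASCII characters, and decoding recovers the exact bytes.

```python
import base64

raw = b"\x00\xff\x10"                     # three arbitrary binary bytes
encoded = base64.b64encode(raw).decode()  # four ASCII characters, safe inside JSON
assert base64.b64decode(encoded) == raw   # decoding recovers the original bytes exactly
```

This is why base64 is the transport of choice here: JSON cannot carry raw bytes, but it can carry this string.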
Here’s a helper that handles any file type. It reads in binary mode, encodes to base64, and detects the MIME type from the file extension.
import requests
import base64
import os
import json
def encode_file_to_base64(file_path):
"""Read a file and return its base64 string plus MIME type."""
extension = file_path.rsplit(".", 1)[-1].lower()
mime_map = {
"jpg": "image/jpeg", "jpeg": "image/jpeg",
"png": "image/png", "gif": "image/gif",
"webp": "image/webp", "wav": "audio/wav",
"mp3": "audio/mp3", "pdf": "application/pdf"
}
mime_type = mime_map.get(extension, "application/octet-stream")
with open(file_path, "rb") as f:
encoded = base64.b64encode(f.read()).decode("utf-8")
return encoded, mime_type
Test it on any image file:
image_b64, mime = encode_file_to_base64("test_photo.jpg")
print(f"MIME type: {mime}")
print(f"Base64 length: {len(image_b64)} characters")
print(f"First 80 chars: {image_b64[:80]}...")
Your output will look something like this (exact values depend on the image):
MIME type: image/jpeg
Base64 length: 154832 characters
First 80 chars: /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQ...
That /9j/ prefix is the signature of a JPEG in base64. PNG files start with iVBOR. You’ll recognize these patterns when debugging.
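Those prefixes come straight from each format's magic bytes, so they're stable enough to use for quick debugging. Here's a tiny helper sketch (sniff_base64_type is our own name, not part of any API) that guesses the file type from the prefix:

```python
import base64

def sniff_base64_type(b64_string):
    """Guess the original file type from its base64 prefix."""
    if b64_string.startswith("/9j/"):   # JPEG magic bytes FF D8 FF encode to "/9j/"
        return "jpeg"
    if b64_string.startswith("iVBOR"):  # PNG magic bytes 89 50 4E 47 encode to "iVBOR"
        return "png"
    return "unknown"

# Verify against the actual magic bytes:
print(sniff_base64_type(base64.b64encode(b"\xff\xd8\xff\xe0").decode()))  # jpeg
```

Handy when a provider rejects an image and you want to confirm you encoded the file you think you did.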
Sending Images to GPT-4o (OpenAI Vision)
OpenAI’s Chat Completions endpoint handles vision natively. Same URL as text calls: https://api.openai.com/v1/chat/completions. Same messages array.
The only difference: instead of a plain text content string, you pass an array of content blocks. One block for text, one for the image. The image block uses type: "image_url" with a data URL that embeds the base64 string: data:<mime>;base64,<data>.
You also set a detail parameter: "low" (512px, fast, cheap), "high" (full resolution, more tokens), or "auto".
OPENAI_KEY = os.environ.get("OPENAI_API_KEY")
def ask_gpt4o_vision(image_path, prompt, detail="auto"):
"""Send an image + text to GPT-4o. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
payload = {
"model": "gpt-4o",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime};base64,{img_b64}",
"detail": detail
}}
]
}]
}
headers = {
"Authorization": f"Bearer {OPENAI_KEY}",
"Content-Type": "application/json"
}
resp = requests.post(
"https://api.openai.com/v1/chat/completions",
headers=headers, json=payload
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
Classify any image in one call:
result = ask_gpt4o_vision(
"test_photo.jpg",
"Describe this image in one sentence. Then classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
That’s image classification with zero training data. You described what you wanted in plain English and the model delivered.
NOTE: GPT-4o vision uses the same chat/completions endpoint as text calls. No separate vision URL needed. The model auto-detects image blocks in the messages array. Any text-only code you’ve written works with vision once you swap the content format.
Sending Images to Claude (Anthropic Vision)
Claude’s JSON structure differs in three ways. First, the image goes in an image content block with a source object — not an image_url with a data URL. Second, auth uses x-api-key header (not Bearer). Third, you must include an anthropic-version header.
The endpoint is https://api.anthropic.com/v1/messages. Notice the source block with type: "base64", separate media_type, and raw data (no data URL prefix).
ANTHROPIC_KEY = os.environ.get("ANTHROPIC_API_KEY")
def ask_claude_vision(image_path, prompt):
"""Send an image + text to Claude. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
payload = {
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": [
{"type": "image", "source": {
"type": "base64",
"media_type": mime,
"data": img_b64
}},
{"type": "text", "text": prompt}
]
}]
}
headers = {
"x-api-key": ANTHROPIC_KEY,
"anthropic-version": "2023-06-01",
"Content-Type": "application/json"
}
resp = requests.post(
"https://api.anthropic.com/v1/messages",
headers=headers, json=payload
)
resp.raise_for_status()
return resp.json()["content"][0]["text"]
The response path differs too: content[0].text instead of OpenAI’s choices[0].message.content. This trips people up when switching providers.
result = ask_claude_vision(
"test_photo.jpg",
"Describe this image in one sentence. Classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
Sending Images to Gemini (Google Vision)
Gemini uses a third format. The endpoint URL includes the model name: https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent. The API key goes as a query parameter (?key=...), not in the header.
Image data sits in an inline_data block with mime_type and data fields. The overall structure uses contents (not messages) with parts (not content). The nesting is the most common source of 400 errors with Gemini.
GEMINI_KEY = os.environ.get("GEMINI_API_KEY")
def ask_gemini_vision(image_path, prompt):
"""Send an image + text to Gemini. Return the response."""
img_b64, mime = encode_file_to_base64(image_path)
url = (
"https://generativelanguage.googleapis.com/v1beta/"
f"models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}"
)
payload = {
"contents": [{
"parts": [
{"inline_data": {"mime_type": mime, "data": img_b64}},
{"text": prompt}
]
}]
}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
result = ask_gemini_vision(
"test_photo.jpg",
"Describe this image in one sentence. Classify it as: "
"photo, chart, diagram, screenshot, document, or receipt."
)
print(result)
Three providers, three JSON formats, three response paths. But the core idea is identical every time.
Comparing Vision Capabilities Across Providers
Before we dive into real tasks, here’s a quick-reference table. The payload formats are different, but the pattern is always the same: base64 image + text prompt in, text response out.
| Element | OpenAI (GPT-4o) | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| Endpoint | /v1/chat/completions | /v1/messages | /v1beta/models/{model}:generateContent |
| Auth | Bearer <key> header | x-api-key header + version | ?key=<key> query param |
| Image block | image_url.url: "data:...;base64,..." | image.source: {type: "base64", data: ...} | inline_data: {mime_type: ..., data: ...} |
| Response path | choices[0].message.content | content[0].text | candidates[0].content.parts[0].text |
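The response-path row is the one that bites hardest when switching providers. A small sketch that centralizes all three paths in one place (extract_text is our own helper name, not a library function):

```python
def extract_text(provider, response_json):
    """Return the model's text from each provider's response shape."""
    if provider == "openai":
        return response_json["choices"][0]["message"]["content"]
    if provider == "anthropic":
        return response_json["content"][0]["text"]
    if provider == "google":
        return response_json["candidates"][0]["content"]["parts"][0]["text"]
    raise ValueError(f"Unknown provider: {provider}")

# Minimal fake responses illustrating the three shapes:
openai_resp = {"choices": [{"message": {"content": "a photo"}}]}
claude_resp = {"content": [{"text": "a photo"}]}
print(extract_text("openai", openai_resp))  # a photo
print(extract_text("anthropic", claude_resp))  # a photo
```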
And here’s how they compare on real tasks:
| Task | Best Provider | Why |
|---|---|---|
| General image description | GPT-4o | Most detailed, natural phrasing |
| Chart/graph data extraction | Claude | Most accurate number reading |
| Receipt/document OCR | Claude | Best structured JSON from documents |
| Audio transcription | Gemini | Only provider with native audio |
| Speed | Gemini Flash | Consistently fastest |
| Cost per image call | Gemini Flash | ~$0.002 per image |
KEY INSIGHT: All three vision APIs use the same pattern: base64-encoded image bytes inside a JSON message alongside text. The difference is just field names and nesting. Once you understand one, the others take 10 minutes to adapt.
Predict the output: If you send a chart image to Claude using image_url format (OpenAI’s format) instead of image + source format, what happens? (Answer: a 400 error. Claude doesn’t recognize the image_url content type.)
Building a Unified Multimodal Function
Remembering three JSON formats gets tedious. Let’s build one function that handles all providers. You pass the provider name, file path, and prompt. The function picks the right format, sends the request, and extracts the response.
We’ll build it in two parts. First, the routing logic. Then we’ll add the actual HTTP calls.
def ask_multimodal(provider, file_path, prompt, **kwargs):
"""Unified multimodal call to any provider."""
file_b64, mime = encode_file_to_base64(file_path)
max_tokens = kwargs.get("max_tokens", 1024)
if provider == "openai":
return _call_openai(file_b64, mime, prompt, max_tokens, kwargs)
elif provider == "anthropic":
return _call_anthropic(file_b64, mime, prompt, max_tokens, kwargs)
elif provider == "google":
return _call_google(file_b64, mime, prompt, kwargs)
else:
raise ValueError(f"Unknown provider: {provider}")
Each internal function builds the provider-specific payload. Here’s OpenAI and Anthropic:
def _call_openai(b64, mime, prompt, max_tokens, kwargs):
detail = kwargs.get("detail", "auto")
payload = {
"model": kwargs.get("model", "gpt-4o"),
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime};base64,{b64}", "detail": detail
}}
]}]
}
headers = {"Authorization": f"Bearer {OPENAI_KEY}", "Content-Type": "application/json"}
resp = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
def _call_anthropic(b64, mime, prompt, max_tokens, kwargs):
payload = {
"model": kwargs.get("model", "claude-sonnet-4-20250514"),
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": mime, "data": b64}},
{"type": "text", "text": prompt}
]}]
}
headers = {"x-api-key": ANTHROPIC_KEY, "anthropic-version": "2023-06-01", "Content-Type": "application/json"}
resp = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["content"][0]["text"]
And Google:
def _call_google(b64, mime, prompt, kwargs):
model = kwargs.get("model", "gemini-2.5-flash")
url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={GEMINI_KEY}"
payload = {"contents": [{"parts": [
{"inline_data": {"mime_type": mime, "data": b64}},
{"text": prompt}
]}]}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
Now every call looks the same:
result_gpt = ask_multimodal("openai", "chart.png", "Analyze this chart")
result_claude = ask_multimodal("anthropic", "receipt.jpg", "Extract receipt data as JSON")
result_gemini = ask_multimodal("google", "audio.wav", "Transcribe this audio")
Exercise 1: Classify an Image with Two Providers (intermediate)
Use the ask_multimodal() function to classify an image with both OpenAI and Anthropic, and print both results. The prompt should ask: "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."
Starter code:
# Classify test_photo.jpg with two providers
prompt = "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."

# Call OpenAI
result_openai = ask_multimodal(  # finish this line

# Call Anthropic
result_anthropic = ask_multimodal(  # finish this line

print(f"OpenAI: {result_openai}")
print(f"Claude: {result_anthropic}")
Hints:
- The first argument is the provider name: "openai" or "anthropic".
- The full call looks like: ask_multimodal("openai", "test_photo.jpg", prompt)
Solution:
prompt = "Classify this image as: photo, chart, document, receipt. Return ONLY the category name."
result_openai = ask_multimodal("openai", "test_photo.jpg", prompt)
result_anthropic = ask_multimodal("anthropic", "test_photo.jpg", prompt)
print(f"OpenAI: {result_openai}")
print(f"Claude: {result_anthropic}")
We call ask_multimodal() twice with different provider strings. The function handles the JSON differences internally; both calls return plain text.
Real Task 1: Chart and Diagram Analysis
Describing photos is a demo. The real power shows up with data. Let’s analyze a bar chart — the kind you’d export from matplotlib or a dashboard.
We’ll create a chart with matplotlib, save it to disk, then send it to GPT-4o for analysis. This way you can reproduce the exact result.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [420, 510, 480, 620]
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(quarters, revenue, color="steelblue")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue ($K)")
ax.set_title("2025 Quarterly Revenue")
ax.set_ylim(0, 700)
plt.tight_layout()
plt.savefig("quarterly_chart.png", dpi=100)
plt.close()
print("Chart saved to quarterly_chart.png")
Chart saved to quarterly_chart.png
We ask GPT-4o a specific business question. Not just “describe this” but “what’s the trend and which quarter grew the most?”
analysis = ask_gpt4o_vision(
"quarterly_chart.png",
"Analyze this bar chart. Answer:\n"
"1. What is the overall revenue trend?\n"
"2. Which quarter had the highest revenue?\n"
"3. Which quarter-over-quarter change was largest?\n"
"Give specific numbers from the chart."
)
print(analysis)
The model reads axis labels, estimates bar heights, and answers with numbers. It won’t be pixel-perfect — vision models approximate values. But for quick chart summarization, it’s surprisingly accurate.
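Because we generated the chart ourselves, we can sanity-check the model's reading against the true values. A rough hypothetical check (check_extracted_numbers is our own name, and this is a crude heuristic, not a rigorous evaluation):

```python
import re

def check_extracted_numbers(analysis_text, true_values, tolerance=0.05):
    """Fraction of known chart values that appear in the model's answer, within tolerance."""
    found = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", analysis_text)]
    hits = [v for v in true_values
            if any(abs(f - v) / v <= tolerance for f in found)]
    return len(hits) / len(true_values)

# Check a (made-up) answer against the revenue data used to draw the chart:
score = check_extracted_numbers("Revenue rose from 420 in Q1 to 620 in Q4.",
                                [420, 510, 480, 620])
print(score)  # 0.5: the answer mentioned 2 of the 4 values
```

For anything load-bearing, a check like this belongs in the pipeline rather than in your head.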
Real Task 2: Receipt OCR and Data Extraction
Receipt OCR is where multimodal AI truly beats traditional pipelines. Old-school OCR (Tesseract, AWS Textract) extracts raw text but can’t understand it. A multimodal LLM reads the receipt and structures the data in one step.
The key technique: tell the model the exact JSON schema you want. This works across all three providers.
RECEIPT_PROMPT = """Extract data from this receipt as JSON.
Return ONLY valid JSON with this structure:
{
"store_name": "...",
"date": "YYYY-MM-DD",
"items": [{"name": "...", "quantity": 1, "price": 0.00}],
"subtotal": 0.00,
"tax": 0.00,
"total": 0.00,
"payment_method": "..."
}
If any field is not visible, use null."""
receipt_data = ask_gpt4o_vision("sample_receipt.jpg", RECEIPT_PROMPT)
print(receipt_data)
The model returns itemized JSON. No regex. No template matching. Just “here’s an image, give me JSON.”
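One practical wrinkle: models sometimes wrap the JSON in markdown fences even when told not to, which breaks a naive json.loads(). A defensive parser sketch (parse_json_response is our own helper name):

```python
import json
import re

def parse_json_response(raw):
    """Strip optional markdown code fences, then parse the JSON inside."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)

# Works whether or not the model added fences:
print(parse_json_response('```json\n{"total": 42.10}\n```')["total"])  # 42.1
print(parse_json_response('{"total": 42.10}')["total"])  # 42.1
```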
Want to compare all three providers? Loop through them:
for name, func in [("GPT-4o", ask_gpt4o_vision),
("Claude", ask_claude_vision),
("Gemini", ask_gemini_vision)]:
print(f"\n{'='*40}")
print(f" {name} Receipt Extraction")
print(f"{'='*40}")
try:
print(func("sample_receipt.jpg", RECEIPT_PROMPT))
except Exception as e:
print(f"Error: {e}")
Claude tends to be the most precise with receipt details. GPT-4o is solid but sometimes rounds numbers. Gemini is fastest but occasionally adds items that aren’t on the receipt.
KEY INSIGHT: The real power isn’t raw OCR accuracy. It’s that you describe the output format in plain English, and the model both reads the document and structures the data. Traditional OCR gives raw text. Multimodal AI gives parsed, structured data.
Exercise 2: Build a Receipt Parser (intermediate)
Write a function parse_receipt(image_path, provider="openai") that sends a receipt image to the specified provider and returns a Python dictionary with keys store_name, total, and item_count. Call ask_multimodal() with a JSON-requesting prompt, then parse the result with json.loads().
Hints:
- Use ask_multimodal(provider, image_path, prompt) for the API call.
- Parse the response with data = json.loads(raw_response).
Solution:
import json

def parse_receipt(image_path, provider="openai"):
    prompt = """Extract from this receipt. Return ONLY valid JSON:
    {"store_name": "...", "total": 0.00, "item_count": 0}"""
    raw_response = ask_multimodal(provider, image_path, prompt)
    return json.loads(raw_response)

result = parse_receipt("sample_receipt.jpg")
print(f"Store: {result.get('store_name', 'unknown')}")
We pass the provider and image path to ask_multimodal() with a JSON-requesting prompt. The model returns a JSON string, and json.loads() converts it to a dict. The structured prompt is the key technique.
Real Task 3: Audio Transcription with Gemini
Here’s where Gemini pulls ahead. GPT-4o and Claude don’t accept audio in their chat endpoints. You’d need OpenAI’s Whisper API or a third-party service. Gemini handles audio natively — same endpoint, same structure.
Our encode_file_to_base64 function already handles audio files. Just pass a .wav or .mp3 path. Gemini accepts WAV, MP3, AIFF, AAC, OGG, and FLAC.
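The helper's mime_map only covers wav and mp3. If you work with the other formats Gemini accepts, you'd extend it along these lines (MIME strings as we understand them from Google's audio documentation; verify against the current docs):

```python
# Extension of the earlier mime_map for Gemini's supported audio formats.
audio_mime_map = {
    "wav": "audio/wav", "mp3": "audio/mp3",
    "aiff": "audio/aiff", "aac": "audio/aac",
    "ogg": "audio/ogg", "flac": "audio/flac",
}
print(audio_mime_map["flac"])  # audio/flac
```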
The audio goes into the same inline_data block as images — just with an audio MIME type.
def transcribe_audio_gemini(audio_path, prompt=None):
"""Transcribe audio using Gemini's native support."""
if prompt is None:
prompt = (
"Transcribe this audio exactly. "
"Include speaker labels if multiple speakers."
)
audio_b64, mime = encode_file_to_base64(audio_path)
url = (
"https://generativelanguage.googleapis.com/v1beta/"
f"models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}"
)
payload = {"contents": [{"parts": [
{"inline_data": {"mime_type": mime, "data": audio_b64}},
{"text": prompt}
]}]}
resp = requests.post(url, json=payload)
resp.raise_for_status()
return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
Transcribe any audio clip:
transcript = transcribe_audio_gemini("meeting_clip.wav")
print(transcript)
You can go beyond transcription. Ask Gemini to summarize, extract action items, or detect the language:
summary = transcribe_audio_gemini(
"meeting_clip.wav",
"Listen to this audio. Provide:\n"
"1. A 2-sentence summary\n"
"2. Key action items\n"
"3. Overall tone (formal, casual, urgent)"
)
print(summary)
Sending Multiple Images in One Request
All three providers support multiple images per request. Add more image blocks to the content array. This enables image comparison and batch analysis.
Here’s how to compare two images with GPT-4o. We add two image_url blocks in the same message:
def compare_images_gpt4o(path_1, path_2, prompt):
"""Send two images to GPT-4o for comparison."""
img1_b64, mime1 = encode_file_to_base64(path_1)
img2_b64, mime2 = encode_file_to_base64(path_2)
payload = {
"model": "gpt-4o", "max_tokens": 1024,
"messages": [{"role": "user", "content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {
"url": f"data:{mime1};base64,{img1_b64}", "detail": "auto"
}},
{"type": "image_url", "image_url": {
"url": f"data:{mime2};base64,{img2_b64}", "detail": "auto"
}}
]}]
}
headers = {"Authorization": f"Bearer {OPENAI_KEY}", "Content-Type": "application/json"}
resp = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
A practical use: comparing dashboard versions or design mockups.
comparison = compare_images_gpt4o(
"dashboard_v1.png", "dashboard_v2.png",
"Compare these dashboards. List specific differences."
)
print(comparison)
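Scaling past two images is just more blocks in the content array. A sketch that builds one OpenAI-style block per already-encoded image (build_image_blocks is our own name; it takes (base64, mime) pairs so it stays independent of file I/O):

```python
def build_image_blocks(encoded_images, detail="auto"):
    """encoded_images: list of (base64_string, mime_type) pairs -> content blocks."""
    return [
        {"type": "image_url",
         "image_url": {"url": f"data:{mime};base64,{b64}", "detail": detail}}
        for b64, mime in encoded_images
    ]

blocks = build_image_blocks([("AAAA", "image/png"), ("BBBB", "image/jpeg")])
print(blocks[0]["image_url"]["url"])  # data:image/png;base64,AAAA
```

Prepend a text block with your prompt and drop the list into the messages payload as before. Keep an eye on total payload size: every image adds its full base64 weight to the request.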
Error Handling for Multimodal Calls
Multimodal requests fail more often than text calls. Payloads are larger, timeout risks are higher, and each provider returns errors differently. Here’s a wrapper with retry logic and provider-specific error handling.
import time
def ask_multimodal_safe(provider, file_path, prompt, max_retries=3):
"""Multimodal call with retries and error handling."""
for attempt in range(max_retries):
try:
return ask_multimodal(provider, file_path, prompt)
except requests.exceptions.HTTPError as e:
status = e.response.status_code
if status == 429:
wait = 2 ** attempt
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
elif status == 413:
print(f"File too large for {provider}.")
return None
else:
print(f"HTTP {status}: {e.response.text[:200]}")
return None
except requests.exceptions.Timeout:
time.sleep(2 ** attempt)
continue
print(f"Failed after {max_retries} attempts.")
return None
Use it the same way:
result = ask_multimodal_safe("openai", "large_image.jpg", "Describe this")
if result:
print(result)
else:
print("Call failed. Check errors above.")
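Many 413 errors are avoidable with a preflight check before you even encode. Base64 inflates size by roughly a third (every 3 raw bytes become 4 characters), so a file near the limit will overshoot it. A sketch, assuming a 20 MB cap:

```python
import os

def within_size_limit(file_path, max_mb=20):
    """Estimate the base64 payload size and compare it to a provider limit."""
    raw = os.path.getsize(file_path)
    b64_size = (raw + 2) // 3 * 4  # base64: 3 raw bytes -> 4 output characters
    return b64_size <= max_mb * 1024 * 1024
```

Call it before ask_multimodal_safe() and resize the image if it returns False, rather than burning a failed API call to find out.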
When NOT to Use Multimodal LLMs
Multimodal AI is powerful but not always the right tool. Here are cases where you should reach for something else.
High-volume OCR (thousands of documents). LLM vision calls cost $0.01-0.05 per image. Processing 10,000 invoices costs $100-500. Traditional OCR services (AWS Textract, Google Document AI) cost a fraction. Use the LLM only when you need understanding, not just extraction.
Real-time video analysis. Vision APIs process one frame at a time. At 30 fps, you’d make 30 calls per second. That’s expensive and too slow. Use dedicated CV models (YOLO, MediaPipe) for real-time work.
Medical or legal decisions. Vision LLMs hallucinate. A chart value of 420 might be read as 410. Use these models as assistants that flag items for human review — never as the sole decision maker.
Pixel-precise measurements. Need exact coordinates or bounding boxes? Use a dedicated CV model. LLMs describe what they see in language. They don’t reliably return coordinates.
Common Mistakes and How to Fix Them
Mistake 1: No max_tokens on vision calls
Wrong:
payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": [...]}]}
Why: OpenAI vision calls sometimes default to very short outputs (as low as 16 tokens). Your response gets cut off.
Fix:
payload = {"model": "gpt-4o", "max_tokens": 1024, "messages": [...]}
Mistake 2: Data URL prefix in Claude’s data field
Wrong:
"data": f"data:{mime};base64,{img_b64}" # WRONG for Claude
Why: Claude expects raw base64 in data. Adding the data URL prefix causes a 400 error. OpenAI uses the prefix. Claude does not.
Fix:
"data": img_b64 # Raw base64 string, no prefix
Mistake 3: Header auth for Gemini
Wrong:
headers = {"Authorization": f"Bearer {GEMINI_KEY}"} # Nope
Why: Gemini uses a query parameter for auth, not a header. OpenAI and Claude use headers. Mixing them up causes 401 or 403 errors that are easy to misdiagnose.
Fix: Put the key in the URL: ?key={GEMINI_KEY}
Summary and Practice Exercise
You built a complete multimodal pipeline from scratch:
- Base64 encoding — converting images and audio to JSON-safe strings
- Three vision APIs — GPT-4o, Claude, Gemini with different JSON but the same core pattern
- Four real tasks — classification, chart analysis, receipt OCR, audio transcription
- A unified wrapper — one function for all providers
- Error handling — retries, rate limits, payload checks
The field moves fast. OpenAI is adding native audio. Anthropic is expanding PDF support. Google keeps adding modalities. But the base64-in-JSON pattern you learned here carries forward.
Frequently Asked Questions
Can I send a PDF directly to these APIs?
Claude and Gemini accept PDF files natively. Base64-encode the PDF and set MIME type to application/pdf. GPT-4o doesn’t support PDF directly — convert each page to an image first using pdf2image, then send the page images.
How much does a vision API call cost?
GPT-4o charges by resolution: ~$0.0075 for 512×512 “low detail,” up to ~$0.03 for high-resolution. Claude charges per image token (~$0.01-0.04 per image). Gemini 2.5 Flash is cheapest at ~$0.002 per image. Audio on Gemini costs ~$0.001 per 15 seconds.
Can these models read handwriting?
Yes, but accuracy depends on legibility. Printed text hits 95%+ across all three. Neat handwriting drops to 85-90%. Messy handwriting sits around 60-75%. Claude tends to perform best on handwritten documents.
URL vs base64 — which should I use?
With a URL, the API fetches the image from the internet. With base64, you embed it in the request. Use URLs for public images (faster, smaller payload). Use base64 for local or private files.
How do I handle images larger than 20 MB?
Resize before encoding. Use Pillow: img = Image.open("huge.jpg"); img.thumbnail((2048, 2048)); img.save("resized.jpg", quality=85). This gets most photos under 1 MB.