Gemini API Tutorial: Multimodal AI in Python

Build a multimodal document analyzer with the Google Gemini API in Python. Analyze images, PDFs, and text with structured JSON output — using raw HTTP requests.

Written by Selva Prabhakaran | 27 min read

Send text, images, and PDFs to one API endpoint and get structured answers back — using raw HTTP requests you can run anywhere.

You hand an AI model a scanned invoice, a product photo, and a paragraph of text. It reads all three and answers your question in valid JSON. No separate OCR step. No image-to-text pipeline. One API call handles it all.

That’s the Gemini API’s superpower — native multimodal input. In this article, you’ll build a document analyzer that uses it.

You’ll send text prompts, analyze images, and extract data from PDFs. You’ll configure safety filters, ground responses with live Google Search, and force JSON replies.

We’ll use raw HTTP requests to generativelanguage.googleapis.com. No SDK needed. Every code block runs in Pyodide or any standard Python environment.

What Is the Gemini API?

The Gemini API is Google’s way to access its Gemini family of large language models. What makes Gemini different from text-only models? It was trained on text, images, audio, and video from day one. Multimodal understanding isn’t an add-on. It’s part of the core design.

You talk to it through a single REST endpoint. Send a JSON payload with your content — text, base64 images, PDF data — and get JSON back.

https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent

You authenticate with an API key as a query parameter. No OAuth flows or service accounts needed for basic usage.

KEY INSIGHT: Unlike text-only APIs, Gemini’s contents array can mix text and binary data in one request. Describe what you want in text, attach the file as base64, and the model reasons across both at once.

Setting Up: API Key and First Request

Before writing code, you need a Gemini API key. I recommend getting this done first — nothing worse than writing code and then waiting for key provisioning.

Prerequisites

  • Python version: 3.9+
  • Required libraries: None beyond the standard library (urllib.request and json)
  • API key: Free at aistudio.google.com
  • Time to complete: 25-30 minutes

Getting Your API Key

Go to Google AI Studio and click “Create API Key.” Copy it and store it safely. You’ll pass it as a query parameter in every request.

WARNING: Never hardcode API keys in scripts you share or commit. Use environment variables or a .env file. Here, we assign it to a variable at the top so you can swap it easily.
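If you prefer the environment-variable route, here is a minimal sketch using only the standard library. The variable name `GEMINI_API_KEY` is a convention, not something the API requires:

```python
import os

# Read the key from an environment variable instead of hardcoding it.
# Set it first in your shell, e.g.: export GEMINI_API_KEY="..."
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_API_KEY_HERE")
```

The fallback value keeps the examples below runnable; in production, fail loudly if the variable is missing instead.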

Let’s make our first call. The code below builds a JSON payload with a text prompt. It sends an HTTP POST to the Gemini API and parses the response. We use urllib.request so it works without pip installs — even inside Pyodide.

import urllib.request
import json
import base64

# Replace with your actual API key
GEMINI_API_KEY = "YOUR_API_KEY_HERE"
MODEL = "gemini-2.5-flash"
BASE_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}"

def gemini_request(endpoint, payload):
    """Send a POST request to the Gemini API and return parsed JSON."""
    url = f"{BASE_URL}:{endpoint}?key={GEMINI_API_KEY}"
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Simple text generation
payload = {
    "contents": [{
        "parts": [{"text": "What are the three main types of machine learning? One sentence each."}]
    }]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"])

The response has a candidates array. Each candidate has a content object with parts. For text, the first part’s text field holds the answer.

Notice the request body. The contents array holds conversation turns. Each turn has a parts array. A part can be text, an image, or a file. This structure powers everything we’ll build.

Here’s the full shape of a request — keep this mental map handy:

# Anatomy of a generateContent request (reference — not runnable)
request_shape = {
    "contents": [{"role": "user", "parts": [...]}],  # Conversation turns
    "generationConfig": {"temperature": 0.7},          # Output controls
    "safetySettings": [...],                            # Content filters
    "tools": [...]                                      # Search grounding, etc.
}

Four keys matter. contents holds the conversation. generationConfig controls temperature and token limits. safetySettings sets filter thresholds. tools enables search grounding. We’ll use all four.

TIP: Set temperature between 0.0 and 0.3 for factual extraction. Use 0.7-1.0 for creative tasks like brainstorming.
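Here's what a request using all four keys might look like. The specific values (temperature, token cap, threshold) are illustrative, not recommendations:

```python
# A payload sketch touching all four top-level keys.
payload = {
    "contents": [{"role": "user", "parts": [{"text": "Summarize this contract."}]}],
    "generationConfig": {
        "temperature": 0.2,       # low, for factual extraction
        "maxOutputTokens": 512,   # cap the response length
    },
    "safetySettings": [
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
         "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
    ],
    "tools": [],  # e.g. [{"google_search": {}}] to enable grounding
}
```

Only `contents` is required; the other three can be omitted entirely when you want the defaults.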

Multi-Turn Conversations with the Gemini API

So far we’ve sent single-turn requests. But what if you need the model to remember context? Maybe you upload a document, ask a question, then ask a follow-up.

The contents array supports multiple turns. Each turn has a role — either "user" or "model". Include the model’s previous response in the array. The model sees the full history and answers in context.

payload = {
    "contents": [
        {"role": "user", "parts": [{"text": "What is pandas in Python?"}]},
        {"role": "model", "parts": [{"text": "Pandas is a data analysis library for Python."}]},
        {"role": "user", "parts": [{"text": "What's its most important data structure?"}]}
    ]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"])

The model builds on context. It knows “its” refers to pandas because it can see the full exchange. You can mix images into any turn too. Send an image in turn 1, then ask text questions in turns 2 and 3.
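A sketch of that mixed pattern: an image in turn 1, a text-only follow-up in turn 3. The placeholder stands in for a real base64 string (the sample prompts and model reply here are made up):

```python
image_base64 = "..."  # placeholder: a base64-encoded JPEG, as shown later

payload = {
    "contents": [
        # Turn 1: text question plus the image
        {"role": "user", "parts": [
            {"text": "What product is shown here?"},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
        ]},
        # Turn 2: the model's earlier reply, echoed back for context
        {"role": "model", "parts": [{"text": "It appears to be a laptop."}]},
        # Turn 3: a follow-up that relies on the image from turn 1
        {"role": "user", "parts": [{"text": "Roughly what screen size?"}]}
    ]
}
```

The image travels with the history on every request, so large files inflate every follow-up call; trim old turns when you no longer need them.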

Handling Gemini API Errors

API calls fail. Rate limits, bad keys, server errors — you’ll hit all of them eventually. Here’s how to handle the common ones gracefully.

The Gemini API returns standard HTTP status codes. Common ones: 400 (bad request), 403 (invalid key), 429 (rate limit), and 500 (server error). Let’s wrap our request function with error handling using urllib.error.

from urllib.error import HTTPError, URLError

def safe_gemini_request(endpoint, payload):
    """Gemini API call with error handling."""
    url = f"{BASE_URL}:{endpoint}?key={GEMINI_API_KEY}"
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except HTTPError as e:
        body = e.read().decode("utf-8", errors="replace")
        print(f"HTTP {e.code}: {body[:200]}")
        return None
    except URLError as e:
        print(f"Connection error: {e.reason}")
        return None

# Try it — to exercise the error path, temporarily swap in a bad key above
result = safe_gemini_request("generateContent", payload)
if result:
    print(result["candidates"][0]["content"]["parts"][0]["text"])
else:
    print("Request failed — check the error above.")

TIP: For rate limit errors (429), wait and retry. Google’s free tier allows roughly 15 requests per minute. In production, add exponential backoff — wait 1 second, then 2, then 4.
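That backoff pattern can be sketched as a small wrapper. `send` is any callable (for example, a function built on `gemini_request`) that raises an exception carrying a `.code` attribute of 429 on rate limits; the helper itself is an assumption, not part of the API:

```python
import time

def request_with_backoff(send, payload, max_retries=4, base_delay=1.0):
    """Retry send(payload) with exponential backoff on HTTP 429."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return send(payload)
        except Exception as e:
            # Re-raise anything that isn't a rate limit, or the final failure
            if getattr(e, "code", None) != 429 or attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...
```

With `base_delay=1.0` the waits are 1, 2, then 4 seconds, which comfortably fits the free tier's per-minute limit.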

Analyzing Images with the Gemini API

Here’s where Gemini’s multimodal capability earns its name. Send an image with a text prompt, and the model reasons about both together. No separate vision API. No preprocessing.

How do you send an image? Encode it as base64 and add it as an inline_data part next to your text prompt. The mime_type tells Gemini what kind of file it is. Let’s download a sample image, encode it, and ask Gemini to describe it.

# Download a sample image for analysis
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/ARA_San_Juan_search_area.jpg/640px-ARA_San_Juan_search_area.jpg"
img_req = urllib.request.Request(image_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(img_req) as resp:
    image_bytes = resp.read()

image_base64 = base64.b64encode(image_bytes).decode("utf-8")

payload = {
    "contents": [{
        "parts": [
            {"text": "Describe this image in 2-3 sentences."},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
        ]
    }]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"])

Text and image sit side by side in the same parts array. Gemini sees them as one context. Ask “What color is the largest object?” or “How many people are visible?” and it answers from what it sees.

What about multiple images? Just add more inline_data parts. Both images go into the same array. Gemini handles them together.

# Compare two images in one request (using same image twice as demo)
payload = {
    "contents": [{
        "parts": [
            {"text": "I'm sending two images. How do they differ?"},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
        ]
    }]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"])

When I first tested this with invoice scans, the results surprised me. Gemini read handwritten totals, spotted logos, and matched line items — all in one call.

COMMON MISTAKE: Sending an image without the correct mime_type causes the API to reject the request or produce garbage. Always match it: image/jpeg for JPGs, image/png for PNGs, image/webp for WebP.
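If you're loading local files, the standard library can pick the MIME type for you. This small helper is a convenience sketch, not part of the Gemini API:

```python
import mimetypes

def guess_mime(filename):
    """Guess a MIME type from the file extension; fall back to a safe default."""
    mime, _ = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"
```

`guess_mime("scan.pdf")` yields `"application/pdf"` and `guess_mime("photo.jpg")` yields `"image/jpeg"`, which plug straight into the `inline_data` part.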

Now you know how to analyze images. Let’s see if you can build on that.

Exercise 1: Build an Image Captioner

Write a function called caption_image that takes a base64-encoded image and returns a JSON object with two fields: caption (one-sentence description) and mood (emotional tone). Use gemini_request and structured output.

Hint 1

Set `responseMimeType` to `"application/json"` inside `generationConfig`. Define a `responseSchema` with `caption` (string) and `mood` (string) as required properties.

Hint 2

Your payload needs `contents` with a text part and an `inline_data` part, plus a `generationConfig` block. The text prompt should ask for a caption and mood.

Solution
def caption_image(image_b64, mime_type="image/jpeg"):
    schema = {
        "type": "object",
        "properties": {
            "caption": {"type": "string"},
            "mood": {"type": "string"}
        },
        "required": ["caption", "mood"]
    }
    payload = {
        "contents": [{"parts": [
            {"text": "Describe this image in one sentence and identify its emotional mood."},
            {"inline_data": {"mime_type": mime_type, "data": image_b64}}
        ]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema
        }
    }
    result = gemini_request("generateContent", payload)
    return json.loads(
        result["candidates"][0]["content"]["parts"][0]["text"]
    )

# Test it
output = caption_image(image_base64)
print(json.dumps(output, indent=2))

**Explanation:** The `responseSchema` forces Gemini to return valid JSON matching your structure. The `responseMimeType` switches the model from free-text to JSON mode. We parse the result with `json.loads()` to get a Python dict.

Analyzing PDFs with the Gemini API

PDFs are where a document analyzer earns its keep. Gemini accepts PDF data directly — no need to convert pages to images first. Encode the bytes as base64, set MIME type to application/pdf, and send it like an image.

The model reads text, sees charts, and gets page layouts. For a data team, this means you can pull tables from papers or summarize a 50-page spec — all in one API call.

Here’s how. We’ll download a public PDF, encode it, and ask Gemini to extract information. The inline_data approach is identical to images. Only the mime_type changes.

# Download a sample PDF
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
pdf_req = urllib.request.Request(pdf_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(pdf_req) as resp:
    pdf_bytes = resp.read()

pdf_base64 = base64.b64encode(pdf_bytes).decode("utf-8")

payload = {
    "contents": [{
        "parts": [
            {"text": "Read this PDF and list its contents in 2-3 bullet points."},
            {"inline_data": {"mime_type": "application/pdf", "data": pdf_base64}}
        ]
    }]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"])

You can get very specific with your prompts. Try “Extract every date and dollar amount” or “List all section headings.” The model handles structured extraction well — especially with JSON mode (coming up next).

KEY INSIGHT: Gemini processes PDFs using vision — it “sees” each page as an image while also reading embedded text. Scanned documents work just as well as digital ones.

Gemini API Safety Settings

Every Gemini response passes through content safety filters. By default, the API blocks content it sees as harmful. Sometimes the filters are too strict for real work — like analyzing medical docs or security reports.

You control the threshold for each category. Here are the four categories:

| Category | What It Catches |
|---|---|
| HARM_CATEGORY_HARASSMENT | Threats, bullying, targeted insults |
| HARM_CATEGORY_HATE_SPEECH | Slurs, discrimination |
| HARM_CATEGORY_SEXUALLY_EXPLICIT | Sexual content |
| HARM_CATEGORY_DANGEROUS_CONTENT | Instructions for harm, weapons |

Each accepts a threshold: BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, or BLOCK_LOW_AND_ABOVE. Default is BLOCK_MEDIUM_AND_ABOVE.

Let’s lower the filter to block only the worst content. The safetySettings array goes at the top level, next to contents. This helps with security topics that would trip the default filters.

payload = {
    "contents": [{
        "parts": [{"text": "Explain common cybersecurity attack vectors and defenses."}]
    }],
    "safetySettings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"}
    ]
}

result = gemini_request("generateContent", payload)
print(result["candidates"][0]["content"]["parts"][0]["text"][:500])

If a response gets blocked, the candidates array has a finishReason of "SAFETY" instead of "STOP". Always check for this in production.
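A defensive parsing helper along those lines might look like this (the function name is mine; the field names follow the generateContent response shape described above):

```python
def extract_text(result):
    """Return the model's text, or None if safety filters blocked the response."""
    candidate = result["candidates"][0]
    if candidate.get("finishReason") == "SAFETY":
        return None
    return candidate["content"]["parts"][0]["text"]
```

Returning `None` instead of crashing lets callers decide how to handle a blocked response.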

WARNING: BLOCK_NONE removes all safety filtering. Only use it in controlled environments. Google may revoke API access if generated content violates their policies — even with filters disabled.

Gemini API Grounding with Google Search

Every LLM has a training data cutoff. Ask about yesterday’s stock price, and the model either refuses or makes something up. Sound familiar?

Grounding fixes this. Add a tools parameter, and Gemini searches Google in real time before answering. The response includes citations you can verify.

Why does this matter for document analysis? Say you’re reading a company’s annual report and want to check their revenue against the latest public figure. Grounding lets the model fetch live data for that comparison.

The setup is minimal. Add "tools": [{"google_search": {}}] to the request body. Gemini decides when to search based on your prompt.

payload = {
    "contents": [{
        "parts": [{"text": "What is Google's current stock price and market cap?"}]
    }],
    "tools": [{"google_search": {}}]
}

result = gemini_request("generateContent", payload)
text = result["candidates"][0]["content"]["parts"][0]["text"]
print(text[:500])

# Check for grounding metadata
if "groundingMetadata" in result["candidates"][0]:
    sources = result["candidates"][0]["groundingMetadata"]
    print("\n--- Grounding Sources ---")
    print(json.dumps(sources, indent=2)[:500])

The response includes groundingMetadata — the search queries and web sources Gemini cited. Log these for transparency.

One thing to know: grounding adds latency. Each grounded request takes 2-5 extra seconds for the search step. For bulk processing, only turn it on when you need current data.

TIP: Combine grounding with multimodal input. Send a PDF of a quarterly report plus google_search, and ask Gemini to fact-check the numbers. It reads the PDF, searches the web, and gives you a comparison.

Gemini Structured Output: JSON Mode

Free-text responses are fine for chatbots. For a document analyzer, you want structured data — JSON you can feed into a database or dashboard.

Gemini handles this natively. Set responseMimeType to "application/json" in generationConfig. Add a schema if you want. The model returns valid JSON that matches your schema. No regex needed.

Here’s the real power: combine structured output with images or PDFs. Send an image and tell Gemini to fill a JSON template. The responseSchema defines field names and types. Gemini fills in the values from what it sees.

payload = {
    "contents": [{
        "parts": [
            {"text": "Analyze this image and extract information."},
            {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
        ]
    }],
    "generationConfig": {
        "responseMimeType": "application/json",
        "responseSchema": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "objects_detected": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "dominant_colors": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "setting": {"type": "string"}
            },
            "required": ["description", "objects_detected"]
        }
    }
}

result = gemini_request("generateContent", payload)
data = json.loads(result["candidates"][0]["content"]["parts"][0]["text"])
print(json.dumps(data, indent=2))

The model returns a JSON string that fits your schema. Parse it with json.loads() to get a Python dict. Every field matches the type you asked for.

You can skip responseSchema and just describe the format in your prompt. But the schema is more reliable. I prefer it because the model can’t stray from your structure.

KEY INSIGHT: Structured output turns Gemini from a chatbot into a data extraction engine. With multimodal input, you build pipelines that take raw documents and output clean, typed records — no post-processing.

Exercise 2: Multi-Document Comparison

Write a function called compare_pdfs that takes two PDFs (base64 strings) and a comparison question. Send both in one request and return the model’s text analysis.

Hint 1

Both PDFs go into the same `parts` array as separate `inline_data` entries. Add a text part with your comparison question first.

Hint 2

The parts list needs three items: one text part, then two `inline_data` parts with `mime_type: "application/pdf"`. Reference "the first document" and "the second document" in your prompt.

Solution
def compare_pdfs(pdf1_b64, pdf2_b64, question):
    payload = {
        "contents": [{"parts": [
            {"text": f"I'm giving you two PDF documents. {question}"},
            {"inline_data": {"mime_type": "application/pdf", "data": pdf1_b64}},
            {"inline_data": {"mime_type": "application/pdf", "data": pdf2_b64}}
        ]}]
    }
    result = gemini_request("generateContent", payload)
    return result["candidates"][0]["content"]["parts"][0]["text"]

# Test (using same PDF twice — use different PDFs in practice)
answer = compare_pdfs(pdf_base64, pdf_base64, "What differences do you see?")
print(answer)

**Explanation:** Multiple documents go into the same `parts` array. Gemini processes them as a single context. The text prompt references both documents, and the model compares them side by side.

Building the Complete Document Analyzer

Let’s tie everything together. We’ll build a DocumentAnalyzer class that combines multimodal input, safety settings, optional grounding, and structured output.

The class wraps gemini_request with one method per document type. analyze_text handles plain text. analyze_image takes base64 image data. analyze_pdf takes base64 PDF data. Each method takes an optional JSON schema.

class DocumentAnalyzer:
    """Multimodal document analyzer powered by the Gemini API."""

    def __init__(self, api_key, model="gemini-2.5-flash"):
        self.api_key = api_key
        self.model = model
        self.base_url = (
            f"https://generativelanguage.googleapis.com/v1beta/models/{model}"
        )
        self.safety_settings = [
            {"category": c, "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
            for c in [
                "HARM_CATEGORY_HARASSMENT",
                "HARM_CATEGORY_HATE_SPEECH",
                "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                "HARM_CATEGORY_DANGEROUS_CONTENT"
            ]
        ]

    def _request(self, payload):
        url = f"{self.base_url}:generateContent?key={self.api_key}"
        data = json.dumps(payload).encode("utf-8")
        req = urllib.request.Request(
            url, data=data,
            headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

That’s the base — init and a private request method. Next come the payload builder and the three analysis methods. _build_payload puts together contents, safetySettings, and optional config.

    def _build_payload(self, parts, schema=None, grounding=False):
        payload = {
            "contents": [{"parts": parts}],
            "safetySettings": self.safety_settings
        }
        if schema:
            payload["generationConfig"] = {
                "responseMimeType": "application/json",
                "responseSchema": schema
            }
        if grounding:
            payload["tools"] = [{"google_search": {}}]
        return payload

    def analyze_text(self, prompt, schema=None, grounding=False):
        parts = [{"text": prompt}]
        result = self._request(self._build_payload(parts, schema, grounding))
        return self._parse(result, schema)

    def analyze_image(self, prompt, img_b64, mime="image/jpeg", schema=None):
        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": mime, "data": img_b64}}
        ]
        result = self._request(self._build_payload(parts, schema))
        return self._parse(result, schema)

    def analyze_pdf(self, prompt, pdf_b64, schema=None):
        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": "application/pdf", "data": pdf_b64}}
        ]
        result = self._request(self._build_payload(parts, schema))
        return self._parse(result, schema)

    def _parse(self, result, schema=None):
        candidate = result["candidates"][0]
        if candidate.get("finishReason") == "SAFETY":
            return {"error": "Response blocked by safety filters"}
        text = candidate["content"]["parts"][0]["text"]
        return json.loads(text) if schema else text

Quick test to confirm it works:

analyzer = DocumentAnalyzer(GEMINI_API_KEY)
print(analyzer.analyze_text("What is the capital of France?"))

Now let’s use it for a real task. We’ll analyze the image we downloaded earlier with a structured schema. The schema tells Gemini what fields to return. analyze_image handles the rest.

image_schema = {
    "type": "object",
    "properties": {
        "scene_description": {"type": "string"},
        "key_elements": {"type": "array", "items": {"type": "string"}},
        "use_case": {"type": "string"}
    },
    "required": ["scene_description", "key_elements"]
}

result = analyzer.analyze_image(
    prompt="Analyze this image for a content management system.",
    img_b64=image_base64,
    schema=image_schema
)
print(json.dumps(result, indent=2))

That’s a working document analyzer in about 80 lines. It takes text, images, and PDFs. It gives back structured JSON. It runs anywhere Python runs — no SDK needed.

Common Mistakes and How to Fix Them

Mistake 1: Forgetting to base64-encode binary data

Wrong:

# Passing raw bytes — JSON can't serialize this
payload = {"contents": [{"parts": [
    {"inline_data": {"mime_type": "image/png", "data": image_bytes}}
]}]}
# Result: TypeError or API error

Why it breaks: The data field needs a base64 string. Raw bytes can’t go into JSON.

Correct:

image_base64 = base64.b64encode(image_bytes).decode("utf-8")
payload = {"contents": [{"parts": [
    {"inline_data": {"mime_type": "image/png", "data": image_base64}}
]}]}

Mistake 2: Not checking the finish reason

Wrong:

# Assumes the response always has text content
text = result["candidates"][0]["content"]["parts"][0]["text"]

Why it breaks: If safety filters block the response, content may be missing. Your code crashes with a KeyError.

Correct:

candidate = result["candidates"][0]
if candidate.get("finishReason") == "SAFETY":
    print("Response blocked by safety filters")
else:
    print(candidate["content"]["parts"][0]["text"])

Mistake 3: Using v1 instead of v1beta

Wrong:

# v1 endpoint — missing newer features
url = "https://generativelanguage.googleapis.com/v1/models/gemini-2.5-flash:generateContent"

Why it fails: Grounding, structured output schemas, and the latest models live only in v1beta.

Correct:

url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent"

Gemini vs GPT-4 vs Claude: Multimodal API Comparison

How does Gemini stack up against other multimodal APIs? Here’s a quick comparison for document analysis tasks.

| Feature | Gemini 2.5 Flash | GPT-4o (OpenAI) | Claude 3.5 Sonnet |
|---|---|---|---|
| Text input | Yes | Yes | Yes |
| Image input | Yes (base64 + URI) | Yes (base64 + URL) | Yes (base64) |
| Native PDF input | Yes | No (convert to images) | Yes (beta) |
| Audio/video input | Yes | Audio only | No |
| JSON schema enforcement | Yes (responseSchema) | Yes (response_format) | Yes (tool use) |
| Search grounding | Built-in (google_search) | Requires plugins/tools | No built-in |
| Free tier | 500 req/day | No free tier | Limited free tier |
| Raw HTTP (no SDK) | Yes | Yes | Yes |

Gemini’s biggest edge: native PDF support and built-in search grounding. If your documents mix text and images, Gemini needs less prep work than the others.

GPT-4o has more third-party tools. Claude is great at long-context tasks. Pick based on what you need most.

When NOT to Use the Gemini API

Gemini is powerful, but it’s not always the right tool.

High-volume, low-latency processing. Need to process 10,000 images per minute? The API’s per-request latency (1-10 seconds) becomes a wall. Use dedicated models — YOLO for detection, Tesseract for OCR.

Need for exact same output every time. Responses vary between identical requests. If rules demand byte-identical results, use rule-based extraction or templates.

Data that can’t leave your network. Every request hits Google’s servers. For HIPAA or GDPR data, you need on-premise tools. Vertex AI offers data residency, but it’s a different (and pricier) product.

Simple documents with known layouts. If every invoice follows the same template, regex or PyMuPDF is faster, cheaper, and more reliable.

TIP: I use Gemini for the “messy middle” — documents too varied for rules but too few to justify a custom model. That’s where multimodal AI earns its cost.

Summary

You’ve built a multimodal document analyzer with the Gemini API’s raw HTTP endpoint. Here’s what you covered:

  • Text generation — Sending prompts and parsing the candidates response.
  • Multi-turn conversations — Using role fields to maintain context across turns.
  • Error handling — Catching HTTP errors gracefully with status-specific responses.
  • Image analysis — Base64 encoding images and mixing them with text in parts.
  • PDF processing — Same as images, with mime_type: "application/pdf".
  • Safety settings — Adjusting filter thresholds per harm category.
  • Grounding — Adding tools: [{"google_search": {}}] for live web data.
  • Structured output — Forcing JSON with responseMimeType and responseSchema.
  • DocumentAnalyzer class — A reusable wrapper combining all capabilities.

Practice Exercise

Build a “Research Assistant” function. It takes a topic, uses grounding to search the web, and returns JSON with summary (2-3 sentences), key_facts (string array), and sources (URLs from grounding metadata).

Solution
def research_topic(topic):
    schema = {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "key_facts": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["summary", "key_facts"]
    }
    payload = {
        "contents": [{"parts": [
            {"text": f"Research this topic briefly: {topic}"}
        ]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema
        },
        "tools": [{"google_search": {}}]
    }
    result = gemini_request("generateContent", payload)
    data = json.loads(
        result["candidates"][0]["content"]["parts"][0]["text"]
    )

    # Extract grounding sources
    sources = []
    meta = result["candidates"][0].get("groundingMetadata", {})
    for chunk in meta.get("groundingChunks", []):
        if "web" in chunk:
            sources.append(chunk["web"].get("uri", ""))
    data["sources"] = sources
    return data

output = research_topic("quantum computing breakthroughs 2026")
print(json.dumps(output, indent=2))

**Explanation:** This combines structured output (`responseSchema`) with grounding (`google_search`). The model searches the web, writes a JSON response matching our schema, and we extract source URLs from `groundingMetadata` separately — since the schema controls only the text output, not the metadata.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: Google Gemini API — Build a Multimodal Document Analyzer
# Requires: Python 3.9+ (standard library only)
# API Key: https://aistudio.google.com/app/apikey

import urllib.request
import json
import base64
from urllib.error import HTTPError, URLError

# --- Configuration ---
GEMINI_API_KEY = "YOUR_API_KEY_HERE"
MODEL = "gemini-2.5-flash"
BASE_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}"

# --- Helper ---
def gemini_request(endpoint, payload):
    url = f"{BASE_URL}:{endpoint}?key={GEMINI_API_KEY}"
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except HTTPError as e:
        print(f"HTTP {e.code}: {e.read().decode('utf-8', errors='replace')[:200]}")
        return None

# --- Text Generation ---
payload = {
    "contents": [{"parts": [
        {"text": "What are the three types of machine learning? One sentence each."}
    ]}]
}
result = gemini_request("generateContent", payload)
if result:
    print("=== Text ===")
    print(result["candidates"][0]["content"]["parts"][0]["text"])

# --- Multi-Turn ---
payload = {
    "contents": [
        {"role": "user", "parts": [{"text": "What is pandas in Python?"}]},
        {"role": "model", "parts": [{"text": "Pandas is a data analysis library."}]},
        {"role": "user", "parts": [{"text": "What's its main data structure?"}]}
    ]
}
result = gemini_request("generateContent", payload)
if result:
    print("\n=== Multi-Turn ===")
    print(result["candidates"][0]["content"]["parts"][0]["text"])

# --- Image Analysis ---
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/ARA_San_Juan_search_area.jpg/640px-ARA_San_Juan_search_area.jpg"
img_req = urllib.request.Request(image_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(img_req) as resp:
    image_bytes = resp.read()
image_base64 = base64.b64encode(image_bytes).decode("utf-8")

payload = {
    "contents": [{"parts": [
        {"text": "Describe this image in 2-3 sentences."},
        {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
    ]}]
}
result = gemini_request("generateContent", payload)
if result:
    print("\n=== Image ===")
    print(result["candidates"][0]["content"]["parts"][0]["text"])

# --- PDF Processing ---
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
pdf_req = urllib.request.Request(pdf_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(pdf_req) as resp:
    pdf_bytes = resp.read()
pdf_base64 = base64.b64encode(pdf_bytes).decode("utf-8")

payload = {
    "contents": [{"parts": [
        {"text": "Read this PDF and list its contents."},
        {"inline_data": {"mime_type": "application/pdf", "data": pdf_base64}}
    ]}]
}
result = gemini_request("generateContent", payload)
if result:
    print("\n=== PDF ===")
    print(result["candidates"][0]["content"]["parts"][0]["text"])

# --- Structured Output ---
payload = {
    "contents": [{"parts": [
        {"text": "Analyze this image."},
        {"inline_data": {"mime_type": "image/jpeg", "data": image_base64}}
    ]}],
    "generationConfig": {
        "responseMimeType": "application/json",
        "responseSchema": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "objects": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["description", "objects"]
        }
    }
}
result = gemini_request("generateContent", payload)
if result:
    print("\n=== Structured ===")
    structured = json.loads(result["candidates"][0]["content"]["parts"][0]["text"])
    print(json.dumps(structured, indent=2))

# --- DocumentAnalyzer Class ---
class DocumentAnalyzer:
    def __init__(self, api_key, model="gemini-2.5-flash"):
        self.api_key = api_key
        self.model = model
        self.base_url = (
            f"https://generativelanguage.googleapis.com/v1beta/models/{model}"
        )
        self.safety_settings = [
            {"category": c, "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
            for c in [
                "HARM_CATEGORY_HARASSMENT", "HARM_CATEGORY_HATE_SPEECH",
                "HARM_CATEGORY_SEXUALLY_EXPLICIT", "HARM_CATEGORY_DANGEROUS_CONTENT"
            ]
        ]

    def _request(self, payload):
        url = f"{self.base_url}:generateContent?key={self.api_key}"
        data = json.dumps(payload).encode("utf-8")
        req = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def _build_payload(self, parts, schema=None, grounding=False):
        payload = {
            "contents": [{"parts": parts}],
            "safetySettings": self.safety_settings
        }
        if schema:
            payload["generationConfig"] = {
                "responseMimeType": "application/json",
                "responseSchema": schema
            }
        if grounding:
            payload["tools"] = [{"google_search": {}}]
        return payload

    def analyze_text(self, prompt, schema=None, grounding=False):
        parts = [{"text": prompt}]
        return self._parse(
            self._request(self._build_payload(parts, schema, grounding)), schema
        )

    def analyze_image(self, prompt, img_b64, mime="image/jpeg", schema=None):
        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": mime, "data": img_b64}}
        ]
        return self._parse(
            self._request(self._build_payload(parts, schema)), schema
        )

    def analyze_pdf(self, prompt, pdf_b64, schema=None):
        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": "application/pdf", "data": pdf_b64}}
        ]
        return self._parse(
            self._request(self._build_payload(parts, schema)), schema
        )

    def _parse(self, result, schema=None):
        candidate = result["candidates"][0]
        if candidate.get("finishReason") == "SAFETY":
            return {"error": "Blocked by safety filters"}
        text = candidate["content"]["parts"][0]["text"]
        return json.loads(text) if schema else text

analyzer = DocumentAnalyzer(GEMINI_API_KEY)
print("\n=== Analyzer Test ===")
print(analyzer.analyze_text("Capital of Japan?"))
print("\nScript completed successfully.")

Frequently Asked Questions

How much does the Gemini API cost?

Gemini 2.5 Flash has a free tier: 500 requests per day. Beyond that, it costs about $0.15 per million input tokens and $0.60 per million output tokens. One page image uses about 258 tokens. Check Google’s pricing page — rates change often.
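Those per-token rates make back-of-envelope estimates easy. The sketch below uses the illustrative numbers quoted above; plug in the current rates from the pricing page before budgeting anything real.

```python
# Illustrative cost estimate using the rates quoted in this FAQ.
INPUT_PER_M = 0.15     # USD per 1M input tokens (check the live pricing page)
OUTPUT_PER_M = 0.60    # USD per 1M output tokens
TOKENS_PER_PAGE = 258  # approximate tokens for one page image

def estimate_cost(pages, prompt_tokens, output_tokens):
    input_tokens = pages * TOKENS_PER_PAGE + prompt_tokens
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# 100 ten-page PDFs, ~50-token prompts, ~500-token answers each:
cost = estimate_cost(pages=1000, prompt_tokens=100 * 50, output_tokens=100 * 500)
print(f"Estimated cost: ${cost:.2f}")
```

Even a thousand pages comes out to pennies at these rates, which is why batch document processing is a common use case.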

Can Gemini process audio and video files?

Yes. Use the same inline_data approach with mime_type set to audio/mp3, audio/wav, video/mp4, or similar. For files over 20MB, upload through the File API first and reference the file URI instead.
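The request shape is identical to the image and PDF examples earlier in the article; only the `mime_type` changes. A sketch (the audio bytes here are a placeholder, not real audio):

```python
import base64
import json

# Placeholder bytes stand in for a real audio file read from disk.
audio_base64 = base64.b64encode(b"...raw audio bytes...").decode("utf-8")

payload = {
    "contents": [{"parts": [
        {"text": "Transcribe this audio and summarize it in one sentence."},
        {"inline_data": {"mime_type": "audio/mp3", "data": audio_base64}}
    ]}]
}
print(json.dumps(payload)[:80])
```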

What’s the maximum size for inline_data?

About 20MB of decoded data. For bigger files, use the File API. Upload the file, get a URI back, then use {"file_data": {"file_uri": "...", "mime_type": "..."}} in your request.
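Referencing an uploaded file looks like this. The URI below is a made-up placeholder; a real one comes back from the File API upload response.

```python
# Sketch: a file_data part replaces inline_data once the file is uploaded.
# The file_uri value is a fabricated example, not a real resource.
payload = {
    "contents": [{"parts": [
        {"text": "Summarize this video."},
        {"file_data": {
            "file_uri": "https://generativelanguage.googleapis.com/v1beta/files/abc123",
            "mime_type": "video/mp4"
        }}
    ]}]
}
```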

How does Gemini compare to GPT-4 Vision for documents?

Gemini has three edges: native PDF support, built-in Google Search grounding, and strict JSON schemas. GPT-4 Vision needs extra tools for grounding and can’t take PDFs directly.

Is the v1beta endpoint stable enough for production?

Yes, for most purposes. v1beta is the default endpoint in Google's own documentation and SDKs, and most production users rely on it. The v1 stable endpoint exists but lags behind on newer features such as grounding and strict response schemas. Check the Gemini docs for current guidance.
