Long-Context LLMs: Sliding Window Summarization Guide

Process documents that exceed your model’s context window using sliding windows, map-reduce, and recursive summarization — with raw HTTP calls you can run in the browser.
You pasted a 50-page research paper into ChatGPT. It said “this text is too long.” You trimmed it by hand, lost key sections, and got a weak summary. Sound familiar?
Every LLM has a context window — a hard ceiling on tokens per request. GPT-4o handles 128K tokens. Claude handles 200K. But legal contracts, annual reports, and codebases regularly blow past those limits. Even when they fit, stuffing everything into one prompt wastes money and hurts quality.
The fix? Don’t send the whole document at once. Break it into chunks, summarize each one, then combine. That’s sliding window summarization.
Before we write any code, here’s how the full pipeline works.
You start with a long document — too large for one API call. First, you split it into overlapping chunks using a sliding window. The overlap keeps context connected at chunk boundaries. Each chunk fits in one API call.
Next, you send each chunk to the LLM and get a summary back. That’s the “map” step — applying a summarization function to every chunk.
Then you combine those chunk summaries into one final summary. That’s the “reduce” step. If the combined summaries are still too long, you reduce again — recursively — until the result fits in a single prompt.
The result? A coherent summary of a document 10x or 100x longer than your model’s context window. You control chunk size, overlap, and the prompt at every step.
What Is the Context Window Problem?
import json
import urllib.request
import re
import time
def estimate_tokens(text):
"""Estimate token count. Roughly 1 token per 4 characters."""
return len(text) // 4
def generate_long_document(num_sections=20):
"""Create a synthetic long document for testing."""
topics = [
"neural network architecture", "gradient descent optimization",
"transformer attention mechanisms", "tokenization strategies",
"embedding representations", "loss function design",
"regularization techniques", "batch normalization",
"learning rate scheduling", "model evaluation metrics",
"data augmentation methods", "transfer learning approaches",
"fine-tuning best practices", "prompt engineering patterns",
"retrieval augmented generation", "vector database indexing",
"context window management", "inference optimization",
"model quantization methods", "deployment strategies"
]
sections = []
for i, topic in enumerate(topics[:num_sections]):
paragraph = (
f"Section {i+1}: {topic.title()}. "
f"This section covers the fundamentals of {topic} "
f"in modern machine learning systems. "
f"Understanding {topic} is critical for building "
f"production-grade AI applications. "
f"The key principles include proper configuration, "
f"monitoring performance metrics, and iterating. "
f"Teams that master {topic} deliver more reliable "
f"and efficient models. Common pitfalls include "
f"over-engineering and ignoring edge cases. "
f"Best practices suggest starting simple, measuring "
f"everything, and scaling only when needed. "
f"In production, {topic} requires careful attention "
f"to latency, throughput, and cost."
)
sections.append(paragraph)
return "\n\n".join(sections)
document = generate_long_document()
token_count = estimate_tokens(document)
print(f"Document length: {len(document)} characters")
print(f"Estimated tokens: {token_count}")
print(f"First 200 chars: {document[:200]}...")
A context window is the maximum number of tokens an LLM can handle in one request. Exceed it and the API returns an error. Stay within it but push close, and you burn through your token budget fast.
Here’s what trips people up. Even when your document fits, cramming everything into one giant prompt produces worse summaries. The model’s attention spreads thin. Important details in the middle get lost — researchers call this the “lost in the middle” problem.
Key Insight: Splitting a long document into focused chunks and summarizing each one separately produces better summaries than stuffing everything into a single prompt — even when the document fits within the context window.
Prerequisites
- Python version: 3.10+
- Required libraries: none beyond the standard library (urllib, json, re)
- API key: an OpenAI API key (set as OPENAI_API_KEY). You can swap in Claude or Gemini with minimal changes.
- Pyodide compatible: yes (all HTTP calls use urllib.request)
- Time to complete: ~25 minutes
- Cost: a few cents in API calls (chunks are small)
How Does the Sliding Window Approach Work?
Think about reading a book through a narrow window that slides down the page. You see a few paragraphs at a time. Each slide moves forward — but keeps some overlap with what you already read. That overlap connects the context.
The sliding window chunker takes three parameters: the text, the chunk size (in characters), and the overlap size. Without overlap, you’d split ideas at the boundary and each chunk would miss context from its neighbors.
Here’s the chunker. It walks through the text in steps of chunk_size - overlap. Each new chunk starts partway through the previous one, creating that critical overlap.
def sliding_window_chunk(text, chunk_size=2000, overlap=200):
"""Split text into overlapping chunks.
Args:
text: The full document text
chunk_size: Characters per chunk
overlap: Characters shared between consecutive chunks
Returns:
List of text chunks with overlap
"""
if len(text) <= chunk_size:
return [text]
chunks = []
step = chunk_size - overlap
start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final window reached; avoids a duplicate tail chunk
        start += step
    return chunks
chunks = sliding_window_chunk(document, chunk_size=2000, overlap=300)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
tokens = estimate_tokens(chunk)
print(f" Chunk {i+1}: {len(chunk)} chars, ~{tokens} tokens")
Each chunk shares 300 characters with its neighbor — roughly 75 tokens of shared context. That overlap prevents the summarizer from missing ideas that span chunk boundaries.
Tip: How much overlap should you use? Start with 10-15% of your chunk size. Too little and you lose boundary context. Too much and you waste tokens re-processing text. For most documents, 200-400 characters works well.
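To confirm the overlap really connects consecutive chunks, here is a small self-contained check. The chunker is re-declared in minimal form so the snippet runs on its own:

```python
def sliding_window_chunk(text, chunk_size=2000, overlap=200):
    """Fixed-size windows with overlap (minimal re-declaration)."""
    if len(text) <= chunk_size:
        return [text]
    chunks, step, start = [], chunk_size - overlap, 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # reached the end of the text
        start += step
    return chunks

def verify_overlap(chunks, overlap):
    """Each chunk (after the first) should start with the previous chunk's tail."""
    return all(prev[-overlap:] == curr[:overlap]
               for prev, curr in zip(chunks, chunks[1:]))

text = "abcdefghij" * 100  # 1,000 characters of predictable content
chunks = sliding_window_chunk(text, chunk_size=300, overlap=50)
print(len(chunks), verify_overlap(chunks, 50))  # → 4 True
```

A check like this is cheap insurance when you tweak chunking parameters: if it ever prints False, your map phase is silently losing boundary context.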
Sentence-Aware Chunking
But what if you need sentence-level precision? Cutting mid-sentence produces garbage summaries. Here’s a smarter chunker that finds the nearest sentence boundary before each cut point.
def smart_sliding_window(text, chunk_size=2000, overlap=300):
"""Sliding window that respects sentence boundaries."""
if len(text) <= chunk_size:
return [text]
chunks = []
step = chunk_size - overlap
start = 0
while start < len(text):
end = min(start + chunk_size, len(text))
if end < len(text):
last_period = text.rfind('. ', start, end)
if last_period > start + step // 2:
end = last_period + 2
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # reached the end; avoids emitting a duplicate tail chunk
        next_start = end - overlap
        if next_start <= start:
            next_start = start + step
        start = next_start
    return chunks
smart_chunks = smart_sliding_window(document)
print(f"Smart chunks: {len(smart_chunks)}")
for i, chunk in enumerate(smart_chunks[:3]):
print(f" Chunk {i+1}: ends with '...{chunk[-40:]}'")
The smart chunker finds the nearest period-space before cutting. Your summaries come out cleaner because the LLM gets complete sentences every time.
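To see the difference concretely, here is a self-contained sketch (both chunkers re-declared in minimal form) that compares how each one ends its first chunk on a synthetic text:

```python
def fixed_chunks(text, chunk_size, overlap):
    """Naive fixed-size windows: may cut mid-word or mid-sentence."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

def sentence_chunks(text, chunk_size, overlap):
    """Back each cut up to the nearest '. ' boundary (same idea as above)."""
    step, chunks, start = chunk_size - overlap, [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            cut = text.rfind('. ', start, end)
            if cut > start + step // 2:
                end = cut + 2
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + step)
    return chunks

text = "The quick brown fox jumps over the lazy dog near the river bank. " * 20
print(repr(fixed_chunks(text, 200, 40)[0][-12:]))     # cut mid-word
print(repr(sentence_chunks(text, 200, 40)[0][-12:]))  # ends at a sentence boundary
```

The naive chunk ends wherever the character count lands; the sentence-aware chunk always ends on a period, which is what keeps downstream summaries clean.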
How Do You Build the Map Phase?
With chunks ready, you send each one to an LLM for summarization. This is the “map” in map-reduce — applying the same function to every chunk independently.
We’ll use raw HTTP calls to the OpenAI API. No SDKs, no frameworks. This keeps the code portable to any environment, including Pyodide.
The call_llm function wraps a single API call. It sends a system prompt and user message, then returns the response. We use temperature=0.3 for consistent, focused summaries.
def call_llm(system_prompt, user_message, api_key,
model="gpt-4o-mini", temperature=0.3):
"""Make a raw HTTP call to OpenAI's chat completions API."""
url = "https://api.openai.com/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {api_key}"
}
payload = {
"model": model,
"temperature": temperature,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
]
}
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(url, data=data, headers=headers)
with urllib.request.urlopen(req, timeout=60) as resp:
result = json.loads(resp.read().decode("utf-8"))
return result["choices"][0]["message"]["content"]
That’s the building block. Every strategy we build calls this function.
The map function takes a list of chunks and summarizes each one. The prompt preserves key facts, numbers, and conclusions — the details that matter most.
def map_summarize(chunks, api_key, model="gpt-4o-mini"):
"""Summarize each chunk independently (map phase)."""
system_prompt = (
"You are a precise summarizer. Condense the text into "
"a clear summary that preserves all key facts, numbers, "
"names, and conclusions. Keep it under 150 words."
)
summaries = []
for i, chunk in enumerate(chunks):
prompt = f"Summarize this text:\n\n{chunk}"
summary = call_llm(system_prompt, prompt, api_key, model)
summaries.append(summary)
print(f" Chunk {i+1}/{len(chunks)} summarized "
f"({estimate_tokens(summary)} tokens)")
return summaries
Each chunk gets its own API call. The summaries return as a list — one per chunk, in order.
Warning: Rate limits will bite you on large documents. If you’re processing 50+ chunks, add a `time.sleep(0.5)` between calls. OpenAI’s rate limits vary by tier — check your dashboard before running bulk jobs.
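A minimal throttling wrapper, sketched here with a pluggable summarize function so it runs without an API key (in the real pipeline you would pass a closure around call_llm):

```python
import time

def throttled_map(chunks, summarize_fn, delay=0.5):
    """Apply summarize_fn to each chunk, pausing between API calls."""
    out = []
    for i, chunk in enumerate(chunks):
        out.append(summarize_fn(chunk))
        if i < len(chunks) - 1:  # no need to sleep after the last call
            time.sleep(delay)
    return out

# Stand-in summarizer for a dry run (swap in a real call_llm wrapper):
fake_summarize = lambda c: c[:50]
result = throttled_map(["a" * 100, "b" * 100], fake_summarize, delay=0.01)
print(len(result))  # → 2
```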
How Does Map-Reduce Combine the Results?
You’ve got chunk summaries. The reduce step merges them into one coherent final summary.
The approach: concatenate all summaries and ask the LLM to synthesize them. If the concatenated summaries exceed the context window, chunk the summaries themselves and reduce again. That’s the recursive safety net.
The reduce_summarize function checks whether combined summaries fit in one prompt. If they don’t, it chunks and reduces again — automatically.
def reduce_summarize(summaries, api_key, model="gpt-4o-mini",
max_context_chars=6000):
"""Combine chunk summaries into a final summary.
Recursively reduces if combined summaries are too long.
"""
combined = "\n\n".join(summaries)
if len(combined) <= max_context_chars:
system_prompt = (
"You are a document summarizer. Combine these section "
"summaries into one coherent summary. Eliminate "
"redundancy. Preserve key facts and conclusions."
)
prompt = (
f"Combine these {len(summaries)} section summaries "
f"into one final summary:\n\n{combined}"
)
return call_llm(system_prompt, prompt, api_key, model)
print(f" Combined too long ({len(combined)} chars), "
f"reducing recursively...")
sub_chunks = sliding_window_chunk(
combined, chunk_size=max_context_chars, overlap=500
)
sub_summaries = map_summarize(sub_chunks, api_key, model)
return reduce_summarize(sub_summaries, api_key, model,
max_context_chars)
Here’s the full pipeline function that ties chunk, map, and reduce together.
def map_reduce_pipeline(document, api_key, chunk_size=2000,
overlap=300, model="gpt-4o-mini"):
"""Full map-reduce summarization pipeline."""
print("Step 1: Chunking document...")
chunks = smart_sliding_window(document, chunk_size, overlap)
print(f" Created {len(chunks)} chunks\n")
print("Step 2: Summarizing chunks (map)...")
summaries = map_summarize(chunks, api_key, model)
print(f" Got {len(summaries)} summaries\n")
print("Step 3: Combining summaries (reduce)...")
final_summary = reduce_summarize(summaries, api_key, model)
print(" Done!\n")
return final_summary, summaries
Three steps. Chunk, map, reduce. If the reduce step hits summaries that are too long, it recurses. You don’t manage that edge case yourself.
Key Insight: Map-reduce summarization mirrors the distributed computing pattern. Map = process parts independently. Reduce = combine results. The recursion handles arbitrarily long documents because each round shrinks the total text.
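Before spending API credits, you can dry-run the same flow with a stand-in for call_llm. The mock below is a hypothetical offline substitute that "summarizes" by keeping the first sentence, which is enough to watch chunks flow through map and reduce:

```python
def mock_llm(system_prompt, user_message):
    """Offline stand-in for call_llm: keep only the first sentence."""
    body = user_message.split(":\n\n", 1)[-1]
    return body.split(".")[0] + "."

chunks = ["First fact. More detail here.", "Second fact. Extra words."]

# Map: summarize each chunk independently.
summaries = [mock_llm("sys", f"Summarize this text:\n\n{c}") for c in chunks]

# Reduce: combine the chunk summaries into one result.
combined = "\n\n".join(summaries)
final = mock_llm("sys", f"Combine summaries:\n\n{combined}")

print(summaries)  # → ['First fact.', 'Second fact.']
print(final)      # → First fact.
```

Swapping mock_llm for the real call_llm is the only change needed to run this against the API.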
What Is Recursive Summarization?
Map-reduce runs one map pass, then one reduce. Recursive summarization works differently — it compresses in layers, like zooming out on a map.
First, summarize chunks into paragraph-level summaries. Then summarize those paragraphs into section-level summaries. Keep going until the result fits in one prompt. Each layer compresses by a fixed ratio.
Why bother? Two reasons. It preserves hierarchical structure better. And it handles truly massive documents — full books, entire codebases — where even the first round of summaries is too long for one reduce call.
def recursive_summarize(document, api_key,
chunk_size=2000, overlap=300,
target_length=500,
model="gpt-4o-mini", depth=0):
"""Recursively summarize until result fits target_length."""
indent = " " * depth
print(f"{indent}Depth {depth}: "
f"length = {len(document)} chars")
if len(document) <= target_length:
print(f"{indent} -> Fits target. Done.")
return document
chunks = smart_sliding_window(document, chunk_size, overlap)
print(f"{indent} -> Split into {len(chunks)} chunks")
summaries = []
for chunk in chunks:
system_prompt = (
"Condense this text to about one-third its length. "
"Preserve key facts, numbers, and conclusions."
)
summary = call_llm(
system_prompt, f"Summarize:\n\n{chunk}",
api_key, model
)
summaries.append(summary)
combined = "\n\n".join(summaries)
print(f"{indent} -> Combined: {len(combined)} chars")
return recursive_summarize(
combined, api_key, chunk_size, overlap,
target_length, model, depth + 1
)
Watch the depth counter. Each recursion prints its level. At roughly 3x compression per pass, a 50,000-character document takes four or five rounds to reach a 500-character summary. You set the target length and let it run.
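The depth behavior is easy to model offline. Assuming each pass compresses by roughly 3x (matching the "one-third its length" prompt above), a toy calculation shows how the number of rounds scales with document size:

```python
def rounds_needed(doc_len, ratio=3, target=500):
    """Toy model: passes needed if each pass shrinks text by `ratio`."""
    rounds = 0
    while doc_len > target:
        doc_len //= ratio
        rounds += 1
    return rounds

for size in (5_000, 50_000, 500_000):
    print(f"{size:>7} chars -> {rounds_needed(size)} rounds")
```

Because each round divides the length, rounds grow logarithmically: a 10x larger document only adds a couple of passes.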
Which Strategy Should You Choose?
Here’s a comparison function that runs both strategies on the same document.
def compare_strategies(document, api_key, model="gpt-4o-mini"):
"""Run map-reduce and recursive on the same document."""
results = {}
print("=" * 50)
print("STRATEGY 1: Map-Reduce")
print("=" * 50)
start = time.time()
mr_summary, mr_chunks = map_reduce_pipeline(
document, api_key, 2000, 300, model
)
results["map_reduce"] = {
"summary": mr_summary,
"api_calls": len(mr_chunks) + 1,
"time": time.time() - start,
"length": len(mr_summary)
}
print("=" * 50)
print("STRATEGY 2: Recursive")
print("=" * 50)
start = time.time()
rec_summary = recursive_summarize(
document, api_key, 2000, 300, 800, model
)
results["recursive"] = {
"summary": rec_summary,
"time": time.time() - start,
"length": len(rec_summary)
}
return results
Use compare_strategies(document, api_key) to see both in action. Map-reduce is faster for moderate documents — fewer total API calls. Recursive shines on very long documents where chunk summaries themselves overflow.
Here’s the decision guide:
| Scenario | Best Strategy | Why |
|---|---|---|
| 1-2 chunks | Direct summarization | No chunking overhead needed |
| 5-50 chunks | Map-reduce | One map + one reduce is efficient |
| 50+ chunks | Recursive | Handles any length gracefully |
| Hierarchical structure needed | Recursive | Natural multi-level compression |
| Speed over perfection | Map-reduce | Fewer total API calls |
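The decision guide collapses into a small helper. This is a heuristic sketch mirroring the table, not a hard rule:

```python
def choose_strategy(num_chunks, need_hierarchy=False):
    """Pick a summarization strategy per the decision guide above."""
    if need_hierarchy or num_chunks > 50:
        return "recursive"       # handles any length; multi-level compression
    if num_chunks <= 2:
        return "direct"          # no chunking overhead needed
    return "map_reduce"          # one map + one reduce is efficient

for n in (1, 20, 120):
    print(n, "->", choose_strategy(n))  # → direct, map_reduce, recursive
```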
Tip: Cut API costs with a hybrid approach. Use `gpt-4o-mini` for the map phase (cheap bulk work) and `gpt-4o` for the final reduce step (quality matters most there). You get the best of both worlds.
Exercise 1: Build a Token-Aware Chunker

The sliding_window_chunk function splits by character count, but API billing uses tokens. Write a function token_aware_chunk(text, max_tokens=500, overlap_tokens=50) that converts token limits to character limits (1 token ≈ 4 characters) and calls sliding_window_chunk. Print the chunk count and each chunk’s estimated token count. For the test document this should produce 7 chunks; a short string should produce 1.

Starter code:

def token_aware_chunk(text, max_tokens=500, overlap_tokens=50):
    """Split text into chunks based on token count."""
    chunk_size_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    # YOUR CODE: call sliding_window_chunk with char equivalents
    pass

chunks = token_aware_chunk(document, max_tokens=500, overlap_tokens=50)
print(f"Number of chunks: {len(chunks)}")
for i, c in enumerate(chunks):
    print(f"  Chunk {i+1}: ~{estimate_tokens(c)} tokens")

Hint: convert tokens to characters by multiplying by 4, then pass both values to sliding_window_chunk.

Solution:

def token_aware_chunk(text, max_tokens=500, overlap_tokens=50):
    """Split text into chunks based on token count."""
    chunk_size_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    return sliding_window_chunk(text, chunk_size_chars, overlap_chars)

Since the estimator uses 1 token per 4 characters, multiplying token limits by 4 gives character limits; then reuse sliding_window_chunk. This keeps chunking aligned with API billing.
How Do You Handle Edge Cases in Production?
Real documents aren’t clean text files. They have headers, tables, code blocks, and formatting that can trip up your chunker. Here are the patterns that cause problems.
Structure-Aware Chunking
The first issue: chunk boundaries splitting key information. A table spanning a boundary produces two useless halves. The fix — detect structural markers like headers and paragraph breaks, and split there instead.
def structure_aware_chunker(text, chunk_size=2000, overlap=300):
"""Chunk text at paragraph/section boundaries."""
sections = re.split(r'\n\n+', text)
chunks = []
current_chunk = ""
for section in sections:
if (len(current_chunk) + len(section) > chunk_size
and current_chunk):
chunks.append(current_chunk.strip())
current_chunk = current_chunk[-overlap:] + "\n\n"
current_chunk += section + "\n\n"
if current_chunk.strip():
chunks.append(current_chunk.strip())
return chunks
struct_chunks = structure_aware_chunker(document)
print(f"Structure-aware chunks: {len(struct_chunks)}")
for i, c in enumerate(struct_chunks[:3]):
print(f" Chunk {i+1}: {len(c)} chars")
Filtering Tiny Chunks
Empty or near-empty chunks waste API calls and add noise. Filter them before the map phase.
def filter_chunks(chunks, min_chars=100):
"""Remove chunks too short to summarize."""
filtered = [c for c in chunks if len(c) >= min_chars]
removed = len(chunks) - len(filtered)
if removed:
print(f"Filtered {removed} chunks under {min_chars} chars")
return filtered
clean_chunks = filter_chunks(struct_chunks, min_chars=100)
print(f"After filtering: {len(clean_chunks)} chunks")
Relevance Scoring
Not every chunk matters equally. A conclusion matters more than boilerplate acknowledgements. A simple heuristic scores chunks by signal words — chunks with higher scores get a longer summary allocation.
def score_chunk_relevance(chunk):
"""Score chunk importance using signal words."""
signals = [
"conclusion", "result", "finding", "significant",
"important", "key", "critical", "recommend",
"therefore", "in summary", "we found"
]
text_lower = chunk.lower()
return sum(1 for w in signals if w in text_lower)
for i, chunk in enumerate(clean_chunks[:5]):
score = score_chunk_relevance(chunk)
print(f"Chunk {i+1}: relevance = {score}")
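One hypothetical way to turn those scores into allocations: give higher-scoring chunks a larger word budget for their summaries. The base, per-signal, and cap values below are illustrative assumptions, not part of the original pipeline:

```python
def summary_word_budget(score, base=80, per_signal=30, cap=200):
    """Map a relevance score to a target summary length in words
    (illustrative allocation scheme)."""
    return min(base + score * per_signal, cap)

for score in (0, 1, 3, 6):
    print(f"score {score} -> budget {summary_word_budget(score)} words")
```

In the map phase you would then interpolate the budget into the prompt, e.g. f"Keep it under {budget} words."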
Resilient API Calls
One timeout shouldn’t kill your entire pipeline. Add retry logic with exponential backoff.
def resilient_call(system, message, api_key, retries=3):
"""Call LLM with retry logic for production use."""
for attempt in range(retries):
try:
return call_llm(system, message, api_key)
except Exception as e:
if attempt < retries - 1:
wait = 2 ** attempt
print(f" Retry {attempt+1} in {wait}s: {e}")
time.sleep(wait)
else:
return f"[SUMMARY FAILED: {str(e)[:100]}]"
Warning: Don’t skip overlap in production. It’s tempting to set overlap to 0 for speed. Without overlap, you miss every insight that spans two chunks. A 10-15% overlap is cheap insurance.
Exercise 2: Track Compression Metadata

Write a function map_with_metadata(chunks) that simulates summarization (take the first 25% of each chunk) and returns a list of dicts with: chunk_index (int), original_length (chars), summary_length (chars), and compression_ratio (original / summary, rounded to 1 decimal). Print each chunk’s metadata.

Starter code:

def map_with_metadata(chunks):
    """Summarize chunks and track metadata."""
    results = []
    for i, chunk in enumerate(chunks):
        summary = chunk[:len(chunk) // 4]  # simulate summary
        # YOUR CODE: build dict with chunk_index, original_length,
        # summary_length, compression_ratio
        pass
    return results

chunks = sliding_window_chunk(document, chunk_size=2000, overlap=300)
meta = map_with_metadata(chunks)
for m in meta:
    print(f"Chunk {m['chunk_index']}: {m['original_length']} -> "
          f"{m['summary_length']} chars ({m['compression_ratio']}x)")

Hint: append {"chunk_index": i, "original_length": len(chunk), "summary_length": len(summary), "compression_ratio": round(len(chunk) / len(summary), 1)} to results.

Solution:

def map_with_metadata(chunks):
    """Summarize chunks and track metadata."""
    results = []
    for i, chunk in enumerate(chunks):
        summary = chunk[:len(chunk) // 4]
        results.append({
            "chunk_index": i,
            "original_length": len(chunk),
            "summary_length": len(summary),
            "compression_ratio": round(len(chunk) / len(summary), 1)
        })
    return results

Each chunk gets a metadata dict tracking index, sizes, and compression ratio. In production, this helps spot chunks that compressed poorly or produced suspiciously short summaries.
When Should You NOT Use Sliding Window Summarization?
This approach isn’t always the right call. Here are three cases where you should pick a different strategy.
When the document fits comfortably in context. If your document is 50K tokens and your model handles 128K, just send it directly. Chunking adds complexity and API calls for no benefit.
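A cheap guard before invoking the pipeline: estimate tokens and compare against the window, keeping a reserve for the prompt and output. The 4-chars-per-token estimate and 30% reserve below are assumptions you should tune:

```python
def fits_in_context(text, context_tokens=128_000, reserve=0.3):
    """True if the document fits with `reserve` fraction left for prompt/output."""
    est_tokens = len(text) // 4  # rough 4-chars-per-token estimate
    return est_tokens <= int(context_tokens * (1 - reserve))

print(fits_in_context("x" * 200_000))  # ~50K tokens vs ~89K budget → True
print(fits_in_context("x" * 600_000))  # ~150K tokens → False
```

If this returns True, summarize directly; only fall back to chunking when it returns False.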
When order and nuance matter more than coverage. Sliding window summarization compresses aggressively. If you need to preserve the exact argument flow of a legal brief or the narrative arc of a story, consider extractive summarization instead — it pulls key sentences verbatim rather than rewriting.
When you need real-time speed. Each chunk requires a separate API call. A 100-chunk document means 100+ API calls. If you need sub-second responses (like in a chat interface), use truncation or pre-computed summaries. Sliding window summarization is a batch operation, not a real-time one.
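To put numbers on that, here is a back-of-envelope call counter. It assumes each reduce call can merge about ten summaries, which is an illustrative figure:

```python
import math

def mapreduce_calls(num_chunks, summaries_per_reduce=10):
    """Total API calls for map-reduce: one per chunk, plus reduce rounds."""
    calls, items = num_chunks, num_chunks
    while items > 1:
        items = math.ceil(items / summaries_per_reduce)
        calls += items
    return calls

print(mapreduce_calls(100))  # 100 map calls + 10 + 1 reduce calls → 111
```

Even with perfect parallelism on the map phase, the reduce rounds are sequential, so latency stays well above interactive thresholds.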
Note: Two alternatives worth knowing. For documents that fit in context but need focused summaries, try chain-of-density prompting, which progressively adds detail to a summary. For code-heavy documents, consider AST-based chunking that splits at function or class boundaries rather than character counts.
Common Mistakes and How to Fix Them
Mistake 1: No overlap between chunks
❌ Wrong:
bad_chunks = sliding_window_chunk(
document, chunk_size=2000, overlap=0
)
Ideas spanning chunk boundaries get split. The summarizer sees half a thought in each chunk. The final summary misses those ideas.
✅ Correct:
good_chunks = sliding_window_chunk(
document, chunk_size=2000, overlap=300
)
Mistake 2: No error handling on API calls
❌ Wrong:
for chunk in chunks:
summary = call_llm(system, chunk, api_key) # one failure kills all
One rate limit error kills the pipeline. You lose every summary already generated.
✅ Correct:
for chunk in chunks:
summary = resilient_call(system, chunk, api_key, retries=3)
Mistake 3: Same prompt for map and reduce
❌ Wrong:
prompt = "Summarize this text." # used for both phases
The map phase extracts facts from raw text. The reduce phase synthesizes and deduplicates across summaries. Different jobs need different instructions.
✅ Correct:
map_prompt = "Extract key facts, numbers, and conclusions."
reduce_prompt = ("Combine these section summaries into one "
"coherent summary. Eliminate redundancy.")
The Complete Pipeline Class
Here’s everything wrapped in a class. Configure once, then call summarize() on any document. This uses structure-aware chunking, filtering, and your choice of strategy.
class SlidingWindowSummarizer:
"""Production-ready sliding window summarizer."""
def __init__(self, api_key, model="gpt-4o-mini",
chunk_size=2000, overlap=300):
self.api_key = api_key
self.model = model
self.chunk_size = chunk_size
self.overlap = overlap
def summarize(self, document, strategy="map_reduce"):
"""Summarize a document with the chosen strategy."""
print(f"Strategy: {strategy}")
print(f"Document: {len(document)} chars, "
f"~{estimate_tokens(document)} tokens\n")
chunks = structure_aware_chunker(
document, self.chunk_size, self.overlap
)
chunks = filter_chunks(chunks, min_chars=100)
if strategy == "map_reduce":
sums = map_summarize(chunks, self.api_key, self.model)
return reduce_summarize(sums, self.api_key, self.model)
elif strategy == "recursive":
return recursive_summarize(
document, self.api_key,
self.chunk_size, self.overlap,
target_length=800, model=self.model
)
# Usage (requires API key):
# summarizer = SlidingWindowSummarizer(api_key="sk-...")
# result = summarizer.summarize(document, "map_reduce")
# print(result)
Summary
You’ve built a complete sliding window summarization pipeline. Here’s what you know now:
- Sliding window chunking splits documents into overlapping pieces. The overlap prevents losing ideas at boundaries.
- Map-reduce summarization processes chunks independently, then combines. Fast for moderate documents.
- Recursive summarization compresses in layers until the result fits. Handles arbitrarily long documents.
- Structure-aware chunking respects paragraph and section boundaries for cleaner results.
- Production hardening means retry logic, chunk filtering, and relevance scoring.
Use map-reduce for most jobs. Switch to recursive when chunk summaries themselves overflow.
Frequently Asked Questions
How do you choose the right chunk size for your document?
Start with your model’s context limit. Subtract the system prompt and expected output length. Use 60-70% of what remains. For GPT-4o-mini at 128K context, that’s roughly 70K tokens per chunk. But smaller chunks (500-1000 tokens) often produce better summaries because the model focuses on less text.
Can you use this pipeline with Claude or Gemini instead of OpenAI?
Absolutely. Change the URL and headers in call_llm. For Claude, use https://api.anthropic.com/v1/messages with x-api-key and anthropic-version headers. For Gemini, use the Google AI endpoint. The chunking and pipeline logic stays identical.
How do you handle documents with images or tables?
Extract text first using a document parser — pymupdf for PDFs, BeautifulSoup for HTML. Convert tables to markdown format before chunking. Images require a multimodal model (GPT-4o vision or Claude 3.5) in a separate step.
What’s the maximum document size this pipeline handles?
No hard limit. Recursive summarization handles any length because each round compresses. A million-word document might need 4-5 rounds. The practical limit is cost and API rate limits, not the pipeline itself.
References
- OpenAI API documentation — Chat Completions.
- Anthropic documentation — Messages API.
- Liu, N. F. et al. — “Lost in the Middle: How Language Models Use Long Contexts.” TACL (2024).
- Chang, Y. et al. — “BooookScore: A systematic exploration of book-length summarization.” EMNLP (2024).
- OpenAI — tiktoken tokenizer library.
- Dean, J. and Ghemawat, S. — “MapReduce: Simplified Data Processing on Large Clusters.” OSDI (2004).
- Google AI documentation — Gemini API.