Long-Context LLMs: Sliding Window Summarization Guide

Process documents that exceed your model’s context window using sliding windows, map-reduce, and recursive summarization — with raw HTTP calls you can run in the browser.
You pasted a 50-page research paper into ChatGPT. It said “this text is too long.” You trimmed it by hand, lost key sections, and got a weak summary. Sound familiar?
Every LLM has a context window — a hard ceiling on tokens per request. GPT-4o handles 128K tokens. Claude handles 200K. But legal contracts, annual reports, and codebases regularly blow past those limits. Even when they fit, stuffing everything into one prompt wastes money and hurts quality.
The fix? Don’t send the whole document at once. Break it into chunks, summarize each one, then combine. That’s sliding window summarization.
Before we write any code, here’s how the full pipeline works.
You start with a long document — too large for one API call. First, you split it into overlapping chunks using a sliding window. The overlap keeps context connected at chunk boundaries. Each chunk fits in one API call.
Next, you send each chunk to the LLM and get a summary back. That’s the “map” step — applying a summarization function to every chunk.
Then you combine those chunk summaries into one final summary. That’s the “reduce” step. If the combined summaries are still too long, you reduce again — recursively — until the result fits in a single prompt.
The result? A coherent summary of a document 10x or 100x longer than your model’s context window. You control chunk size, overlap, and the prompt at every step.
What Is the Context Window Problem?
import json
import urllib.request
import re
import time
def estimate_tokens(text):
"""Estimate token count. Roughly 1 token per 4 characters."""
return len(text) // 4
def generate_long_document(num_sections=20):
"""Create a synthetic long document for testing."""
topics = [
"neural network architecture", "gradient descent optimization",
"transformer attention mechanisms", "tokenization strategies",
"embedding representations", "loss function design",
"regularization techniques", "batch normalization",
"learning rate scheduling", "model evaluation metrics",
"data augmentation methods", "transfer learning approaches",
"fine-tuning best practices", "prompt engineering patterns",
"retrieval augmented generation", "vector database indexing",
"context window management", "inference optimization",
"model quantization methods", "deployment strategies"
]
sections = []
for i, topic in enumerate(topics[:num_sections]):
paragraph = (
f"Section {i+1}: {topic.title()}. "
f"This section covers the fundamentals of {topic} "
f"in modern machine learning systems. "
f"Understanding {topic} is critical for building "
f"production-grade AI applications. "
f"The key principles include proper configuration, "
f"monitoring performance metrics, and iterating. "
f"Teams that master {topic} deliver more reliable "
f"and efficient models. Common pitfalls include "
f"over-engineering and ignoring edge cases. "
f"Best practices suggest starting simple, measuring "
f"everything, and scaling only when needed. "
f"In production, {topic} requires careful attention "
f"to latency, throughput, and cost."
)
sections.append(paragraph)
return "\n\n".join(sections)
document = generate_long_document()
token_count = estimate_tokens(document)
print(f"Document length: {len(document)} characters")
print(f"Estimated tokens: {token_count}")
print(f"First 200 chars: {document[:200]}...")
A context window is the maximum number of tokens an LLM can handle in one request. Exceed it and the API returns an error. Stay within it but push close, and you burn through your token budget fast.
Here’s what trips people up. Even when your document fits, cramming everything into one giant prompt produces worse summaries. The model’s attention spreads thin. Important details in the middle get lost — researchers call this the “lost in the middle” problem.
Key Insight: Splitting a long document into focused chunks and summarizing each one separately produces better summaries than stuffing everything into a single prompt — even when the document fits within the context window.
Prerequisites
- Python version: 3.10+
- Required libraries: none beyond the standard library (urllib, json, re)
- API key: an OpenAI API key (set as OPENAI_API_KEY). You can swap in Claude or Gemini with minimal changes.
- Pyodide compatible: yes (all HTTP calls use urllib.request)
- Time to complete: ~25 minutes
- Cost: a few cents in API calls (chunks are small)
How Does the Sliding Window Approach Work?
Think about reading a book through a narrow window that slides down the page. You see a few paragraphs at a time. Each slide moves forward — but keeps some overlap with what you already read. That overlap connects the context.
The sliding window chunker takes three parameters: the text, the chunk size (in characters), and the overlap size. Without overlap, you’d split ideas at the boundary and each chunk would miss context from its neighbors.
Here’s the chunker. It walks through the text in steps of chunk_size - overlap. Each new chunk starts partway through the previous one, creating that critical overlap.
def sliding_window_chunk(text, chunk_size=2000, overlap=200):
"""Split text into overlapping chunks.
Args:
text: The full document text
chunk_size: Characters per chunk
overlap: Characters shared between consecutive chunks
Returns:
List of text chunks with overlap
"""
if len(text) <= chunk_size:
return [text]
chunks = []
step = chunk_size - overlap
start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final window reached; avoids a duplicate tail chunk
        start += step
    return chunks
chunks = sliding_window_chunk(document, chunk_size=2000, overlap=300)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
tokens = estimate_tokens(chunk)
print(f" Chunk {i+1}: {len(chunk)} chars, ~{tokens} tokens")
Each chunk shares 300 characters with its neighbor — roughly 75 tokens of shared context. That overlap prevents the summarizer from missing ideas that span chunk boundaries.
Tip: How much overlap should you use? Start with 10-15% of your chunk size. Too little and you lose boundary context. Too much and you waste tokens re-processing text. For most documents, 200-400 characters works well.
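To confirm the overlap really connects consecutive chunks, here is a small self-contained check. The chunker is re-declared in minimal form so the snippet runs on its own:

```python
def sliding_window_chunk(text, chunk_size=2000, overlap=200):
    """Fixed-size windows with overlap (minimal re-declaration)."""
    if len(text) <= chunk_size:
        return [text]
    chunks, step, start = [], chunk_size - overlap, 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # reached the end of the text
        start += step
    return chunks

def verify_overlap(chunks, overlap):
    """Each chunk (after the first) should start with the previous chunk's tail."""
    return all(prev[-overlap:] == curr[:overlap]
               for prev, curr in zip(chunks, chunks[1:]))

text = "abcdefghij" * 100  # 1,000 characters of predictable content
chunks = sliding_window_chunk(text, chunk_size=300, overlap=50)
print(len(chunks), verify_overlap(chunks, 50))  # → 4 True
```

A check like this is cheap insurance when you tweak chunking parameters: if it ever prints False, your map phase is silently losing boundary context.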
Sentence-Aware Chunking
But what if you need sentence-level precision? Cutting mid-sentence produces garbage summaries. Here’s a smarter chunker that finds the nearest sentence boundary before each cut point.
def smart_sliding_window(text, chunk_size=2000, overlap=300):
"""Sliding window that respects sentence boundaries."""
if len(text) <= chunk_size:
return [text]
chunks = []
step = chunk_size - overlap
start = 0
while start < len(text):
end = min(start + chunk_size, len(text))
if end < len(text):
last_period = text.rfind('. ', start, end)
if last_period > start + step // 2:
end = last_period + 2
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # reached the end; avoids emitting a duplicate tail chunk
        next_start = end - overlap
        if next_start <= start:
            next_start = start + step
        start = next_start
    return chunks
smart_chunks = smart_sliding_window(document)
print(f"Smart chunks: {len(smart_chunks)}")
for i, chunk in enumerate(smart_chunks[:3]):
print(f" Chunk {i+1}: ends with '...{chunk[-40:]}'")
The smart chunker finds the nearest period-space before cutting. Your summaries come out cleaner because the LLM gets complete sentences every time.
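To see the difference concretely, here is a self-contained sketch (both chunkers re-declared in minimal form) that compares how each one ends its first chunk on a synthetic text:

```python
def fixed_chunks(text, chunk_size, overlap):
    """Naive fixed-size windows: may cut mid-word or mid-sentence."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)]

def sentence_chunks(text, chunk_size, overlap):
    """Back each cut up to the nearest '. ' boundary (same idea as above)."""
    step, chunks, start = chunk_size - overlap, [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            cut = text.rfind('. ', start, end)
            if cut > start + step // 2:
                end = cut + 2
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + step)
    return chunks

text = "The quick brown fox jumps over the lazy dog near the river bank. " * 20
print(repr(fixed_chunks(text, 200, 40)[0][-12:]))     # cut mid-word
print(repr(sentence_chunks(text, 200, 40)[0][-12:]))  # ends at a sentence boundary
```

The naive chunk ends wherever the character count lands; the sentence-aware chunk always ends on a period, which is what keeps downstream summaries clean.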
How Do You Build the Map Phase?
With chunks ready, you send each one to an LLM for summarization. This is the “map” in map-reduce — applying the same function to every chunk independently.
We’ll use raw HTTP calls to the OpenAI API. No SDKs, no frameworks. This keeps the code portable to any environment, including Pyodide.
The call_llm function wraps a single API call. It sends a system prompt and user message, then returns the response. We use temperature=0.3 for consistent, focused summaries.
def call_llm(system_prompt, user_message, api_key,
model="gpt-4o-mini", temperature=0.3):
"""Make a raw HTTP call to OpenAI's chat completions API."""
url = "https://api.openai.com/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {api_key}"
}
payload = {
"model": model,
"temperature": temperature,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message}
]
}
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(url, data=data, headers=headers)
with urllib.request.urlopen(req, timeout=60) as resp:
result = json.loads(resp.read().decode("utf-8"))
return result["choices"][0]["message"]["content"]
That’s the building block. Every strategy we build calls this function.
The map function takes a list of chunks and summarizes each one. The prompt preserves key facts, numbers, and conclusions — the details that matter most.
def map_summarize(chunks, api_key, model="gpt-4o-mini"):
"""Summarize each chunk independently (map phase)."""
system_prompt = (
"You are a precise summarizer. Condense the text into "
"a clear summary that preserves all key facts, numbers, "
"names, and conclusions. Keep it under 150 words."
)
summaries = []
for i, chunk in enumerate(chunks):
prompt = f"Summarize this text:\n\n{chunk}"
summary = call_llm(system_prompt, prompt, api_key, model)
summaries.append(summary)
print(f" Chunk {i+1}/{len(chunks)} summarized "
f"({estimate_tokens(summary)} tokens)")
return summaries
Each chunk gets its own API call. The summaries return as a list — one per chunk, in order.
Warning: Rate limits will bite you on large documents. If you’re processing 50+ chunks, add a `time.sleep(0.5)` between calls. OpenAI’s rate limits vary by tier — check your dashboard before running bulk jobs.
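A minimal throttling wrapper, sketched here with a pluggable summarize function so it runs without an API key (in the real pipeline you would pass a closure around call_llm):

```python
import time

def throttled_map(chunks, summarize_fn, delay=0.5):
    """Apply summarize_fn to each chunk, pausing between API calls."""
    out = []
    for i, chunk in enumerate(chunks):
        out.append(summarize_fn(chunk))
        if i < len(chunks) - 1:  # no need to sleep after the last call
            time.sleep(delay)
    return out

# Stand-in summarizer for a dry run (swap in a real call_llm wrapper):
fake_summarize = lambda c: c[:50]
result = throttled_map(["a" * 100, "b" * 100], fake_summarize, delay=0.01)
print(len(result))  # → 2
```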
How Does Map-Reduce Combine the Results?
You’ve got chunk summaries. The reduce step merges them into one coherent final summary.
The approach: concatenate all summaries and ask the LLM to synthesize them. If the concatenated summaries exceed the context window, chunk the summaries themselves and reduce again. That’s the recursive safety net.
The reduce_summarize function checks whether combined summaries fit in one prompt. If they don’t, it chunks and reduces again — automatically.
def reduce_summarize(summaries, api_key, model="gpt-4o-mini",
max_context_chars=6000):
"""Combine chunk summaries into a final summary.
Recursively reduces if combined summaries are too long.
"""
combined = "\n\n".join(summaries)
if len(combined) <= max_context_chars:
system_prompt = (
"You are a document summarizer. Combine these section "
"summaries into one coherent summary. Eliminate "
"redundancy. Preserve key facts and conclusions."
)
prompt = (
f"Combine these {len(summaries)} section summaries "
f"into one final summary:\n\n{combined}"
)
return call_llm(system_prompt, prompt, api_key, model)
print(f" Combined too long ({len(combined)} chars), "
f"reducing recursively...")
sub_chunks = sliding_window_chunk(
combined, chunk_size=max_context_chars, overlap=500
)
sub_summaries = map_summarize(sub_chunks, api_key, model)
return reduce_summarize(sub_summaries, api_key, model,
max_context_chars)
Here’s the full pipeline function that ties chunk, map, and reduce together.
def map_reduce_pipeline(document, api_key, chunk_size=2000,
overlap=300, model="gpt-4o-mini"):
"""Full map-reduce summarization pipeline."""
print("Step 1: Chunking document...")
chunks = smart_sliding_window(document, chunk_size, overlap)
print(f" Created {len(chunks)} chunks\n")
print("Step 2: Summarizing chunks (map)...")
summaries = map_summarize(chunks, api_key, model)
print(f" Got {len(summaries)} summaries\n")
print("Step 3: Combining summaries (reduce)...")
final_summary = reduce_summarize(summaries, api_key, model)
print(" Done!\n")
return final_summary, summaries
Three steps. Chunk, map, reduce. If the reduce step hits summaries that are too long, it recurses. You don’t manage that edge case yourself.
Key Insight: Map-reduce summarization mirrors the distributed computing pattern. Map = process parts independently. Reduce = combine results. The recursion handles arbitrarily long documents because each round shrinks the total text.
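Before spending API credits, you can dry-run the same flow with a stand-in for call_llm. The mock below is a hypothetical offline substitute that "summarizes" by keeping the first sentence, which is enough to watch chunks flow through map and reduce:

```python
def mock_llm(system_prompt, user_message):
    """Offline stand-in for call_llm: keep only the first sentence."""
    body = user_message.split(":\n\n", 1)[-1]
    return body.split(".")[0] + "."

chunks = ["First fact. More detail here.", "Second fact. Extra words."]

# Map: summarize each chunk independently.
summaries = [mock_llm("sys", f"Summarize this text:\n\n{c}") for c in chunks]

# Reduce: combine the chunk summaries into one result.
combined = "\n\n".join(summaries)
final = mock_llm("sys", f"Combine summaries:\n\n{combined}")

print(summaries)  # → ['First fact.', 'Second fact.']
print(final)      # → First fact.
```

Swapping mock_llm for the real call_llm is the only change needed to run this against the API.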
What Is Recursive Summarization?
Map-reduce runs one map pass, then one reduce. Recursive summarization works differently — it compresses in layers, like zooming out on a map.
First, summarize chunks into paragraph-level summaries. Then summarize those paragraphs into section-level summaries. Keep going until the result fits in one prompt. Each layer compresses by a fixed ratio.
Why bother? Two reasons. It preserves hierarchical structure better. And it handles truly massive documents — full books, entire codebases — where even the first round of summaries is too long for one reduce call.
def recursive_summarize(document, api_key,
chunk_size=2000, overlap=300,
target_length=500,
model="gpt-4o-mini", depth=0):
"""Recursively summarize until result fits target_length."""
indent = " " * depth
print(f"{indent}Depth {depth}: "
f"length = {len(document)} chars")
if len(document) <= target_length:
print(f"{indent} -> Fits target. Done.")
return document
chunks = smart_sliding_window(document, chunk_size, overlap)
print(f"{indent} -> Split into {len(chunks)} chunks")
summaries = []
for chunk in chunks:
system_prompt = (
"Condense this text to about one-third its length. "
"Preserve key facts, numbers, and conclusions."
)
summary = call_llm(
system_prompt, f"Summarize:\n\n{chunk}",
api_key, model
)
summaries.append(summary)
combined = "\n\n".join(summaries)
print(f"{indent} -> Combined: {len(combined)} chars")
return recursive_summarize(
combined, api_key, chunk_size, overlap,
target_length, model, depth + 1
)
Watch the depth counter. Each recursion prints its level. At roughly 3x compression per pass, a 50,000-character document takes four or five rounds to reach a 500-character summary. You set the target length and let it run.
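The depth behavior is easy to model offline. Assuming each pass compresses by roughly 3x (matching the "one-third its length" prompt above), a toy calculation shows how the number of rounds scales with document size:

```python
def rounds_needed(doc_len, ratio=3, target=500):
    """Toy model: passes needed if each pass shrinks text by `ratio`."""
    rounds = 0
    while doc_len > target:
        doc_len //= ratio
        rounds += 1
    return rounds

for size in (5_000, 50_000, 500_000):
    print(f"{size:>7} chars -> {rounds_needed(size)} rounds")
```

Because each round divides the length, rounds grow logarithmically: a 10x larger document only adds a couple of passes.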
Which Strategy Should You Choose?
Here’s a comparison function that runs both strategies on the same document.
def compare_strategies(document, api_key, model="gpt-4o-mini"):
"""Run map-reduce and recursive on the same document."""
results = {}
print("=" * 50)
print("STRATEGY 1: Map-Reduce")
print("=" * 50)
start = time.time()
mr_summary, mr_chunks = map_reduce_pipeline(
document, api_key, 2000, 300, model
)
results["map_reduce"] = {
"summary": mr_summary,
"api_calls": len(mr_chunks) + 1,
"time": time.time() - start,
"length": len(mr_summary)
}
print("=" * 50)
print("STRATEGY 2: Recursive")
print("=" * 50)
start = time.time()
rec_summary = recursive_summarize(
document, api_key, 2000, 300, 800, model
)
results["recursive"] = {
"summary": rec_summary,
"time": time.time() - start,
"length": len(rec_summary)
}
return results
Use compare_strategies(document, api_key) to see both in action. Map-reduce is faster for moderate documents — fewer total API calls. Recursive shines on very long documents where chunk summaries themselves overflow.
Here’s the decision guide:
| Scenario | Best Strategy | Why |
|---|---|---|
| 1-2 chunks | Direct summarization | No chunking overhead needed |
| 5-50 chunks | Map-reduce | One map + one reduce is efficient |
| 50+ chunks | Recursive | Handles any length gracefully |
| Hierarchical structure needed | Recursive | Natural multi-level compression |
| Speed over perfection | Map-reduce | Fewer total API calls |
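The decision guide collapses into a small helper. This is a heuristic sketch mirroring the table, not a hard rule:

```python
def choose_strategy(num_chunks, need_hierarchy=False):
    """Pick a summarization strategy per the decision guide above."""
    if need_hierarchy or num_chunks > 50:
        return "recursive"       # handles any length; multi-level compression
    if num_chunks <= 2:
        return "direct"          # no chunking overhead needed
    return "map_reduce"          # one map + one reduce is efficient

for n in (1, 20, 120):
    print(n, "->", choose_strategy(n))  # → direct, map_reduce, recursive
```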
Tip: Cut API costs with a hybrid approach. Use `gpt-4o-mini` for the map phase (cheap bulk work) and `gpt-4o` for the final reduce step (quality matters most there). You get the best of both worlds.
Exercise 1: Build a Token-Aware Chunker

The sliding_window_chunk function splits by character count, but API billing uses tokens. Write a function token_aware_chunk(text, max_tokens=500, overlap_tokens=50) that converts token limits to character limits (1 token ≈ 4 characters) and calls sliding_window_chunk. Print the chunk count and each chunk’s estimated token count. For the test document this should produce 7 chunks; a short string should produce 1.

Starter code:

def token_aware_chunk(text, max_tokens=500, overlap_tokens=50):
    """Split text into chunks based on token count."""
    chunk_size_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    # YOUR CODE: call sliding_window_chunk with char equivalents
    pass

chunks = token_aware_chunk(document, max_tokens=500, overlap_tokens=50)
print(f"Number of chunks: {len(chunks)}")
for i, c in enumerate(chunks):
    print(f"  Chunk {i+1}: ~{estimate_tokens(c)} tokens")

Hint: convert tokens to characters by multiplying by 4, then pass both values to sliding_window_chunk.

Solution:

def token_aware_chunk(text, max_tokens=500, overlap_tokens=50):
    """Split text into chunks based on token count."""
    chunk_size_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    return sliding_window_chunk(text, chunk_size_chars, overlap_chars)

Since the estimator uses 1 token per 4 characters, multiplying token limits by 4 gives character limits; then reuse sliding_window_chunk. This keeps chunking aligned with API billing.
How Do You Handle Edge Cases in Production?
Real documents aren’t clean text files. They have headers, tables, code blocks, and formatting that can trip up your chunker. Here are the patterns that cause problems.
Structure-Aware Chunking
The first issue: chunk boundaries splitting key information. A table spanning a boundary produces two useless halves. The fix — detect structural markers like headers and paragraph breaks, and split there instead.
def structure_aware_chunker(text, chunk_size=2000, overlap=300):
"""Chunk text at paragraph/section boundaries."""
sections = re.split(r'\n\n+', text)
chunks = []
current_chunk = ""
for section in sections:
if (len(current_chunk) + len(section) > chunk_size
and current_chunk):
chunks.append(current_chunk.strip())
current_chunk = current_chunk[-overlap:] + "\n\n"
current_chunk += section + "\n\n"
if current_chunk.strip():
chunks.append(current_chunk.strip())
return chunks
struct_chunks = structure_aware_chunker(document)
print(f"Structure-aware chunks: {len(struct_chunks)}")
for i, c in enumerate(struct_chunks[:3]):
print(f" Chunk {i+1}: {len(c)} chars")
Filtering Tiny Chunks
Empty or near-empty chunks waste API calls and add noise. Filter them before the map phase.
def filter_chunks(chunks, min_chars=100):
"""Remove chunks too short to summarize."""
filtered = [c for c in chunks if len(c) >= min_chars]
removed = len(chunks) - len(filtered)
if removed:
print(f"Filtered {removed} chunks under {min_chars} chars")
return filtered
clean_chunks = filter_chunks(struct_chunks, min_chars=100)
print(f"After filtering: {len(clean_chunks)} chunks")
Relevance Scoring
Not every chunk matters equally. A conclusion matters more than boilerplate acknowledgements. A simple heuristic scores chunks by signal words — chunks with higher scores get a longer summary allocation.
def score_chunk_relevance(chunk):
"""Score chunk importance using signal words."""
signals = [
"conclusion", "result", "finding", "significant",
"important", "key", "critical", "recommend",
"therefore", "in summary", "we found"
]
text_lower = chunk.lower()
return sum(1 for w in signals if w in text_lower)
for i, chunk in enumerate(clean_chunks[:5]):
score = score_chunk_relevance(chunk)
print(f"Chunk {i+1}: relevance = {score}")
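One hypothetical way to turn those scores into allocations: give higher-scoring chunks a larger word budget for their summaries. The base, per-signal, and cap values below are illustrative assumptions, not part of the original pipeline:

```python
def summary_word_budget(score, base=80, per_signal=30, cap=200):
    """Map a relevance score to a target summary length in words
    (illustrative allocation scheme)."""
    return min(base + score * per_signal, cap)

for score in (0, 1, 3, 6):
    print(f"score {score} -> budget {summary_word_budget(score)} words")
```

In the map phase you would then interpolate the budget into the prompt, e.g. f"Keep it under {budget} words."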
Resilient API Calls
One timeout shouldn’t kill your entire pipeline. Add retry logic with exponential backoff.
def resilient_call(system, message, api_key, retries=3):
"""Call LLM with retry logic for production use."""
for attempt in range(retries):
try:
return call_llm(system, message, api_key)
except Exception as e:
if attempt < retries - 1:
wait = 2 ** attempt
print(f" Retry {attempt+1} in {wait}s: {e}")
time.sleep(wait)
else:
return f"[SUMMARY FAILED: {str(e)[:100]}]"
Warning: Don’t skip overlap in production. It’s tempting to set overlap to 0 for speed. Without overlap, you miss every insight that spans two chunks. A 10-15% overlap is cheap insurance.
Exercise 2: Track Compression Metadata

Write a function map_with_metadata(chunks) that simulates summarization (take the first 25% of each chunk) and returns a list of dicts with: chunk_index (int), original_length (chars), summary_length (chars), and compression_ratio (original / summary, rounded to 1 decimal). Print each chunk’s metadata.

Starter code:

def map_with_metadata(chunks):
    """Summarize chunks and track metadata."""
    results = []
    for i, chunk in enumerate(chunks):
        summary = chunk[:len(chunk) // 4]  # simulate summary
        # YOUR CODE: build dict with chunk_index, original_length,
        # summary_length, compression_ratio
        pass
    return results

chunks = sliding_window_chunk(document, chunk_size=2000, overlap=300)
meta = map_with_metadata(chunks)
for m in meta:
    print(f"Chunk {m['chunk_index']}: {m['original_length']} -> "
          f"{m['summary_length']} chars ({m['compression_ratio']}x)")

Hint: append {"chunk_index": i, "original_length": len(chunk), "summary_length": len(summary), "compression_ratio": round(len(chunk) / len(summary), 1)} to results.

Solution:

def map_with_metadata(chunks):
    """Summarize chunks and track metadata."""
    results = []
    for i, chunk in enumerate(chunks):
        summary = chunk[:len(chunk) // 4]
        results.append({
            "chunk_index": i,
            "original_length": len(chunk),
            "summary_length": len(summary),
            "compression_ratio": round(len(chunk) / len(summary), 1)
        })
    return results

Each chunk gets a metadata dict tracking index, sizes, and compression ratio. In production, this helps spot chunks that compressed poorly or produced suspiciously short summaries.
When Should You NOT Use Sliding Window Summarization?
This approach isn’t always the right call. Here are three cases where you should pick a different strategy.
When the document fits comfortably in context. If your document is 50K tokens and your model handles 128K, just send it directly. Chunking adds complexity and API calls for no benefit.
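A cheap guard before invoking the pipeline: estimate tokens and compare against the window, keeping a reserve for the prompt and output. The 4-chars-per-token estimate and 30% reserve below are assumptions you should tune:

```python
def fits_in_context(text, context_tokens=128_000, reserve=0.3):
    """True if the document fits with `reserve` fraction left for prompt/output."""
    est_tokens = len(text) // 4  # rough 4-chars-per-token estimate
    return est_tokens <= int(context_tokens * (1 - reserve))

print(fits_in_context("x" * 200_000))  # ~50K tokens vs ~89K budget → True
print(fits_in_context("x" * 600_000))  # ~150K tokens → False
```

If this returns True, summarize directly; only fall back to chunking when it returns False.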
When order and nuance matter more than coverage. Sliding window summarization compresses aggressively. If you need to preserve the exact argument flow of a legal brief or the narrative arc of a story, consider extractive summarization instead — it pulls key sentences verbatim rather than rewriting.
When you need real-time speed. Each chunk requires a separate API call. A 100-chunk document means 100+ API calls. If you need sub-second responses (like in a chat interface), use truncation or pre-computed summaries. Sliding window summarization is a batch operation, not a real-time one.
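To put numbers on that, here is a back-of-envelope call counter. It assumes each reduce call can merge about ten summaries, which is an illustrative figure:

```python
import math

def mapreduce_calls(num_chunks, summaries_per_reduce=10):
    """Total API calls for map-reduce: one per chunk, plus reduce rounds."""
    calls, items = num_chunks, num_chunks
    while items > 1:
        items = math.ceil(items / summaries_per_reduce)
        calls += items
    return calls

print(mapreduce_calls(100))  # 100 map calls + 10 + 1 reduce calls → 111
```

Even with perfect parallelism on the map phase, the reduce rounds are sequential, so latency stays well above interactive thresholds.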
Note: Two alternatives worth knowing. For documents that fit in context but need focused summaries, try chain-of-density prompting, which progressively adds detail to a summary. For code-heavy documents, consider AST-based chunking that splits at function or class boundaries rather than character counts.
Common Mistakes and How to Fix Them
Mistake 1: No overlap between chunks
❌ Wrong:
bad_chunks = sliding_window_chunk(
document, chunk_size=2000, overlap=0
)
Ideas spanning chunk boundaries get split. The summarizer sees half a thought in each chunk. The final summary misses those ideas.
✅ Correct:
good_chunks = sliding_window_chunk(
document, chunk_size=2000, overlap=300
)
Mistake 2: No error handling on API calls
❌ Wrong:
for chunk in chunks:
summary = call_llm(system, chunk, api_key) # one failure kills all
One rate limit error kills the pipeline. You lose every summary already generated.
✅ Correct:
for chunk in chunks:
summary = resilient_call(system, chunk, api_key, retries=3)
Mistake 3: Same prompt for map and reduce
❌ Wrong:
prompt = "Summarize this text." # used for both phases
The map phase extracts facts from raw text. The reduce phase synthesizes and deduplicates across summaries. Different jobs need different instructions.
✅ Correct:
map_prompt = "Extract key facts, numbers, and conclusions."
reduce_prompt = ("Combine these section summaries into one "
"coherent summary. Eliminate redundancy.")
The Complete Pipeline Class
Here’s everything wrapped in a class. Configure once, then call summarize() on any document. This uses structure-aware chunking, filtering, and your choice of strategy.
class SlidingWindowSummarizer:
"""Production-ready sliding window summarizer."""
def __init__(self, api_key, model="gpt-4o-mini",
chunk_size=2000, overlap=300):
self.api_key = api_key
self.model = model
self.chunk_size = chunk_size
self.overlap = overlap
def summarize(self, document, strategy="map_reduce"):
"""Summarize a document with the chosen strategy."""
print(f"Strategy: {strategy}")
print(f"Document: {len(document)} chars, "
f"~{estimate_tokens(document)} tokens\n")
chunks = structure_aware_chunker(
document, self.chunk_size, self.overlap
)
chunks = filter_chunks(chunks, min_chars=100)
if strategy == "map_reduce":
sums = map_summarize(chunks, self.api_key, self.model)
return reduce_summarize(sums, self.api_key, self.model)
elif strategy == "recursive":
return recursive_summarize(
document, self.api_key,
self.chunk_size, self.overlap,
target_length=800, model=self.model
)
# Usage (requires API key):
# summarizer = SlidingWindowSummarizer(api_key="sk-...")
# result = summarizer.summarize(document, "map_reduce")
# print(result)
Summary
You’ve built a complete sliding window summarization pipeline. Here’s what you know now:
- Sliding window chunking splits documents into overlapping pieces. The overlap prevents losing ideas at boundaries.
- Map-reduce summarization processes chunks independently, then combines. Fast for moderate documents.
- Recursive summarization compresses in layers until the result fits. Handles arbitrarily long documents.
- Structure-aware chunking respects paragraph and section boundaries for cleaner results.
- Production hardening means retry logic, chunk filtering, and relevance scoring.
Use map-reduce for most jobs. Switch to recursive when chunk summaries themselves overflow.
Frequently Asked Questions
How do you choose the right chunk size for your document?
Start with your model’s context limit. Subtract the system prompt and expected output length. Use 60-70% of what remains. For GPT-4o-mini at 128K context, that’s roughly 70K tokens per chunk. But smaller chunks (500-1000 tokens) often produce better summaries because the model focuses on less text.
Can you use this pipeline with Claude or Gemini instead of OpenAI?
Absolutely. Change the URL and headers in call_llm. For Claude, use https://api.anthropic.com/v1/messages with x-api-key and anthropic-version headers. For Gemini, use the Google AI endpoint. The chunking and pipeline logic stays identical.
How do you handle documents with images or tables?
Extract text first using a document parser — pymupdf for PDFs, BeautifulSoup for HTML. Convert tables to markdown format before chunking. Images require a multimodal model (GPT-4o vision or Claude 3.5) in a separate step.
What’s the maximum document size this pipeline handles?
No hard limit. Recursive summarization handles any length because each round compresses. A million-word document might need 4-5 rounds. The practical limit is cost and API rate limits, not the pipeline itself.
References
- OpenAI API documentation — Chat Completions.
- Anthropic documentation — Messages API.
- Liu, N. F. et al. — “Lost in the Middle: How Language Models Use Long Contexts.” TACL (2024).
- Chang, Y. et al. — “BooookScore: A systematic exploration of book-length summarization.” EMNLP (2024).
- OpenAI — tiktoken tokenizer library.
- Dean, J. and Ghemawat, S. — “MapReduce: Simplified Data Processing on Large Clusters.” OSDI (2004).
- Google AI documentation — Gemini API.