
Lost-in-the-Middle: Fix LLM Position Bias (Python)


Written by Selva Prabhakaran | 26 min read

Build a needle-in-haystack test, reveal the U-shaped accuracy curve, and reorder context so your LLM actually reads the important parts.

This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser.

You stuff ten documents into a prompt. The answer sits in document #5. The LLM ignores it and pulls from document #1 instead. Your retrieval was perfect. Your answer is wrong.

This isn’t a rare edge case. Liu et al. (2023) demonstrated it across every major model they tested. LLMs pay the most attention to the beginning and end of their context. Everything in the middle gets neglected. The accuracy curve looks like a U.

In this tutorial, you’ll build the test yourself. You’ll measure the U-shaped bias with raw HTTP API calls. Then you’ll implement position reordering and watch accuracy climb back up.

What Is Position Bias in LLMs?

Position bias means an LLM treats information differently based on where it sits in the context. A fact at the top of your prompt gets more attention than the same fact buried in the middle.

Here’s a quick demonstration. We’ll hide one relevant document among irrelevant ones and ask the model to find it.

import micropip
await micropip.install('matplotlib')  # only needed for plotting later; API calls are stdlib-only

import json
import urllib.request
import os
import time
import ssl

# SSL context for Pyodide browser compatibility
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-key-here")

def call_openai(messages, model="gpt-4o-mini", temperature=0):
    """Raw HTTP call to OpenAI chat completions."""
    url = "https://api.openai.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OPENAI_API_KEY}"
    }
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": 150
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers=headers
    )
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["choices"][0]["message"]["content"]

No SDKs. No dependencies beyond Python’s standard library. Every API call in this tutorial uses raw urllib.request and json.

Tip: Why raw HTTP instead of the OpenAI SDK? It runs in Pyodide (browser Python) with zero installs. You also see exactly what leaves your machine — headers, payload, everything. That transparency matters when debugging position bias experiments.

Prerequisites

  • Python version: 3.10+
  • Required libraries: matplotlib (pip install matplotlib)
  • API keys: OpenAI (required), Anthropic and Google (optional for multi-provider test)
  • Pyodide compatible: Yes (all HTTP calls use urllib.request)
  • Time to complete: ~25 minutes
  • Cost: ~$0.05 per full experiment run

Get your API keys from platform.openai.com, console.anthropic.com, and aistudio.google.com. Set them as environment variables: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY.

The Lost-in-the-Middle Phenomenon

Why do LLMs ignore the middle? It comes down to how attention works.

Liu et al. tested models on multi-document QA. They placed the gold document at different positions among distractors. The result was a clear U-shaped curve.

Models answered correctly 70-80% of the time when the gold document was first or last. Accuracy dropped to 40-55% when it sat in the middle. This held across GPT-3.5, GPT-4, Claude, and open-source models.

[UNDER-THE-HOOD]
Why the U-shape? Transformer attention uses Rotary Position Embeddings (RoPE) in most modern LLMs. RoPE introduces a decay effect — tokens far from the current position get weaker attention scores. The beginning benefits from primacy (the model processes it first). The end benefits from recency (it’s freshest in the attention window). The middle gets neither advantage. Skip this box if you don’t care about the internals — the practical fix works regardless of why.

Key Insight: LLMs don’t read context like humans read a book. They attend to it like a student cramming — the first and last pages stick, the middle blurs. Position reordering exploits this by putting your best evidence where attention is strongest.

Build the Needle-in-Haystack Test

Here’s the experiment. We create “documents” about different cities. One contains the answer (the needle). The rest are distractors (the haystack). We vary where the needle sits and measure accuracy at each position.

The build_context function takes a list of distractors, a needle document, and a target position. It inserts the needle and returns the assembled context.

def build_context(distractor_docs, needle_doc, position):
    """Insert needle_doc at the given position."""
    docs = distractor_docs.copy()
    docs.insert(position, needle_doc)
    context_parts = []
    for i, doc in enumerate(docs):
        context_parts.append(f"[Document {i+1}]: {doc}")
    return "\n\n".join(context_parts)

Now our test data. The needle answers “What is the population of Zurich?” Nine distractors cover other cities.

needle = (
    "Zurich is the largest city in Switzerland with a "
    "population of approximately 434,000 in the city "
    "proper and 1.4 million in the metro area."
)

distractors = [
    "Tokyo is the capital of Japan, known for cherry "
    "blossoms and its extensive rail network.",
    "Cairo sits along the Nile River and is home to the "
    "Great Pyramids of Giza.",
    "Vancouver is a coastal city in British Columbia, "
    "surrounded by mountains and ocean.",
    "Mumbai is the financial capital of India and home "
    "to Bollywood, the largest film industry.",
    "Lagos is the most populous city in Nigeria and a "
    "major financial hub for West Africa.",
    "Stockholm is the capital of Sweden, spread across "
    "14 islands with Nobel Prize ceremonies.",
    "Buenos Aires is the capital of Argentina, famous "
    "for tango music and dance.",
    "Nairobi is the capital of Kenya and serves as a "
    "base for safaris to nearby parks.",
    "Lisbon is the capital of Portugal, built on seven "
    "hills overlooking the Tagus River.",
]

question = ("Based on the documents provided, what is "
            "the population of Zurich?")

We have 9 distractors plus 1 needle. That gives us 10 positions (0 through 9). Position 0 means the needle is the very first document. Position 9 puts it last.

Quick check: Before running anything, predict the outcome. Where do you think accuracy will be highest — position 0, position 5, or position 9? If you guessed 0 and 9, you’ve already grasped the core idea.

Run the Position Bias Experiment

This is the core experiment. We call the API once per position, check if the response contains “434” (the correct population figure), and record hit or miss.

The run_position_experiment function loops through all positions, builds the context, sends the query, and checks for the answer signal.

def run_position_experiment(distractors, needle, question,
                            answer_signal="434"):
    """Test retrieval accuracy at each needle position."""
    num_positions = len(distractors) + 1
    results = []

    for pos in range(num_positions):
        context = build_context(distractors, needle, pos)
        prompt = (
            f"Use ONLY the documents below to answer. "
            f"If the answer is not in the documents, "
            f"say 'Not found'.\n\n"
            f"{context}\n\n"
            f"Question: {question}\nAnswer concisely."
        )
        messages = [{"role": "user", "content": prompt}]
        response = call_openai(messages)
        found = answer_signal.lower() in response.lower()
        results.append({
            "position": pos,
            "total_docs": num_positions,
            "found": found,
            "response": response[:200]
        })
        print(f"Pos {pos:2d}/{num_positions-1}: "
              f"{'HIT' if found else 'MISS'} | "
              f"{response[:80]}...")
        time.sleep(0.5)

    return results

When you run this, expect output like:

Pos  0/9: HIT | The population of Zurich is approximately 434,000...
Pos  1/9: HIT | Zurich has a population of approximately 434,000...
Pos  2/9: HIT | Zurich has approximately 434,000 people...
Pos  3/9: MISS | Based on the documents, I could not find...
Pos  4/9: MISS | Not found...
Pos  5/9: MISS | The documents do not contain population data...
Pos  6/9: HIT | Zurich has approximately 434,000 people...
Pos  7/9: HIT | The population of Zurich is about 434,000...
Pos  8/9: HIT | Zurich's population is approximately 434,000...
Pos  9/9: HIT | According to the documents, Zurich has 434,000...

See the pattern? Positions 0-2 (beginning) and 6-9 (end) are hits. Positions 3-5 (middle) are misses. That’s the U-curve.

Warning: Your results will vary. LLM responses are non-deterministic even at temperature=0. Run the experiment 3-5 times and average the results. The U-shape trend is consistent, but individual positions may flip between runs.

Visualize the U-Shaped Curve

A single run shows the trend. Averaging multiple runs makes it undeniable.

The run_multiple_trials function repeats the experiment and computes accuracy per position. Each value shows what fraction of trials found the answer at that position.

def run_multiple_trials(distractors, needle, question,
                        num_trials=3, answer_signal="434"):
    """Run the experiment multiple times and average."""
    num_positions = len(distractors) + 1
    accuracy = [0] * num_positions

    for trial in range(num_trials):
        print(f"\n--- Trial {trial + 1}/{num_trials} ---")
        results = run_position_experiment(
            distractors, needle, question, answer_signal
        )
        for r in results:
            if r["found"]:
                accuracy[r["position"]] += 1

    accuracy_pct = [count / num_trials * 100
                    for count in accuracy]
    return accuracy_pct

Accuracy percentages across positions typically look like:

Position 0:  100.0%
Position 1:   66.7%
Position 2:   66.7%
Position 3:   33.3%
Position 4:   33.3%
Position 5:   33.3%
Position 6:   66.7%
Position 7:  100.0%
Position 8:  100.0%
Position 9:  100.0%

Middle positions (3-5) drop to ~33%. Edge positions stay at 67-100%. That’s the U-shape from Liu et al., reproduced with 10 documents.

The plot makes the valley obvious. We color bars green for high accuracy, yellow for moderate, and red for low.

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def plot_u_curve(accuracy_pct):
    """Plot the U-shaped accuracy curve."""
    positions = list(range(len(accuracy_pct)))
    colors = []
    for acc in accuracy_pct:
        if acc >= 70:
            colors.append("#2ecc71")
        elif acc >= 50:
            colors.append("#f39c12")
        else:
            colors.append("#e74c3c")

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(positions, accuracy_pct, color=colors,
           edgecolor="white", linewidth=0.5)
    ax.set_xlabel("Needle Position (0 = first document)",
                  fontsize=12)
    ax.set_ylabel("Retrieval Accuracy (%)", fontsize=12)
    ax.set_title(
        "Lost in the Middle: U-Shaped Accuracy Curve",
        fontsize=14, fontweight="bold"
    )
    ax.set_xticks(positions)
    ax.set_ylim(0, 110)
    ax.axhline(y=50, color="gray", linestyle="--",
               alpha=0.5, label="50% baseline")
    ax.legend()
    plt.tight_layout()
    plt.savefig("u_curve.png", dpi=150)
    plt.show()
    print("Chart saved to u_curve.png")

# plot_u_curve(accuracy_pct)  # uncomment after running

The chart makes the bias impossible to deny. Every LLM team should run this test before shipping a RAG pipeline.

Test Across Multiple Providers

Does every LLM suffer from this? Let’s find out by adding Anthropic Claude and Google Gemini wrappers.

The call_anthropic function hits the Anthropic Messages API. The payload format is nearly identical to OpenAI, but uses x-api-key instead of Authorization: Bearer.

ANTHROPIC_API_KEY = os.environ.get(
    "ANTHROPIC_API_KEY", "your-key-here"
)

def call_anthropic(messages,
                   model="claude-sonnet-4-20250514",
                   temperature=0):
    """Raw HTTP call to Anthropic Messages API."""
    url = "https://api.anthropic.com/v1/messages"
    headers = {
        "Content-Type": "application/json",
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01"
    }
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": 150
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers=headers
    )
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["content"][0]["text"]

Gemini uses a different structure. Instead of messages, it wants contents with parts. The API key goes in the URL as a query parameter.

GEMINI_API_KEY = os.environ.get(
    "GEMINI_API_KEY", "your-key-here"
)

def call_gemini(messages, model="gemini-2.0-flash",
                temperature=0):
    """Raw HTTP call to Google Gemini API."""
    url = (
        f"https://generativelanguage.googleapis.com/"
        f"v1beta/models/{model}:generateContent"
        f"?key={GEMINI_API_KEY}"
    )
    headers = {"Content-Type": "application/json"}
    gemini_contents = []
    for m in messages:
        role = "user" if m["role"] == "user" else "model"
        gemini_contents.append({
            "role": role,
            "parts": [{"text": m["content"]}]
        })
    payload = json.dumps({
        "contents": gemini_contents,
        "generationConfig": {
            "temperature": temperature,
            "maxOutputTokens": 150
        }
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers=headers
    )
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return (result["candidates"][0]["content"]
            ["parts"][0]["text"])

Now run the same test across all three. The run_multi_provider function accepts a dictionary of provider names mapped to call functions.

def run_multi_provider(distractors, needle, question,
                       providers, answer_signal="434"):
    """Run position experiment across LLM providers."""
    all_results = {}
    for name, call_fn in providers.items():
        print(f"\n{'='*50}")
        print(f"Testing: {name}")
        print(f"{'='*50}")
        num_positions = len(distractors) + 1
        results = []
        for pos in range(num_positions):
            context = build_context(distractors, needle, pos)
            prompt = (
                f"Use ONLY the documents below to answer."
                f"\n\n{context}\n\n"
                f"Question: {question}\nAnswer concisely."
            )
            msgs = [{"role": "user", "content": prompt}]
            try:
                response = call_fn(msgs)
                found = answer_signal.lower() in response.lower()
            except Exception as e:
                response = f"ERROR: {e}"
                found = False
            results.append({"position": pos, "found": found})
            print(f"  Pos {pos}: "
                  f"{'HIT' if found else 'MISS'}")
            time.sleep(1)

        accuracy = sum(1 for r in results if r["found"])
        pct = accuracy / num_positions * 100
        all_results[name] = {
            "results": results,
            "overall_accuracy": pct
        }
        print(f"  Overall: {pct:.0f}%")
    return all_results

Typical cross-provider results:

GPT-4o-mini:    Overall 70% (misses positions 3-5)
Claude Sonnet:  Overall 80% (milder U-shape)
Gemini Flash:   Overall 70% (similar to GPT-4o-mini)

No provider is immune. All show degraded middle-position performance. Claude tends to have a shallower valley, but the pattern persists.

Note: Each full cross-provider experiment costs roughly $0.05-0.15 total. You can skip Anthropic and Gemini if you don’t have those API keys. The OpenAI-only results demonstrate the same phenomenon.

Position Reordering — The Fix

So the middle is a dead zone. What can you do? Don’t put important documents there.

Position reordering rearranges your retrieval results so the most relevant documents land at the edges. The least relevant fill the middle. You’re working with the model’s attention pattern instead of against it.

Here’s the strategy: take your ranked documents and interleave them. Even-ranked ones go to the start. Odd-ranked ones go to the end in reverse. This puts #1 first and #2 last — the two strongest attention positions.

def reorder_for_edges(documents, relevance_scores=None):
    """Reorder docs so most relevant land at edges.

    Strategy: even-ranked go to start (in order),
    odd-ranked go to end (reversed).
    """
    if relevance_scores is not None:
        paired = list(zip(documents, relevance_scores))
        paired.sort(key=lambda x: x[1], reverse=True)
        ranked_docs = [doc for doc, _ in paired]
    else:
        ranked_docs = documents

    start_group = []
    end_group = []

    for i, doc in enumerate(ranked_docs):
        if i % 2 == 0:
            start_group.append(doc)
        else:
            end_group.append(doc)

    end_group.reverse()
    return start_group + end_group

Let’s trace through an example with 6 documents ranked by relevance.

Original Rank        Group  Final Position
#1 (most relevant)   Start  Position 0 (first)
#3                   Start  Position 1
#5                   Start  Position 2
#6 (least relevant)  End    Position 3 (middle)
#4                   End    Position 4
#2                   End    Position 5 (last)
Documents #1 and #2 land at the very first and very last positions. Documents #5 and #6 absorb the middle penalty. That’s the fix.
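You can verify the trace with a compact slicing version of the same interleave (equivalent to reorder_for_edges when the list is already ranked):

```python
def edge_interleave(ranked_docs):
    """Even ranks to the start (in order), odd ranks to the end
    (reversed) -- the same strategy as reorder_for_edges above."""
    return ranked_docs[0::2] + ranked_docs[1::2][::-1]

docs = ["#1", "#2", "#3", "#4", "#5", "#6"]  # ranked, most relevant first
print(edge_interleave(docs))
# -> ['#1', '#3', '#5', '#6', '#4', '#2']
```

The slicing form trades the explicit loop for brevity; either version is O(n) and stable within each group.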

Measure the Improvement

Does reordering actually help? Let’s test it head-on.

We’ll force the needle into position 4 (dead center of 10 documents), then apply reordering and compare. The compare_middle_vs_reordered function runs both scenarios.

def compare_middle_vs_reordered(num_trials=3):
    """Force needle to middle, then reorder and compare."""
    middle_pos = 4
    context_middle = build_context(
        distractors, needle, middle_pos
    )

    # Build ordered list with needle at position 4
    all_docs = distractors.copy()
    all_docs.insert(middle_pos, needle)
    # Simulate reranker scores: needle ranks third overall,
    # so the interleave places it near the front edge (position 1)
    scores = [0.90 - i * 0.05 for i in range(10)]
    scores[middle_pos] = 0.82

    reordered = reorder_for_edges(all_docs, scores)
    context_reorder = "\n\n".join(
        f"[Document {i+1}]: {d}"
        for i, d in enumerate(reordered)
    )
    needle_new_pos = reordered.index(needle)
    print(f"Original position: {middle_pos} (middle)")
    print(f"Reordered position: {needle_new_pos}")

    mid_hits = 0
    reorder_hits = 0
    for trial in range(num_trials):
        prompt_mid = (
            f"Use ONLY the documents below to answer."
            f"\n\n{context_middle}\n\n"
            f"Question: {question}\nAnswer concisely."
        )
        resp_mid = call_openai(
            [{"role": "user", "content": prompt_mid}]
        )
        if "434" in resp_mid.lower():
            mid_hits += 1

        prompt_r = (
            f"Use ONLY the documents below to answer."
            f"\n\n{context_reorder}\n\n"
            f"Question: {question}\nAnswer concisely."
        )
        resp_r = call_openai(
            [{"role": "user", "content": prompt_r}]
        )
        if "434" in resp_r.lower():
            reorder_hits += 1
        time.sleep(1)

    print(f"\nMiddle placement: {mid_hits}/{num_trials} "
          f"({mid_hits/num_trials*100:.0f}%)")
    print(f"Edge reordering:  {reorder_hits}/{num_trials} "
          f"({reorder_hits/num_trials*100:.0f}%)")

# compare_middle_vs_reordered()

Expected pattern:

Original position: 4 (middle)
Reordered position: 1

Middle placement: 1/3 (33%)
Edge reordering:  3/3 (100%)

From 33% to 100%. The needle moved from the dead zone to position 1, and the model found it every time.

Key Insight: Position reordering doesn’t change what the LLM sees. It changes where. By placing relevant documents at the edges, you work with the model’s attention pattern instead of against it.

Build a Reusable Bias Auditor

Let’s package everything into a class you can drop into any RAG pipeline. The PositionBiasAuditor runs the full position sweep and prints a diagnostic report.

The constructor takes a provider call function, distractors, needle, and the answer signal. The audit method sweeps all positions across multiple trials. The report method prints a formatted summary.

class PositionBiasAuditor:
    """Measure and report position bias for any LLM."""

    def __init__(self, call_fn, distractors, needle,
                 question, answer_signal):
        self.call_fn = call_fn
        self.distractors = distractors
        self.needle = needle
        self.question = question
        self.answer_signal = answer_signal
        self.num_positions = len(distractors) + 1

    def audit(self, num_trials=3):
        """Run full position bias audit."""
        accuracy = [0] * self.num_positions
        for trial in range(num_trials):
            for pos in range(self.num_positions):
                ctx = build_context(
                    self.distractors, self.needle, pos
                )
                prompt = (
                    f"Use ONLY the documents below.\n\n"
                    f"{ctx}\n\n"
                    f"Question: {self.question}\n"
                    f"Answer concisely."
                )
                msgs = [{"role": "user", "content": prompt}]
                try:
                    resp = self.call_fn(msgs)
                    hit = (self.answer_signal.lower()
                           in resp.lower())
                except Exception:
                    hit = False
                if hit:
                    accuracy[pos] += 1
                time.sleep(0.5)

        pct = [c / num_trials * 100 for c in accuracy]
        return {
            "accuracy_by_position": pct,
            "mean_accuracy": sum(pct) / len(pct),
            "edge_accuracy": (pct[0] + pct[-1]) / 2,
            "middle_accuracy": pct[self.num_positions // 2],
            "bias_gap": (
                (pct[0] + pct[-1]) / 2
                - pct[self.num_positions // 2]
            )
        }

The report method prints a bar chart in your terminal. Each bar shows accuracy at that position.

    def report(self, results):
        """Print a human-readable bias report."""
        print("\n" + "=" * 50)
        print("POSITION BIAS AUDIT REPORT")
        print("=" * 50)
        gap = results["bias_gap"]
        print(f"Mean accuracy:   "
              f"{results['mean_accuracy']:.1f}%")
        print(f"Edge accuracy:   "
              f"{results['edge_accuracy']:.1f}%")
        print(f"Middle accuracy: "
              f"{results['middle_accuracy']:.1f}%")
        print(f"Bias gap:        {gap:.1f}pp")
        print("\nAccuracy by position:")
        for i, acc in enumerate(
            results["accuracy_by_position"]
        ):
            bar = "#" * int(acc / 5)
            print(f"  Pos {i:2d}: {acc:5.1f}% {bar}")

        if gap > 30:
            print("\nSEVERE position bias detected.")
            print("Apply position reordering.")
        elif gap > 15:
            print("\nMODERATE position bias.")
            print("Consider reordering for critical queries.")
        else:
            print("\nLOW position bias. Reordering optional.")

Use it like this:

auditor = PositionBiasAuditor(
    call_fn=call_openai,
    distractors=distractors,
    needle=needle,
    question=question,
    answer_signal="434"
)
# results = auditor.audit(num_trials=3)
# auditor.report(results)

The report looks like:

==================================================
POSITION BIAS AUDIT REPORT
==================================================
Mean accuracy:   70.0%
Edge accuracy:   100.0%
Middle accuracy: 33.3%
Bias gap:        66.7pp

Accuracy by position:
  Pos  0: 100.0% ####################
  Pos  1:  66.7% #############
  Pos  2:  66.7% #############
  Pos  3:  33.3% ######
  Pos  4:  33.3% ######
  Pos  5:  33.3% ######
  Pos  6:  66.7% #############
  Pos  7:  66.7% #############
  Pos  8: 100.0% ####################
  Pos  9: 100.0% ####################

SEVERE position bias detected.
Apply position reordering.

A 66.7 percentage-point gap. That’s your signal to use reordering.

{
type: 'exercise',
id: 'reorder-exercise',
title: 'Exercise 1: Implement Reverse-Interleave Reordering',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Implement reverse_interleave(docs) that takes a list already ranked by relevance (index 0 = most relevant) and returns them reordered so the most relevant documents land at the START and END. Strategy: even-indexed docs go to start, odd-indexed docs go to end in reverse order. Example: [A, B, C, D, E] becomes [A, C, E, D, B].',
starterCode: 'def reverse_interleave(docs):\n    """Reorder docs so most relevant are at edges."""\n    start_group = []\n    end_group = []\n    # Split docs by index: even -> start, odd -> end\n\n\n    return start_group + end_group\n\nresult = reverse_interleave(["A", "B", "C", "D", "E"])\nprint(result)',
testCases: [
{ id: 'tc1', input: 'print(reverse_interleave(["A", "B", "C", "D", "E"]))', expectedOutput: "['A', 'C', 'E', 'D', 'B']", description: '5 docs reordered correctly' },
{ id: 'tc2', input: 'print(reverse_interleave(["X", "Y", "Z"]))', expectedOutput: "['X', 'Z', 'Y']", description: '3 docs reordered correctly' },
{ id: 'tc3', input: 'print(reverse_interleave(["P"]))', expectedOutput: "['P']", description: 'Single doc unchanged', hidden: true }
],
hints: [
'Use enumerate() to get both index and document. Even indices (0, 2, 4) go to start_group, odd indices (1, 3, 5) go to end_group.',
'After splitting, reverse end_group before concatenating: start_group = [d for i, d in enumerate(docs) if i % 2 == 0]; end_group reversed.'
],
solution: 'def reverse_interleave(docs):\n    """Reorder docs so most relevant are at edges."""\n    start_group = [d for i, d in enumerate(docs) if i % 2 == 0]\n    end_group = [d for i, d in enumerate(docs) if i % 2 != 0]\n    end_group.reverse()\n    return start_group + end_group\n\nresult = reverse_interleave(["A", "B", "C", "D", "E"])\nprint(result)',
solutionExplanation: 'Even-indexed items (A=0, C=2, E=4) stay at the start. Odd-indexed items (B=1, D=3) go to the end reversed, so D before B. Result: [A, C, E, D, B] puts #1 at position 0 and #2 at position 4 (the edges).',
xpReward: 15,
}

When NOT to Use Position Reordering

Reordering isn’t always the right call. Here are three situations where you should skip it.

Short contexts (under 5 documents). Position bias barely registers with fewer than 5 documents. The model’s attention covers all positions. Reordering adds complexity for zero gain.

Chronological or logical ordering matters. If your documents are a conversation history, a timeline, or step-by-step instructions, reordering breaks the flow. The model needs sequence, and shuffling destroys it.

Single-document retrieval. If you retrieve one document per query, there’s nothing to reorder. Position bias is a multi-document problem.

Tip: A rule of thumb: apply reordering when you stuff 7+ documents into a prompt AND retrieval accuracy matters more than document order. For chat apps with conversation history, keep chronological order. Use summarization instead.
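If you want that rule of thumb in code, here is a hypothetical helper (the name and thresholds are ours, taken straight from the tip above):

```python
def should_reorder(num_docs, order_matters):
    """Hypothetical rule-of-thumb check: reorder only when stuffing
    7+ documents into the prompt AND sequence doesn't matter."""
    return num_docs >= 7 and not order_matters

print(should_reorder(10, False))  # RAG with 10 retrieved chunks -> True
print(should_reorder(12, True))   # chat history: keep order -> False
print(should_reorder(4, False))   # short context -> False
```

Wire it into your pipeline wherever you assemble context, and fall back to the retriever's original order when it returns False.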

Alternative Mitigation Strategies

What if reordering isn’t enough? Here are three more approaches.

Recursive summarization. Don’t stuff 20 documents into one prompt. Summarize them in groups of 5. Feed the 4 summaries into a final prompt. Each context stays small enough to avoid the dead zone.
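A sketch of that flow, with a stand-in summarizer so it runs without an API key (in practice you would pass something like the call_openai wrapper from earlier, prefixed with a summarization instruction):

```python
def recursive_summarize(docs, summarize_fn, group_size=5):
    """Summarize docs in fixed-size groups; return one summary per
    group, ready to feed into a final answering prompt."""
    summaries = []
    for start in range(0, len(docs), group_size):
        group_text = "\n\n".join(docs[start:start + group_size])
        summaries.append(summarize_fn(group_text))
    return summaries

# stand-in summarizer: truncation in place of a real LLM call
fake_summarize = lambda text: text[:30] + "..."
docs = [f"Document {i} body text." for i in range(20)]
summaries = recursive_summarize(docs, fake_summarize)
print(len(summaries))  # 20 docs in groups of 5 -> 4 summaries
```

Each group of 5 is short enough that its contents sit near the edges of their own prompt, and the final prompt sees only 4 summaries.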

Query-aware prompting. Repeat the query at intervals inside the context. Some RAG systems insert “Reminder: the question is about X” every 5 documents. This forces the model to re-engage.

Two-stage retrieval. First retrieve 50 candidates with BM25 or vector search. Then rerank the top 50 with a cross-encoder. The cross-encoder scores each document independently — no position bias because it looks at one document at a time.
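Here is a toy sketch of that pipeline's shape. The keyword-overlap recall and the overlap_score function are placeholders for BM25/vector search and a real cross-encoder; only the two-stage structure is the point.

```python
def two_stage_retrieve(query, corpus, score_fn, recall_k=50, final_k=5):
    """Stage 1: cheap recall. Stage 2: score each candidate
    independently -- the scorer sees one document at a time,
    so no position bias can creep in at this step."""
    qwords = set(query.lower().split())
    recalled = sorted(
        corpus,
        key=lambda d: len(qwords & set(d.lower().split())),
        reverse=True,
    )[:recall_k]
    return sorted(recalled, key=lambda d: score_fn(query, d),
                  reverse=True)[:final_k]

# placeholder scorer: word overlap stands in for a cross-encoder
def overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

corpus = [
    "Zurich population is about 434,000",
    "Tokyo has an extensive rail network",
    "Cairo sits along the Nile",
]
print(two_stage_retrieve("population of Zurich", corpus,
                         overlap_score, final_k=1))
# -> ['Zurich population is about 434,000']
```

You can still apply reorder_for_edges to the final top-k before building the prompt; the two fixes compose.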

Strategy                 Complexity  Effectiveness  Best For
Position reordering      Low         Good           Any multi-doc RAG pipeline
Recursive summarization  Medium      Good           Very long contexts (20+ docs)
Query-aware prompting    Low         Moderate       Quick win, no infra changes
Two-stage retrieval      High        Best           Production systems

{
type: 'exercise',
id: 'query-aware-exercise',
title: 'Exercise 2: Build a Query-Aware Context Builder',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write build_query_aware_context(docs, query, reminder_every=3) that builds a context string from documents, inserting a reminder every N documents. The reminder says: "[Reminder: Focus on answering: {query}]". Place it AFTER every Nth document (not before the first). Return the full context with documents separated by newlines.',
starterCode: 'def build_query_aware_context(docs, query, reminder_every=3):\n    """Insert query reminders every N documents."""\n    parts = []\n    for i, doc in enumerate(docs):\n        parts.append(f"[Doc {i+1}]: {doc}")\n        # Insert reminder after every Nth document\n\n    return "\\n".join(parts)\n\ntest_docs = ["A", "B", "C", "D", "E", "F"]\nresult = build_query_aware_context(test_docs, "test query", 3)\nprint(result)',
testCases: [
{ id: 'tc1', input: 'print(build_query_aware_context(["A","B","C","D","E","F"], "test query", 3))', expectedOutput: '[Doc 1]: A\n[Doc 2]: B\n[Doc 3]: C\n[Reminder: Focus on answering: test query]\n[Doc 4]: D\n[Doc 5]: E\n[Doc 6]: F\n[Reminder: Focus on answering: test query]', description: 'Reminders after every 3rd doc' },
{ id: 'tc2', input: 'print(build_query_aware_context(["X","Y"], "q", 5))', expectedOutput: '[Doc 1]: X\n[Doc 2]: Y', description: 'No reminder when fewer docs than interval' }
],
hints: [
'After appending each document, check if (i + 1) % reminder_every == 0. If true, append the reminder.',
'Condition: if (i + 1) % reminder_every == 0: parts.append(f"[Reminder: Focus on answering: {query}]")'
],
solution: 'def build_query_aware_context(docs, query, reminder_every=3):\n    """Insert query reminders every N documents."""\n    parts = []\n    for i, doc in enumerate(docs):\n        parts.append(f"[Doc {i+1}]: {doc}")\n        if (i + 1) % reminder_every == 0:\n            parts.append(f"[Reminder: Focus on answering: {query}]")\n    return "\\n".join(parts)\n\ntest_docs = ["A", "B", "C", "D", "E", "F"]\nresult = build_query_aware_context(test_docs, "test query", 3)\nprint(result)',
solutionExplanation: 'The modulo check fires after every 3rd document (i=2, i=5). The reminder re-anchors the model on the original question, fighting attention drift that causes lost-in-the-middle failures.',
xpReward: 15,
}

Common Mistakes and How to Fix Them

Mistake 1: Testing with too few distractors

Wrong:

# Only 2 distractors -- no real "middle" exists
distractors_short = ["Doc about Tokyo.", "Doc about Cairo."]

Why it’s wrong: Position bias needs 7+ documents to show up. With 3 documents, the model’s attention covers everything. You’ll conclude “no bias” when the problem simply hasn’t been triggered.

Correct:

# 15 distractors creates a genuine dead zone
distractors_proper = [
    f"Document about city {i} with various facts."
    for i in range(15)
]

Mistake 2: Checking the wrong answer signal

Wrong:

# Too strict -- misses paraphrased answers
found = response == "The population of Zurich is 434,000."

Why it’s wrong: LLMs rephrase constantly. “Zurich has about 434,000 residents” is correct but exact-match misses it entirely.

Correct:

# Check for the key fact, not exact wording
found = "434" in response.lower()
# Even better: check multiple valid signals
signals = ["434,000", "434000", "434k"]
found = any(s in response.lower() for s in signals)
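If you also want matching to be robust to digit grouping, a small normalization step helps. This is a sketch, not part of the script above (`contains_answer` is a name introduced here): it strips commas and spaces between digits before checking for the signal, so “434,000”, “434 000”, and “434000” all count as hits.

```python
import re

def contains_answer(response, signal="434000"):
    """Normalize digit groupings before matching the answer signal."""
    # Remove a comma or space only when it sits between two digits
    digits_only = re.sub(r"(?<=\d)[,\s](?=\d)", "", response)
    return signal in digits_only

print(contains_answer("Zurich has about 434,000 residents."))  # True
print(contains_answer("The population is 434 000."))           # True
print(contains_answer("Tokyo has 37,400,000 people."))         # False
```

The lookbehind/lookahead guards keep ordinary punctuation intact: only separators inside a number are removed.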

Mistake 3: Drawing conclusions from one trial

Wrong:

results = run_position_experiment(distractors, needle, question)
# One run. "Position 4 is fine!" (got lucky)

Why it’s wrong: LLM outputs vary between runs even at temperature=0. A single trial is noise: position 4 might hit on one run and miss the next three.

Correct:

accuracy = run_multiple_trials(
    distractors, needle, question, num_trials=5
)
# 5 trials per position smooths the noise
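To get a feel for how many trials you actually need, a quick back-of-envelope uses the binomial standard error of an accuracy estimate, sqrt(p(1-p)/n). The helper below (`accuracy_se` is a name introduced here for illustration) shows why even 5 trials leaves a wide margin:

```python
import math

def accuracy_se(p, n):
    """Standard error of an accuracy estimate from n yes/no trials."""
    return math.sqrt(p * (1 - p) / n)

# A position with true accuracy 60% measured over 5 trials is
# still uncertain by roughly +/- 22 percentage points:
print(round(accuracy_se(0.6, 5) * 100, 1))   # 21.9
# 20 trials roughly halves that uncertainty:
print(round(accuracy_se(0.6, 20) * 100, 1))  # 11.0
```

Five trials per position is a pragmatic floor for spotting the U-shape; treat small differences between adjacent positions as noise.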

Frequently Asked Questions

Does position bias affect tasks beyond retrieval?

Yes. Summarization suffers too — middle sections get under-represented in the summary. Multi-step reasoning drops intermediate steps. Code generation from long specifications loses requirements in the middle. Any task where the model must attend to the full context is vulnerable.

Does a bigger context window fix the problem?

No. Longer windows make it worse. Liu et al. showed that models tested at 32K context had a deeper U-curve than the same models at 4K. More space means more middle to get lost in. The fix is arrangement, not size.

Can I fine-tune the bias away?

Research like “Found in the Middle” (Hsieh et al., 2024) calibrates positional attention during training and reduces the bias. But that requires model weights access. For API-based models, prompt-level strategies like reordering are your best option.

How does position reordering work with RAG chunk overlap?

Overlapping chunks still benefit from reordering. Each chunk is independently useful. After reordering, some overlap may become non-adjacent, but that’s fine. The key is that the most relevant chunk lands at an edge.
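As a concrete sketch, here is the best-to-the-edges interleave applied to scored chunks (`reorder_scored_chunks` is an illustrative name; the complete script at the end of this post implements the same idea as `reorder_for_edges`). Odd ranks walk in from the front, even ranks walk in from the back, so the least relevant chunks land in the middle:

```python
def reorder_scored_chunks(chunks, scores):
    """Sort chunks by relevance, then interleave toward the edges:
    rank 1 goes first, rank 2 goes last, rank 3 second, and so on."""
    ranked = [c for c, _ in sorted(zip(chunks, scores),
                                   key=lambda x: x[1], reverse=True)]
    front = ranked[0::2]        # ranks 1, 3, 5, ...
    back = ranked[1::2][::-1]   # ranks 2, 4, 6, ... reversed toward the end
    return front + back

chunks = ["c1", "c2", "c3", "c4", "c5"]
scores = [0.2, 0.9, 0.5, 0.8, 0.1]
print(reorder_scored_chunks(chunks, scores))
# -> ['c2', 'c3', 'c5', 'c1', 'c4']  (best first, 2nd-best last)
```

Note that the worst-scoring chunk (`c5`) ends up dead center, exactly where attention is weakest.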

Summary

Position bias is real, measurable, and fixable. Here’s what you built:

  • The needle-in-haystack test quantifies bias by placing a target at every position and measuring retrieval accuracy.
  • The U-shaped curve proves LLMs attend to edges (beginning and end) and ignore the middle.
  • Position reordering pushes relevant documents to the edges, boosting accuracy from ~33% to ~100% in our test.
  • The bias auditor wraps everything into a reusable class for any RAG pipeline.

The bias exists across providers and worsens with longer contexts. Test your own pipeline before shipping.

Practice exercise: Build an extended auditor that tests 3 context sizes (5, 10, 20 documents) and plots all three U-curves on the same chart. Which size shows the deepest valley?

Click to see solution
def multi_size_audit(call_fn, needle, question,
                     answer_signal, sizes=[5, 10, 20]):
    """Audit position bias at multiple context sizes."""
    all_curves = {}
    for size in sizes:
        distrs = [
            f"City {i} has cultural landmarks and a "
            f"diverse population across multiple districts."
            for i in range(size - 1)
        ]
        auditor = PositionBiasAuditor(
            call_fn, distrs, needle, question, answer_signal
        )
        results = auditor.audit(num_trials=3)
        all_curves[size] = results["accuracy_by_position"]

    fig, ax = plt.subplots(figsize=(12, 6))
    for size, curve in all_curves.items():
        x = [i / (len(curve)-1) * 100
             for i in range(len(curve))]
        ax.plot(x, curve, marker="o",
                label=f"{size} documents")
    ax.set_xlabel("Relative Position (%)")
    ax.set_ylabel("Accuracy (%)")
    ax.set_title("Position Bias at Different Context Sizes")
    ax.legend()
    ax.set_ylim(0, 110)
    plt.tight_layout()
    plt.show()

# multi_size_audit(call_openai, needle, question, "434")

The 20-document curve shows the deepest valley. The 5-document curve is nearly flat. More documents means more middle, and more middle means worse accuracy.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code: Lost in the Middle -- Position Bias
# Requires: pip install matplotlib
# Python 3.10+
# Set: OPENAI_API_KEY (required), ANTHROPIC_API_KEY, GEMINI_API_KEY (optional)

import json
import urllib.request
import os
import time
import ssl
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-key-here")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "your-key-here")
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "your-key-here")

def call_openai(messages, model="gpt-4o-mini", temperature=0):
    url = "https://api.openai.com/v1/chat/completions"
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {OPENAI_API_KEY}"}
    payload = json.dumps({"model": model, "messages": messages, "temperature": temperature, "max_tokens": 150}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers=headers)
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["choices"][0]["message"]["content"]

def call_anthropic(messages, model="claude-sonnet-4-20250514", temperature=0):
    url = "https://api.anthropic.com/v1/messages"
    headers = {"Content-Type": "application/json", "x-api-key": ANTHROPIC_API_KEY, "anthropic-version": "2023-06-01"}
    payload = json.dumps({"model": model, "messages": messages, "temperature": temperature, "max_tokens": 150}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers=headers)
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["content"][0]["text"]

def call_gemini(messages, model="gemini-2.0-flash", temperature=0):
    url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={GEMINI_API_KEY}"
    headers = {"Content-Type": "application/json"}
    contents = [{"role": "user" if m["role"]=="user" else "model", "parts": [{"text": m["content"]}]} for m in messages]
    payload = json.dumps({"contents": contents, "generationConfig": {"temperature": temperature, "maxOutputTokens": 150}}).encode("utf-8")
    req = urllib.request.Request(url, data=payload, headers=headers)
    with urllib.request.urlopen(req, context=ssl_context) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result["candidates"][0]["content"]["parts"][0]["text"]

needle = "Zurich is the largest city in Switzerland with a population of approximately 434,000 in the city proper and 1.4 million in the metro area."
distractors = [
    "Tokyo is the capital of Japan, known for cherry blossoms and its extensive rail network.",
    "Cairo sits along the Nile River and is home to the Great Pyramids of Giza.",
    "Vancouver is a coastal city in British Columbia, surrounded by mountains and ocean.",
    "Mumbai is the financial capital of India and home to Bollywood, the largest film industry.",
    "Lagos is the most populous city in Nigeria and a major financial hub for West Africa.",
    "Stockholm is the capital of Sweden, spread across 14 islands with Nobel Prize ceremonies.",
    "Buenos Aires is the capital of Argentina, famous for tango music and dance.",
    "Nairobi is the capital of Kenya and serves as a base for safaris to nearby parks.",
    "Lisbon is the capital of Portugal, built on seven hills overlooking the Tagus River.",
]
question = "Based on the documents provided, what is the population of Zurich?"

def build_context(distractor_docs, needle_doc, position):
    docs = distractor_docs.copy()
    docs.insert(position, needle_doc)
    return "\n\n".join(f"[Document {i+1}]: {d}" for i, d in enumerate(docs))

def run_position_experiment(distractors, needle, question, answer_signal="434"):
    num_positions = len(distractors) + 1
    results = []
    for pos in range(num_positions):
        context = build_context(distractors, needle, pos)
        prompt = f"Use ONLY the documents below to answer.\n\n{context}\n\nQuestion: {question}\nAnswer concisely."
        response = call_openai([{"role": "user", "content": prompt}])
        found = answer_signal.lower() in response.lower()
        results.append({"position": pos, "found": found, "response": response[:200]})
        print(f"Pos {pos:2d}: {'HIT' if found else 'MISS'}")
        time.sleep(0.5)
    return results

def run_multiple_trials(distractors, needle, question, num_trials=3, answer_signal="434"):
    num_positions = len(distractors) + 1
    accuracy = [0] * num_positions
    for trial in range(num_trials):
        print(f"\n--- Trial {trial+1}/{num_trials} ---")
        results = run_position_experiment(distractors, needle, question, answer_signal)
        for r in results:
            if r["found"]:
                accuracy[r["position"]] += 1
    return [c / num_trials * 100 for c in accuracy]

def reorder_for_edges(documents, relevance_scores=None):
    if relevance_scores is not None:
        paired = sorted(zip(documents, relevance_scores), key=lambda x: x[1], reverse=True)
        ranked_docs = [d for d, _ in paired]
    else:
        ranked_docs = documents
    start = [d for i, d in enumerate(ranked_docs) if i % 2 == 0]
    end = [d for i, d in enumerate(ranked_docs) if i % 2 != 0]
    end.reverse()
    return start + end

def plot_u_curve(accuracy_pct):
    positions = list(range(len(accuracy_pct)))
    colors = ["#2ecc71" if a >= 70 else "#f39c12" if a >= 50 else "#e74c3c" for a in accuracy_pct]
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(positions, accuracy_pct, color=colors, edgecolor="white")
    ax.set_xlabel("Needle Position (0 = first document)")
    ax.set_ylabel("Retrieval Accuracy (%)")
    ax.set_title("Lost in the Middle: U-Shaped Accuracy Curve", fontweight="bold")
    ax.set_xticks(positions)
    ax.set_ylim(0, 110)
    ax.axhline(y=50, color="gray", linestyle="--", alpha=0.5, label="50% baseline")
    ax.legend()
    plt.tight_layout()
    plt.savefig("u_curve.png", dpi=150)
    plt.show()

class PositionBiasAuditor:
    def __init__(self, call_fn, distractors, needle, question, answer_signal):
        self.call_fn = call_fn
        self.distractors = distractors
        self.needle = needle
        self.question = question
        self.answer_signal = answer_signal
        self.num_positions = len(distractors) + 1

    def audit(self, num_trials=3):
        accuracy = [0] * self.num_positions
        for trial in range(num_trials):
            for pos in range(self.num_positions):
                ctx = build_context(self.distractors, self.needle, pos)
                prompt = f"Use ONLY the documents below.\n\n{ctx}\n\nQuestion: {self.question}\nAnswer concisely."
                try:
                    resp = self.call_fn([{"role": "user", "content": prompt}])
                    hit = self.answer_signal.lower() in resp.lower()
                except Exception:
                    hit = False
                if hit:
                    accuracy[pos] += 1
            time.sleep(0.5)
        pct = [c / num_trials * 100 for c in accuracy]
        return {"accuracy_by_position": pct, "mean_accuracy": sum(pct)/len(pct),
                "edge_accuracy": (pct[0]+pct[-1])/2, "middle_accuracy": pct[self.num_positions//2],
                "bias_gap": (pct[0]+pct[-1])/2 - pct[self.num_positions//2]}

    def report(self, results):
        print(f"\nMean: {results['mean_accuracy']:.1f}% | Edge: {results['edge_accuracy']:.1f}% | Middle: {results['middle_accuracy']:.1f}% | Gap: {results['bias_gap']:.1f}pp")
        for i, acc in enumerate(results["accuracy_by_position"]):
            print(f"  Pos {i:2d}: {acc:5.1f}% {'#'*int(acc/5)}")

if __name__ == "__main__":
    print("Running position bias experiment...")
    accuracy_pct = run_multiple_trials(distractors, needle, question, num_trials=3)
    plot_u_curve(accuracy_pct)
    print("\nRunning auditor...")
    auditor = PositionBiasAuditor(call_openai, distractors, needle, question, "434")
    results = auditor.audit(num_trials=3)
    auditor.report(results)
    print("\nScript completed successfully.")

References

  1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173. Link
  2. Hsieh, C.-Y., et al. (2024). Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. Link
  3. Kamradt, G. (2023). Needle In A Haystack — Pressure Testing LLMs. Link
  4. OpenAI API Reference — Chat Completions. Link
  5. Anthropic API Reference — Messages. Link
  6. Google Gemini API Reference — generateContent. Link
  7. Su, J., et al. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568. Link
  8. Arize AI. (2024). The Needle In a Haystack Test: Evaluating LLM RAG Systems. Link