
Temperature, Top-p & Top-k in LLMs Explained (Python)

Master LLM temperature, top-k, and top-p with interactive Python simulations. Runnable code, exercises, and a sampling playground to build real intuition.

Written by Selva Prabhakaran | 23 min read


This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

Build interactive NumPy simulations that show how temperature, top-k, and nucleus sampling reshape a language model’s word choices.

Every time you call an LLM API, you pass parameters like temperature=0.7 or top_p=0.9. You’ve probably tweaked them, noticed vaguely different outputs, and moved on. But do you actually know what they do to the probability distribution?

Most people don’t. They’re tuning knobs blindfolded.

This article fixes that. We’ll build NumPy simulations that let you see — with real numbers — how each parameter reshapes next-token probabilities. No API keys needed. Everything runs in pure Python.

What Happens Before Sampling: Logits and Softmax

Before we touch any sampling parameter, you need to know where probabilities come from. I find this is the step most tutorials skip, and it causes confusion later.

An LLM doesn’t output probabilities directly. It outputs logits — raw scores, one per token in its vocabulary. A positive logit means “likely.” A negative logit means “unlikely.”

But logits aren’t probabilities. They don’t sum to 1.

The softmax function fixes that. It exponentiates each logit, then divides by the total. Larger logits get bigger probabilities. Smaller ones get squeezed toward zero.

Let’s build this from scratch. We’ll create a mock vocabulary of 8 tokens with hand-picked logits, then apply softmax. The softmax() function exponentiates each logit and normalizes so everything sums to 1.

import numpy as np

np.random.seed(42)

# Mock vocabulary: 8 tokens with raw logit scores
tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

def softmax(logits):
    """Convert raw logits to probabilities."""
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

probs = softmax(logits)

print("Token        Logit    Probability")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:<12} {logit:>6.1f}    {prob:.4f}")
print(f"\nSum of probabilities: {probs.sum():.4f}")

Output:

Token        Logit    Probability
--------------------------------------
the           2.0    0.3813
a             1.0    0.1403
cat           1.5    0.2313
dog           0.5    0.0851
sat          -0.5    0.0313
ran          -1.0    0.0190
on            0.3    0.0697
happy        -0.2    0.0423

Sum of probabilities: 1.0000

“the” (logit 2.0) grabbed 38% of the probability. “ran” (logit -1.0) got just 1.9%. Softmax preserves the ranking but creates an exponential gap between high and low scorers.

KEY INSIGHT: Every sampling parameter works by modifying either the logits before softmax or the probabilities after softmax. Understanding softmax is the foundation for everything else in this article.
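A side note on the implementation: softmax() subtracts np.max(logits) before exponentiating. That's a numerical-stability trick, not a math change — softmax is invariant to shifting all logits by a constant, and subtracting the max keeps np.exp from overflowing on large logits. A quick check:

```python
import numpy as np

# Same softmax as above; the max-subtraction is purely for numerical stability
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# Softmax is shift-invariant: adding a constant to every logit changes nothing,
# so the subtraction is safe — and it prevents overflow for huge logits
print(np.allclose(softmax(logits), softmax(logits + 100.0)))  # True
```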

Temperature: The Confidence Dial

Here’s a question that trips up most people: what does temperature actually do?

It divides every logit by a single number before softmax runs. That’s it. The formula is:

\[p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}\]

Where:
– \(p_i\) = probability of token \(i\)
– \(z_i\) = the raw logit for token \(i\)
– \(T\) = temperature (a positive number)
– The sum runs over all tokens in the vocabulary

If math isn’t your thing, skip to the code below — it tells the same story with real numbers.

When \(T = 1.0\), nothing changes. You get the default softmax. When \(T < 1.0\), you divide by a small number. That amplifies the gaps between logits. The top token dominates even more.

When \(T > 1.0\), you divide by a large number. That compresses logits toward zero. The distribution flattens, giving low-probability tokens a better chance.

Quick check: Before you run the next block, predict this. If we set \(T=0.5\), will "the" get more or less than its original 38.1%?

We'll apply softmax at three temperatures — 0.5 (sharp), 1.0 (default), and 2.0 (flat). The softmax_with_temperature() function divides logits by $T$ before calling softmax.

def softmax_with_temperature(logits, temperature):
    """Apply temperature scaling before softmax."""
    scaled = logits / temperature
    return softmax(scaled)

temperatures = [0.5, 1.0, 2.0]

print(f"{'Token':<10}", end="")
for t in temperatures:
    print(f"  T={t:<4}", end="")
print()
print("-" * 44)

for i, token in enumerate(tokens):
    print(f"{token:<10}", end="")
    for t in temperatures:
        p = softmax_with_temperature(logits, t)
        print(f"  {p[i]:.4f}", end="")
    print()

Output:

Token       T=0.5   T=1.0   T=2.0
--------------------------------------------
the         0.6220  0.3813  0.2423
a           0.0842  0.1403  0.1470
cat         0.2287  0.2313  0.1887
dog         0.0310  0.0851  0.1145
sat         0.0042  0.0313  0.0694
ran         0.0015  0.0190  0.0541
on          0.0208  0.0697  0.1036
happy       0.0076  0.0423  0.0807

If you predicted "more" — you're right. At \(T=0.5\), "the" jumps from 38.1% to 62.2%. Meanwhile, "ran" shrinks from 1.9% to 0.15% — roughly a 35x difference between low and high temperature for a single token.

That's the tradeoff. Low temperature sharpens. High temperature flattens.
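One way to quantify "sharpens vs. flattens" is Shannon entropy: low entropy means the mass is concentrated on few tokens, high entropy means it's spread out. This sketch (the entropy_bits() helper is ours, not a library function) measures entropy at several temperatures:

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def entropy_bits(probs):
    """Shannon entropy in bits; the maximum for 8 tokens is log2(8) = 3."""
    p = probs[probs > 0]
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# Entropy rises monotonically with temperature for a fixed logit vector
for t in [0.5, 1.0, 2.0, 5.0]:
    h = entropy_bits(softmax(logits / t))
    print(f"T={t}: {h:.3f} bits")
```

Entropy climbs steadily toward the 3-bit maximum as temperature rises — a compact way to compare settings without eyeballing full probability tables.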

TIP: Use temperature=0.0 to 0.3 for factual Q&A and code generation. Use 0.7 to 1.0 for creative writing. Going above 1.5 is rarely useful — the output gets incoherent.

What Does Temperature = 0 Mean?

At \(T = 0\), the model always picks the highest-logit token. No randomness at all. This is called greedy decoding. APIs handle \(T=0\) as a special case — they skip sampling and return the argmax.

def greedy_decode(logits, tokens):
    """Temperature = 0: always pick the highest logit."""
    best_idx = np.argmax(logits)
    return tokens[best_idx], logits[best_idx]

winner, score = greedy_decode(logits, tokens)
print(f"Greedy pick: '{winner}' (logit: {score:.1f})")

Output:

Greedy pick: 'the' (logit: 2.0)

No surprises. "the" had the highest logit. Greedy decoding always picks it. Every single time. Zero variety.

Now that you understand how temperature reshapes the distribution, let's practice.

Exercise 1: Build a Temperature Sweep

{
  type: 'exercise',
  id: 'temp-sweep-ex1',
  title: 'Exercise 1: Build a Temperature Sweep',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `temperature_sweep(logits, temps)` that takes a logit array and a list of temperatures. For each temperature, compute the softmax probabilities and return the probability of the *highest-logit* token at each temperature. Return a list of floats.\n\nFor example, with logits `[2.0, 1.0, 0.5]` and temps `[0.5, 1.0, 2.0]`, the highest-logit token is index 0. Return its probability at each temperature.',
  starterCode: 'def temperature_sweep(logits, temps):\n    """Return the probability of the top token at each temperature."""\n    top_idx = np.argmax(logits)\n    result = []\n    for t in temps:\n        # Apply temperature scaling and softmax\n        # Append probability of top_idx to result\n        pass\n    return result\n\n# Test\nlogits_test = np.array([2.0, 1.0, 0.5])\ntemps_test = [0.5, 1.0, 2.0]\nprint(temperature_sweep(logits_test, temps_test))',
  testCases: [
    { id: 'tc1', input: 'logits_test = np.array([2.0, 1.0, 0.5])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0, 2.0])])', expectedOutput: '[0.8438, 0.6285, 0.4810]', description: 'Standard test with three temperatures' },
    { id: 'tc2', input: 'logits_test = np.array([3.0, 3.0, 3.0])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0])])', expectedOutput: '[0.3333, 0.3333]', description: 'Equal logits — temperature should not matter' },
  ],
  hints: [
    'Inside the loop, compute scaled_logits = logits / t, then apply softmax to get probabilities.',
    'Full line: probs = softmax(logits / t); result.append(probs[top_idx])',
  ],
  solution: 'def temperature_sweep(logits, temps):\n    top_idx = np.argmax(logits)\n    result = []\n    for t in temps:\n        probs = softmax(logits / t)\n        result.append(probs[top_idx])\n    return result',
  solutionExplanation: 'For each temperature, divide the logits by T, apply softmax, and grab the probability at the index of the highest original logit. When logits are equal, softmax returns uniform probabilities regardless of temperature — the top token always gets 1/n.',
  xpReward: 15,
}

Top-k Filtering: Limiting the Candidate Pool

Temperature adjusts how spread out the probabilities are. But even at low temperature, the model still considers every token in its vocabulary.

Why is that a problem? An LLM's vocabulary is typically 32,000 to 128,000 tokens. Even with low temperature, there's a long tail with tiny probabilities. On rare occasions, the model samples from that tail and produces nonsense.
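To see how often that tail fires in practice, here's a small simulation on our 8-token toy distribution — illustrative only, since a real vocabulary has tens of thousands of tail tokens:

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

np.random.seed(42)
probs = softmax(logits)
draws = np.random.choice(len(tokens), size=10_000, p=probs)

# "sat" and "ran" are the two least likely tokens (~3.1% + ~1.9% = ~5% combined)
tail_hits = np.isin(draws, [tokens.index("sat"), tokens.index("ran")]).sum()
print(f"Tail tokens sampled {tail_hits} times out of 10,000")
```

Roughly one draw in twenty lands in the tail — rare per token, but inevitable over a long generation.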

Top-k filtering takes a different approach. It keeps only the $k$ most probable tokens and zeroes out everything else. Then it renormalizes so the survivors sum to 1.

The top_k_filter() function finds the $k$ largest logits. Everything else gets set to negative infinity, which softmax converts to zero probability.

def top_k_filter(logits, k):
    """Keep only the top-k logits; set the rest to -inf."""
    top_k_indices = np.argsort(logits)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_indices] = logits[top_k_indices]
    return filtered

for k in [3, 5]:
    filtered = top_k_filter(logits, k)
    probs = softmax(filtered)
    print(f"\nTop-{k} filtering:")
    print(f"{'Token':<10} {'Logit':>8} {'Prob':>8}")
    print("-" * 28)
    for token, logit, prob in zip(tokens, filtered, probs):
        if prob > 0:
            print(f"{token:<10} {logit:>8.1f} {prob:>8.4f}")

Output:

Top-3 filtering:
Token         Logit     Prob
----------------------------
the            2.0   0.5066
a              1.0   0.1863
cat            1.5   0.3072

Top-5 filtering:
Token         Logit     Prob
----------------------------
the            2.0   0.4202
a              1.0   0.1546
cat            1.5   0.2549
dog            0.5   0.0937
on             0.3   0.0768

With \(k=3\), only "the", "cat", and "a" survive. "the" jumps from 38.1% to 50.7% because it no longer shares mass with five eliminated tokens.

WARNING: Setting \(k\) too low (like 1 or 2) makes output extremely repetitive. Setting it too high (like 1000) has almost no effect. A typical value is \(k=40\) to \(50\).

The Problem with Top-k: One Size Doesn't Fit All

Here's a subtlety most tutorials skip. Top-k uses a fixed cutoff regardless of context.

Sometimes the model is confident — one token has 90% probability. Even \(k=5\) is too generous, including tokens the model doesn't want.

Other times, the model is uncertain. The top 20 tokens each have about 5%. With \(k=5\), you'd cut out 15 perfectly valid options.

Top-k can't adapt. That's exactly what top-p solves.

Top-p (Nucleus Sampling): The Adaptive Cutoff

Top-p, also called nucleus sampling, was introduced by Holtzman et al. (2020). It doesn't keep a fixed count of tokens. Instead, it keeps the smallest set whose cumulative probability reaches a threshold $p$.

Here's how it works. Sort tokens by probability, highest to lowest. Walk down the list, adding up probabilities. When the running total crosses $p$, stop. Everything below gets eliminated.

This adapts automatically. When the model is confident, the nucleus is tiny — maybe 2-3 tokens. When uncertain, it grows to include many options.

The top_p_filter() function sorts tokens by probability and computes a running total. np.searchsorted finds where the total first reaches $p$. Everything past that point becomes negative infinity.

def top_p_filter(logits, p):
    """Keep the smallest set of tokens with cumulative prob >= p."""
    probs = softmax(logits)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)

    cutoff_idx = np.searchsorted(cumulative, p) + 1
    cutoff_idx = min(cutoff_idx, len(probs))

    kept_indices = sorted_indices[:cutoff_idx]
    filtered = np.full_like(logits, -np.inf)
    filtered[kept_indices] = logits[kept_indices]
    return filtered, kept_indices

for p_val in [0.8, 0.95]:
    filtered, kept = top_p_filter(logits, p_val)
    probs_filtered = softmax(filtered)
    print(f"\nTop-p = {p_val}  (nucleus size: {len(kept)} tokens)")
    print(f"{'Token':<10} {'Prob':>8}")
    print("-" * 20)
    for idx in kept:
        print(f"{tokens[idx]:<10} {probs_filtered[idx]:>8.4f}")

Output:

Top-p = 0.8  (nucleus size: 4 tokens)
Token          Prob
--------------------
the         0.4551
cat         0.2760
a           0.1674
dog         0.1015

Top-p = 0.95  (nucleus size: 7 tokens)
Token          Prob
--------------------
the         0.3886
cat         0.2357
a           0.1430
dog         0.0867
on          0.0710
happy       0.0431
sat         0.0319

With \(p=0.8\), four tokens make the cut. Cumulative: "the"(0.38) + "cat"(0.23) + "a"(0.14) = 0.75. That's under 0.80. Adding "dog"(0.09) pushes it to 0.84 — past the threshold.

At \(p=0.95\), the nucleus grows to 7 tokens: the top 6 cover only 0.9497 of the mass, so "sat" is needed to cross 95%.
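You can reproduce that running total directly with np.cumsum (same toy logits as above):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

probs = softmax(logits)
order = np.argsort(probs)[::-1]        # token indices, most probable first
cumulative = np.cumsum(probs[order])   # running total down the ranking

# The first four ranks are enough to cross p = 0.8
for rank in range(4):
    idx = order[rank]
    print(f"{rank + 1}. {tokens[idx]:<6} prob={probs[idx]:.4f}  cumulative={cumulative[rank]:.4f}")
```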

KEY INSIGHT: Top-p adapts to the model's confidence. When one token dominates, the nucleus shrinks. When the model is uncertain, it grows. This is why top-p generally produces more natural text than a fixed top-k.

Top-k vs Top-p: When Does Each Win?

  • Model is confident (one token dominates): top-k keeps k tokens even though most are unwanted; top-p shrinks to 2-3 tokens automatically. Winner: top-p.
  • Model is uncertain (flat distribution): top-k cuts valid options if k is too small; top-p grows to include all reasonable tokens. Winner: top-p.
  • You need a hard ceiling on candidates: top-k guarantees at most k tokens; top-p's nucleus size varies per step. Winner: top-k.
  • Computational simplicity: top-k just sorts and slices; top-p requires a cumulative sum and a threshold check. Winner: top-k.

In practice, top-p is the better default for most tasks. Use top-k as a safety ceiling on top of it.

Predict this: If you applied temperature=0.5 first (which sharpens the distribution), then top-p=0.9 — would the nucleus be bigger or smaller than at temperature=1.0?

Smaller. A sharper distribution means fewer tokens cover 90% of the mass.

Exercise 2: Build a Nucleus Size Calculator

{
  type: 'exercise',
  id: 'nucleus-size-ex2',
  title: 'Exercise 2: Nucleus Size at Different Temperatures',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `nucleus_size(logits, temperature, p)` that returns how many tokens are in the nucleus (top-p set) for a given temperature and p threshold.\n\nApply temperature scaling first, compute softmax probabilities, sort descending, compute cumulative sum, and count how many tokens it takes to reach the cumulative threshold p.',
  starterCode: 'def nucleus_size(logits, temperature, p):\n    """Count tokens in the nucleus after temperature scaling."""\n    scaled = logits / temperature\n    probs = softmax(scaled)\n    sorted_probs = np.sort(probs)[::-1]\n    cumulative = 0.0\n    count = 0\n    # YOUR CODE HERE: loop through sorted_probs\n    return count\n\n# Test\ntest_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))\nprint(nucleus_size(test_logits, 2.0, 0.9))',
  testCases: [
    { id: 'tc1', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))', expectedOutput: '3', description: 'Low temperature — nucleus should be small' },
    { id: 'tc2', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 2.0, 0.9))', expectedOutput: '7', description: 'High temperature — nucleus should be larger' },
  ],
  hints: [
    'Use a for loop: for prob in sorted_probs: cumulative += prob; count += 1; if cumulative >= p: break',
    'Full solution body: for prob in sorted_probs:\n    cumulative += prob\n    count += 1\n    if cumulative >= p:\n        break\nreturn count',
  ],
  solution: 'def nucleus_size(logits, temperature, p):\n    scaled = logits / temperature\n    probs = softmax(scaled)\n    sorted_probs = np.sort(probs)[::-1]\n    cumulative = 0.0\n    count = 0\n    for prob in sorted_probs:\n        cumulative += prob\n        count += 1\n        if cumulative >= p:\n            break\n    return count',
  solutionExplanation: 'Temperature scaling changes how peaked the distribution is. At low temperature (0.5), the top tokens grab most of the mass, so fewer tokens reach 90%. At high temperature (2.0), the distribution flattens, requiring more tokens to accumulate 90% of probability.',
  xpReward: 15,
}

Combining Parameters: Temperature + Top-k + Top-p

In most API calls, you can set all three at once. The order of operations matters — and varies slightly by provider. The most common convention is:

  1. Temperature is applied first — reshapes logits before softmax
  2. Top-k is applied next — removes low-ranking tokens
  3. Top-p is applied last — trims to a cumulative threshold

The sample_with_params() function chains all three steps, then picks one token. We'll run 1000 samples with different combos to see how diversity changes.

def sample_with_params(logits, tokens, temperature=1.0, top_k=None, top_p=None):
    """Full sampling pipeline: temperature -> top-k -> top-p -> sample."""
    if temperature == 0:
        # Special-case T=0 as greedy decoding, mirroring API behavior
        best_idx = np.argmax(logits)
        one_hot = np.zeros_like(logits)
        one_hot[best_idx] = 1.0
        return tokens[best_idx], one_hot
    scaled_logits = logits / temperature

    if top_k is not None:
        scaled_logits = top_k_filter(scaled_logits, top_k)
    if top_p is not None:
        scaled_logits, _ = top_p_filter(scaled_logits, top_p)

    final_probs = softmax(scaled_logits)
    chosen_idx = np.random.choice(len(tokens), p=final_probs)
    return tokens[chosen_idx], final_probs

np.random.seed(42)
configs = [
    {"temperature": 0.3, "top_k": None, "top_p": None},
    {"temperature": 1.0, "top_k": 5,    "top_p": None},
    {"temperature": 0.7, "top_k": None, "top_p": 0.9},
    {"temperature": 0.7, "top_k": 5,    "top_p": 0.9},
]

for config in configs:
    counts = {}
    for _ in range(1000):
        token, _ = sample_with_params(logits, tokens, **config)
        counts[token] = counts.get(token, 0) + 1

    label = f"T={config['temperature']}"
    if config['top_k']:
        label += f", k={config['top_k']}"
    if config['top_p']:
        label += f", p={config['top_p']}"

    sorted_counts = sorted(counts.items(), key=lambda x: -x[1])
    top3 = ", ".join(f"{t}: {c/10:.1f}%" for t, c in sorted_counts[:3])
    unique = len(counts)
    print(f"{label:<20} -> {unique} unique | Top 3: {top3}")

The pattern is clear. Low temperature concentrates mass on the top token. Adding top-k or top-p narrows the pool further. Combining all three gives you the most controlled output.

Visualizing the Full Picture

A side-by-side bar chart makes the differences jump out. Each subplot shows each token's probability under a different configuration. Blue bars are active tokens; gray bars were eliminated.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("How Sampling Parameters Reshape Probabilities", fontsize=14)

configs_viz = [
    ("Default (T=1.0)", {"temperature": 1.0}),
    ("Low temp (T=0.3)", {"temperature": 0.3}),
    ("Top-k=3", {"temperature": 1.0, "top_k": 3}),
    ("Top-p=0.8", {"temperature": 1.0, "top_p": 0.8}),
]

for ax, (title, params) in zip(axes.flat, configs_viz):
    temp = params.get("temperature", 1.0)
    scaled = logits / temp
    if "top_k" in params:
        scaled = top_k_filter(scaled, params["top_k"])
    if "top_p" in params:
        scaled, _ = top_p_filter(scaled, params["top_p"])
    probs_viz = softmax(scaled)

    colors = ["steelblue" if p > 0.001 else "lightgray" for p in probs_viz]
    ax.bar(tokens, probs_viz, color=colors)
    ax.set_title(title)
    ax.set_ylim(0, 0.7)
    ax.set_ylabel("Probability")

plt.tight_layout()
plt.show()

Default softmax spreads probability across all 8 tokens. Low temperature creates a sharp spike on "the". Top-k=3 keeps only three bars. Top-p=0.8 keeps four bars, adapting to the actual distribution shape.

TIP: Top-p and top-k behave differently when distributions change. With a flat distribution (uncertain model), top-p keeps more tokens. With a peaked one (confident model), top-p keeps fewer. That's its advantage.
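That adaptivity is easy to demonstrate with two synthetic logit vectors — one peaked, one perfectly flat (both made up purely for contrast; nucleus_count() here is a stripped-down version of the top-p logic above):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def nucleus_count(logits, p):
    """How many tokens survive top-p filtering at threshold p."""
    sorted_probs = np.sort(softmax(logits))[::-1]
    return int(min(np.searchsorted(np.cumsum(sorted_probs), p) + 1, len(logits)))

peaked = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # confident model
flat = np.ones(8)                                            # maximally uncertain

print(nucleus_count(peaked, 0.9))  # 2 — the top token alone holds ~89%
print(nucleus_count(flat, 0.9))    # 8 — every token is needed to reach 90%
```

Same threshold, wildly different nucleus sizes. A fixed top-k can't make that distinction.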

The Sampling Playground: Experiment Yourself

Here's the fun part. The function below runs a full simulation. Set temperature, top-k, and top-p. It shows surviving tokens with an ASCII bar chart, then draws 20 random samples.

The sampling_playground() chains all three operations, prints the probability table, then samples tokens so you can see the variety at each setting.

def sampling_playground(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=20):
    """Interactive playground: see distribution + sample tokens."""
    print(f"=== Settings: T={temperature}", end="")
    if top_k: print(f", k={top_k}", end="")
    if top_p: print(f", p={top_p}", end="")
    print(" ===\n")

    scaled = logits / temperature
    if top_k:
        scaled = top_k_filter(scaled, top_k)
    if top_p:
        scaled, _ = top_p_filter(scaled, top_p)

    probs = softmax(scaled)
    active = [(tokens[i], probs[i]) for i in range(len(tokens)) if probs[i] > 0.001]

    print(f"Active tokens: {len(active)}")
    for t, p in sorted(active, key=lambda x: -x[1]):
        bar = "█" * int(p * 50)
        print(f"  {t:<10} {p:.4f}  {bar}")

    samples = [tokens[np.random.choice(len(tokens), p=probs)] for _ in range(n_samples)]
    print(f"\n{n_samples} samples: {', '.join(samples)}\n")

np.random.seed(42)
sampling_playground(logits, tokens, temperature=0.5, top_k=5, top_p=0.9)
sampling_playground(logits, tokens, temperature=1.5, top_k=None, top_p=0.95)

The contrast is dramatic. The conservative setting produces repetitive output dominated by "the". The liberal setting keeps many tokens active and produces varied sequences.

This is exactly the kind of experimentation you should do before tuning API parameters. Build a toy simulation first. See the diversity. Then decide what fits.

Exercise 3: Complete Sampling Pipeline

{
  type: 'exercise',
  id: 'full-pipeline-ex3',
  title: 'Exercise 3: Sample With Full Pipeline',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `count_unique_tokens(logits, tokens, temperature, top_k, top_p, n_samples)` that runs the full sampling pipeline n_samples times and returns the number of unique tokens sampled.\n\nUse the helper functions already defined: `softmax()`, `top_k_filter()`, and `top_p_filter()`.',
  starterCode: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n    """Run sampling n_samples times, return count of unique tokens."""\n    np.random.seed(0)\n    seen = set()\n    for _ in range(n_samples):\n        scaled = logits / temperature\n        # Apply top-k if specified\n        # Apply top-p if specified\n        # Compute final probs and sample\n        # Add sampled token to seen\n        pass\n    return len(seen)\n\ntokens_test = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]\nlogits_test = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))',
  testCases: [
    { id: 'tc1', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))', expectedOutput: '3', description: 'With top_k=3, at most 3 unique tokens can appear' },
    { id: 'tc2', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=2.0, top_p=0.99))', expectedOutput: '8', description: 'High temp + high top_p should produce all 8 tokens' },
  ],
  hints: [
    'After scaling, use: if top_k: scaled = top_k_filter(scaled, top_k). Same pattern for top_p.',
    'After filtering: probs = softmax(scaled); idx = np.random.choice(len(tokens), p=probs); seen.add(tokens[idx])',
  ],
  solution: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n    np.random.seed(0)\n    seen = set()\n    for _ in range(n_samples):\n        scaled = logits / temperature\n        if top_k:\n            scaled = top_k_filter(scaled, top_k)\n        if top_p:\n            scaled, _ = top_p_filter(scaled, top_p)\n        probs = softmax(scaled)\n        idx = np.random.choice(len(tokens), p=probs)\n        seen.add(tokens[idx])\n    return len(seen)',
  solutionExplanation: 'The pipeline chains temperature -> top-k -> top-p -> sample. With aggressive filtering (low temp + small k), only a handful of tokens can ever be sampled. With permissive settings (high temp + high p), the full vocabulary opens up.',
  xpReward: 20,
}

When to Use Each Parameter: A Practical Guide

I've seen developers default to temperature=0.7 for everything. That works surprisingly often, but you're leaving control on the table.

Temperature is your primary dial. It controls overall creativity vs. determinism.

Top-k is a safety net. It stops the model from sampling extreme tail tokens. I use \(k=40\) as a default and rarely change it.

Top-p is the smart filter. It adapts to the model's confidence automatically. For most tasks, \(p=0.9\) is a solid starting point.

Task              Temperature  Top-k   Top-p  Why
-------------------------------------------------------------------------------
Code generation   0.0-0.2      10-20   0.8    You want correctness, not creativity
Factual Q&A       0.0-0.3      20-40   0.85   Accuracy matters most
Summarization     0.3-0.5      40      0.9    Slight variation is fine
Creative writing  0.7-1.0      50-100  0.95   Diverse word choices improve quality
Brainstorming     1.0-1.5      100+    0.98   You want wild, unexpected ideas

WARNING: Don't set temperature high AND top-p close to 1.0 simultaneously. That produces chaotic output. Raise one, not both.

Common Mistakes and How to Fix Them

Mistake 1: Setting temperature to 0 and expecting variety

Wrong:

# "Why does the model always give the same answer?"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku"}],
    temperature=0.0  # greedy decoding — ZERO randomness
)

Why it's wrong: Temperature 0 means greedy decoding. Same input, same output. Always.

Fix: Use temperature=0.8 or higher for creative tasks.


Mistake 2: Maxing out both temperature and top-p

Wrong:

# Double randomness — chaotic output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this"}],
    temperature=1.5,
    top_p=0.99
)

Why it's wrong: High temperature flattens the distribution. High top-p keeps nearly all tokens. The model samples almost uniformly. Output reads like word salad.

Fix: Raise one, not both. Try temperature=0.9, top_p=0.9.


Mistake 3: Adding parameters you haven't tested

Some providers set top-p to 1.0 by default. That means it has no effect. If temperature alone gives good results, don't pile on top-k and top-p "just because." Every parameter adds complexity. Add it only when testing confirms it helps.
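You can confirm that top_p=1.0 is a no-op with the same nucleus logic used earlier (a sketch, reusing our toy logits):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def nucleus_count(logits, p):
    """How many tokens survive top-p filtering at threshold p."""
    sorted_probs = np.sort(softmax(logits))[::-1]
    return int(min(np.searchsorted(np.cumsum(sorted_probs), p) + 1, len(logits)))

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

print(nucleus_count(logits, 1.0))  # 8 — p=1.0 keeps the whole vocabulary
print(nucleus_count(logits, 0.9))  # 5 — lowering p actually filters
```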

Frequently Asked Questions

Can I use temperature and top-p together?

Yes. Most APIs accept both. Temperature reshapes the distribution first, then top-p trims it. I recommend temperature as your primary control. Only tweak top-p when you need finer tail behavior.

What's the difference between top-p and min-p?

Min-p is a newer method. It sets a minimum probability threshold relative to the top token. Say the top token has probability 0.8 and you set min-p=0.1. Only tokens above 0.08 (10% of 0.8) survive. It doesn't need sorting, so it's faster. Several open-source engines support it.
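A minimal sketch of that rule on our toy distribution (this follows the common min-p definition; exact behavior varies by engine, and min_p_filter() is our own illustrative helper):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * top_prob, then renormalize."""
    threshold = min_p * probs.max()   # cutoff scales with model confidence
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])
probs = softmax(logits)

filtered = min_p_filter(probs, 0.1)
print(f"Threshold: {0.1 * probs.max():.4f}")
print(f"Survivors: {np.count_nonzero(filtered)} of {len(probs)}")
```

Note there's no sort: one max, one comparison. That's where the speed advantage comes from.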

Do these parameters affect reasoning quality?

Indirectly, yes. For chain-of-thought reasoning, lower temperature (0.0--0.3) produces more consistent logic. Higher temperatures can cause the model to lose its reasoning thread. For math, code, or structured outputs, keep temperature low.

What happens if I set top-k=1?

That's greedy decoding. You always pick the single most probable token. Same effect as temperature=0, though the mechanism differs. Output is completely deterministic.

Summary

Sampling parameters control the tradeoff between predictability and diversity. Here's the core mental model:

  • Temperature reshapes logits before softmax. Low sharpens, high flattens.
  • Top-k keeps the $k$ highest-probability tokens. Simple but rigid.
  • Top-p keeps the smallest set covering cumulative probability $p$. It adapts to the model's confidence.
  • In practice, use temperature as your primary knob. Add top-p for fine-tuning. Use top-k as a safety net.

Practice exercise: Build a function that finds the "crossover temperature" — the $T$ where the second-most-likely token's probability first exceeds 20%. This tells you how much temperature it takes to make the model genuinely uncertain.

Click to see solution
def find_crossover_temperature(logits, threshold=0.2):
    """Find the temperature where the 2nd-ranked token exceeds threshold."""
    second_idx = np.argsort(logits)[-2]
    for temp in np.arange(0.1, 5.0, 0.1):
        probs = softmax(logits / temp)
        if probs[second_idx] >= threshold:
            return round(temp, 1)
    return None  # threshold never reached in the sweep range

logits_test = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])
crossover = find_crossover_temperature(logits_test)
print(f"Crossover temperature: {crossover}")

The function sweeps from 0.1 upward, checking the second-ranked token at each step. For our logits, "cat" crosses 20% at \(T=0.4\). A threshold of 25% would return None here: "cat" peaks near 24% around \(T \approx 0.7\), then flattens back toward the uniform 1/8 as temperature grows.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: LLM Sampling — Temperature, Top-k, and Top-p
# Requires: pip install numpy matplotlib
# Python 3.9+

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Mock Vocabulary ---
tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# --- Core Functions ---
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def softmax_with_temperature(logits, temperature):
    return softmax(logits / temperature)

def top_k_filter(logits, k):
    top_k_indices = np.argsort(logits)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_indices] = logits[top_k_indices]
    return filtered

def top_p_filter(logits, p):
    probs = softmax(logits)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)
    cutoff_idx = min(np.searchsorted(cumulative, p) + 1, len(probs))
    kept_indices = sorted_indices[:cutoff_idx]
    filtered = np.full_like(logits, -np.inf)
    filtered[kept_indices] = logits[kept_indices]
    return filtered, kept_indices

def sample_with_params(logits, tokens, temperature=1.0, top_k=None, top_p=None):
    if temperature == 0:
        # T=0 is greedy decoding: return the argmax token deterministically
        best_idx = np.argmax(logits)
        one_hot = np.zeros_like(logits)
        one_hot[best_idx] = 1.0
        return tokens[best_idx], one_hot
    scaled_logits = logits / temperature
    if top_k is not None:
        scaled_logits = top_k_filter(scaled_logits, top_k)
    if top_p is not None:
        scaled_logits, _ = top_p_filter(scaled_logits, top_p)
    final_probs = softmax(scaled_logits)
    chosen_idx = np.random.choice(len(tokens), p=final_probs)
    return tokens[chosen_idx], final_probs

# --- Section 1: Softmax ---
probs = softmax(logits)
print("Token        Logit    Probability")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:<12} {logit:>6.1f}    {prob:.4f}")

# --- Section 2: Temperature ---
print("\n--- Temperature Comparison ---")
for t in [0.5, 1.0, 2.0]:
    p = softmax_with_temperature(logits, t)
    top_token = tokens[np.argmax(p)]
    print(f"T={t}: top='{top_token}' at {p.max():.4f}")

# --- Section 3: Top-k ---
print("\n--- Top-k Filtering ---")
for k in [3, 5]:
    filtered = top_k_filter(logits, k)
    p = softmax(filtered)
    active = sum(1 for x in p if x > 0.001)
    print(f"k={k}: {active} active tokens")

# --- Section 4: Top-p ---
print("\n--- Top-p Filtering ---")
for p_val in [0.8, 0.95]:
    filtered, kept = top_p_filter(logits, p_val)
    print(f"p={p_val}: nucleus size = {len(kept)}")

# --- Section 5: Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("How Sampling Parameters Reshape Probabilities", fontsize=14)
configs_viz = [
    ("Default (T=1.0)", {"temperature": 1.0}),
    ("Low temp (T=0.3)", {"temperature": 0.3}),
    ("Top-k=3", {"temperature": 1.0, "top_k": 3}),
    ("Top-p=0.8", {"temperature": 1.0, "top_p": 0.8}),
]
for ax, (title, params) in zip(axes.flat, configs_viz):
    temp = params.get("temperature", 1.0)
    scaled = logits / temp
    if "top_k" in params:
        scaled = top_k_filter(scaled, params["top_k"])
    if "top_p" in params:
        scaled, _ = top_p_filter(scaled, params["top_p"])
    p = softmax(scaled)
    colors = ["steelblue" if x > 0.001 else "lightgray" for x in p]
    ax.bar(tokens, p, color=colors)
    ax.set_title(title)
    ax.set_ylim(0, 0.7)
    ax.set_ylabel("Probability")
plt.tight_layout()
plt.show()

print("\nScript completed successfully.")
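One gap worth flagging: the script above defines `sample_with_params()` but never calls it. As a minimal, self-contained sketch of actually drawing tokens (it restates the mock vocabulary and uses a simplified helper named `sample`, which is not part of the script above), you can repeat the draw under a few settings and watch the output tighten:

```python
import numpy as np

np.random.seed(0)

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None):
    """Draw one token after temperature scaling and optional top-k masking."""
    scaled = logits / temperature
    if top_k is not None:
        # Mask everything outside the k largest scaled logits
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = softmax(scaled)
    return tokens[np.random.choice(len(tokens), p=probs)]

for label, kwargs in [("T=1.0", {}),
                      ("T=0.3", {"temperature": 0.3}),
                      ("T=1.0, k=2", {"top_k": 2})]:
    draws = [sample(logits, **kwargs) for _ in range(10)]
    print(f"{label:>12}: {' '.join(draws)}")
```

Run it a few times (or change the seed): at T=1.0 the draws wander across the vocabulary, at T=0.3 they collapse toward "the" and "cat", and with k=2 only those two tokens can ever appear.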

References

  1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020.
  2. OpenAI API Reference — Chat Completions.
  3. Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Introduced top-k sampling.
  4. Google Cloud — Generative AI: Configure model parameters.
  5. Anthropic Documentation — Sampling parameters.
  6. Hugging Face — How to generate text: using different decoding methods.
  7. Chip Huyen (2024). "Generation configurations: temperature, top-k, top-p, and test time compute."
  8. Let's Data Science — "LLM Sampling Parameters Explained: Intuition to Math."