
Temperature, Top-p & Top-k in LLMs Explained (Python)

Master LLM temperature, top-k, and top-p with interactive Python simulations. Runnable code, exercises, and a sampling playground to build real intuition.

Written by Selva Prabhakaran | 23 min read


This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.

Build interactive NumPy simulations that show how temperature, top-k, and nucleus sampling reshape a language model’s word choices.

Every time you call an LLM API, you pass parameters like temperature=0.7 or top_p=0.9. You’ve probably tweaked them, noticed vaguely different outputs, and moved on. But do you actually know what they do to the probability distribution?

Most people don’t. They’re tuning knobs blindfolded.

This article fixes that. We’ll build NumPy simulations that let you see — with real numbers — how each parameter reshapes next-token probabilities. No API keys needed. Everything runs in pure Python.

What Happens Before Sampling: Logits and Softmax

Before we touch any sampling parameter, you need to know where probabilities come from. I find this is the step most tutorials skip, and it causes confusion later.

An LLM doesn’t output probabilities directly. It outputs logits — raw scores, one per token in its vocabulary. A positive logit means “likely.” A negative logit means “unlikely.”

But logits aren’t probabilities. They don’t sum to 1.

The softmax function fixes that. It exponentiates each logit, then divides by the total. Larger logits get bigger probabilities. Smaller ones get squeezed toward zero.

Let’s build this from scratch. We’ll create a mock vocabulary of 8 tokens with hand-picked logits, then apply softmax. The softmax() function exponentiates each logit and normalizes so everything sums to 1.

import numpy as np

np.random.seed(42)

# Mock vocabulary: 8 tokens with raw logit scores
tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

def softmax(logits):
    """Convert raw logits to probabilities."""
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

probs = softmax(logits)

print("Token        Logit    Probability")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:<12} {logit:>6.1f}    {prob:.4f}")
print(f"\nSum of probabilities: {probs.sum():.4f}")

Output:

Token        Logit    Probability
--------------------------------------
the           2.0    0.3813
a             1.0    0.1403
cat           1.5    0.2313
dog           0.5    0.0851
sat          -0.5    0.0313
ran          -1.0    0.0190
on            0.3    0.0697
happy        -0.2    0.0423

Sum of probabilities: 1.0000

“the” (logit 2.0) grabbed 38% of the probability. “ran” (logit -1.0) got just 1.9%. Softmax preserves the ranking but creates an exponential gap between high and low scorers.

KEY INSIGHT: Every sampling parameter works by modifying either the logits before softmax or the probabilities after softmax. Understanding softmax is the foundation for everything else in this article.
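A side note on the implementation: softmax() subtracts np.max(logits) before exponentiating. That's a numerical-stability trick, not a math change — softmax is invariant to shifting all logits by a constant, and subtracting the max keeps np.exp from overflowing on large logits. A quick check:

```python
import numpy as np

# Same softmax as above; the max-subtraction is purely for numerical stability
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# Softmax is shift-invariant: adding a constant to every logit changes nothing,
# so the subtraction is safe — and it prevents overflow for huge logits
print(np.allclose(softmax(logits), softmax(logits + 100.0)))  # True
```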

Temperature: The Confidence Dial

Here’s a question that trips up most people: what does temperature actually do?

It divides every logit by a single number before softmax runs. That’s it. The formula is:

\[p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}\]

Where:
– \(p_i\) = probability of token \(i\)
– \(z_i\) = the raw logit for token \(i\)
– \(T\) = temperature (a positive number)
– The sum runs over all tokens in the vocabulary

If math isn’t your thing, skip to the code below — it tells the same story with real numbers.

When \(T = 1.0\), nothing changes. You get the default softmax. When \(T < 1.0\), you divide by a small number. That amplifies the gaps between logits. The top token dominates even more.

When \(T > 1.0\), you divide by a large number. That compresses logits toward zero. The distribution flattens, giving low-probability tokens a better chance.

Quick check: Before you run the next block, predict this. If we set \(T=0.5\), will "the" get more or less than its original 38.1%?

We'll apply softmax at three temperatures — 0.5 (sharp), 1.0 (default), and 2.0 (flat). The softmax_with_temperature() function divides logits by $T$ before calling softmax.

def softmax_with_temperature(logits, temperature):
    """Apply temperature scaling before softmax."""
    scaled = logits / temperature
    return softmax(scaled)

temperatures = [0.5, 1.0, 2.0]

print(f"{'Token':<10}", end="")
for t in temperatures:
    print(f"  T={t:<4}", end="")
print()
print("-" * 44)

for i, token in enumerate(tokens):
    print(f"{token:<10}", end="")
    for t in temperatures:
        p = softmax_with_temperature(logits, t)
        print(f"  {p[i]:.4f}", end="")
    print()

Output:

Token       T=0.5   T=1.0   T=2.0
--------------------------------------------
the         0.6220  0.3813  0.2423
a           0.0842  0.1403  0.1470
cat         0.2287  0.2313  0.1887
dog         0.0310  0.0851  0.1145
sat         0.0042  0.0313  0.0694
ran         0.0015  0.0190  0.0541
on          0.0208  0.0697  0.1036
happy       0.0076  0.0423  0.0807

If you predicted "more" — you're right. At \(T=0.5\), "the" jumps from 38.1% to 62.2%. Meanwhile, "ran" shrinks from 1.9% to 0.15% — roughly a 35x difference between low and high temperature for a single token.

That's the tradeoff. Low temperature sharpens. High temperature flattens.
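One way to quantify "sharpens vs. flattens" is Shannon entropy: low entropy means the mass is concentrated on few tokens, high entropy means it's spread out. This sketch (the entropy_bits() helper is ours, not a library function) measures entropy at several temperatures:

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def entropy_bits(probs):
    """Shannon entropy in bits; the maximum for 8 tokens is log2(8) = 3."""
    p = probs[probs > 0]
    return -np.sum(p * np.log2(p))

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# Entropy rises monotonically with temperature for a fixed logit vector
for t in [0.5, 1.0, 2.0, 5.0]:
    h = entropy_bits(softmax(logits / t))
    print(f"T={t}: {h:.3f} bits")
```

Entropy climbs steadily toward the 3-bit maximum as temperature rises — a compact way to compare settings without eyeballing full probability tables.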

TIP: Use temperature=0.0 to 0.3 for factual Q&A and code generation. Use 0.7 to 1.0 for creative writing. Going above 1.5 is rarely useful — the output gets incoherent.

What Does Temperature = 0 Mean?

At \(T = 0\), the model always picks the highest-logit token. No randomness at all. This is called greedy decoding. APIs handle \(T=0\) as a special case — they skip sampling and return the argmax.

def greedy_decode(logits, tokens):
    """Temperature = 0: always pick the highest logit."""
    best_idx = np.argmax(logits)
    return tokens[best_idx], logits[best_idx]

winner, score = greedy_decode(logits, tokens)
print(f"Greedy pick: '{winner}' (logit: {score:.1f})")

Output:

Greedy pick: 'the' (logit: 2.0)

No surprises. "the" had the highest logit. Greedy decoding always picks it. Every single time. Zero variety.

Now that you understand how temperature reshapes the distribution, let's practice.

Exercise 1: Build a Temperature Sweep

{
  type: 'exercise',
  id: 'temp-sweep-ex1',
  title: 'Exercise 1: Build a Temperature Sweep',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `temperature_sweep(logits, temps)` that takes a logit array and a list of temperatures. For each temperature, compute the softmax probabilities and return the probability of the *highest-logit* token at each temperature. Return a list of floats.\n\nFor example, with logits `[2.0, 1.0, 0.5]` and temps `[0.5, 1.0, 2.0]`, the highest-logit token is index 0. Return its probability at each temperature.',
  starterCode: 'def temperature_sweep(logits, temps):\n    """Return the probability of the top token at each temperature."""\n    top_idx = np.argmax(logits)\n    result = []\n    for t in temps:\n        # Apply temperature scaling and softmax\n        # Append probability of top_idx to result\n        pass\n    return result\n\n# Test\nlogits_test = np.array([2.0, 1.0, 0.5])\ntemps_test = [0.5, 1.0, 2.0]\nprint(temperature_sweep(logits_test, temps_test))',
  testCases: [
    { id: 'tc1', input: 'logits_test = np.array([2.0, 1.0, 0.5])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0, 2.0])])', expectedOutput: '[0.8438, 0.6285, 0.4810]', description: 'Standard test with three temperatures' },
    { id: 'tc2', input: 'logits_test = np.array([3.0, 3.0, 3.0])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0])])', expectedOutput: '[0.3333, 0.3333]', description: 'Equal logits — temperature should not matter' },
  ],
  hints: [
    'Inside the loop, compute scaled_logits = logits / t, then apply softmax to get probabilities.',
    'Full line: probs = softmax(logits / t); result.append(probs[top_idx])',
  ],
  solution: 'def temperature_sweep(logits, temps):\n    top_idx = np.argmax(logits)\n    result = []\n    for t in temps:\n        probs = softmax(logits / t)\n        result.append(probs[top_idx])\n    return result',
  solutionExplanation: 'For each temperature, divide the logits by T, apply softmax, and grab the probability at the index of the highest original logit. When logits are equal, softmax returns uniform probabilities regardless of temperature — the top token always gets 1/n.',
  xpReward: 15,
}

Top-k Filtering: Limiting the Candidate Pool

Temperature adjusts how spread out the probabilities are. But even at low temperature, the model still considers every token in its vocabulary.

Why is that a problem? An LLM's vocabulary is typically 32,000 to 128,000 tokens. Even with low temperature, there's a long tail with tiny probabilities. On rare occasions, the model samples from that tail and produces nonsense.
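To see how often that tail fires in practice, here's a small simulation on our 8-token toy distribution — illustrative only, since a real vocabulary has tens of thousands of tail tokens:

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

np.random.seed(42)
probs = softmax(logits)
draws = np.random.choice(len(tokens), size=10_000, p=probs)

# "sat" and "ran" are the two least likely tokens (~3.1% + ~1.9% = ~5% combined)
tail_hits = np.isin(draws, [tokens.index("sat"), tokens.index("ran")]).sum()
print(f"Tail tokens sampled {tail_hits} times out of 10,000")
```

Roughly one draw in twenty lands in the tail — rare per token, but inevitable over a long generation.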

Top-k filtering takes a different approach. It keeps only the $k$ most probable tokens and zeroes out everything else. Then it renormalizes so the survivors sum to 1.

The top_k_filter() function finds the $k$ largest logits. Everything else gets set to negative infinity, which softmax converts to zero probability.

def top_k_filter(logits, k):
    """Keep only the top-k logits; set the rest to -inf."""
    top_k_indices = np.argsort(logits)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_indices] = logits[top_k_indices]
    return filtered

for k in [3, 5]:
    filtered = top_k_filter(logits, k)
    probs = softmax(filtered)
    print(f"\nTop-{k} filtering:")
    print(f"{'Token':<10} {'Logit':>8} {'Prob':>8}")
    print("-" * 28)
    for token, logit, prob in zip(tokens, filtered, probs):
        if prob > 0:
            print(f"{token:<10} {logit:>8.1f} {prob:>8.4f}")

Output:

Top-3 filtering:
Token         Logit     Prob
----------------------------
the            2.0   0.5066
a              1.0   0.1863
cat            1.5   0.3072

Top-5 filtering:
Token         Logit     Prob
----------------------------
the            2.0   0.4202
a              1.0   0.1546
cat            1.5   0.2549
dog            0.5   0.0937
on             0.3   0.0768

With \(k=3\), only "the", "cat", and "a" survive. "the" jumps from 38.1% to 50.7% because it no longer shares mass with five eliminated tokens.

WARNING: Setting \(k\) too low (like 1 or 2) makes output extremely repetitive. Setting it too high (like 1000) has almost no effect. A typical value is \(k=40\) to \(50\).

The Problem with Top-k: One Size Doesn't Fit All

Here's a subtlety most tutorials skip. Top-k uses a fixed cutoff regardless of context.

Sometimes the model is confident — one token has 90% probability. Even \(k=5\) is too generous, including tokens the model doesn't want.

Other times, the model is uncertain. The top 20 tokens each have about 5%. With \(k=5\), you'd cut out 15 perfectly valid options.

Top-k can't adapt. That's exactly what top-p solves.

Top-p (Nucleus Sampling): The Adaptive Cutoff

Top-p, also called nucleus sampling, was introduced by Holtzman et al. (2020). It doesn't keep a fixed count of tokens. Instead, it keeps the smallest set whose cumulative probability reaches a threshold $p$.

Here's how it works. Sort tokens by probability, highest to lowest. Walk down the list, adding up probabilities. When the running total crosses $p$, stop. Everything below gets eliminated.

This adapts automatically. When the model is confident, the nucleus is tiny — maybe 2-3 tokens. When uncertain, it grows to include many options.

The top_p_filter() function sorts tokens by probability and computes a running total. np.searchsorted finds where the total first reaches $p$. Everything past that point becomes negative infinity.

def top_p_filter(logits, p):
    """Keep the smallest set of tokens with cumulative prob >= p."""
    probs = softmax(logits)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)

    cutoff_idx = np.searchsorted(cumulative, p) + 1
    cutoff_idx = min(cutoff_idx, len(probs))

    kept_indices = sorted_indices[:cutoff_idx]
    filtered = np.full_like(logits, -np.inf)
    filtered[kept_indices] = logits[kept_indices]
    return filtered, kept_indices

for p_val in [0.8, 0.95]:
    filtered, kept = top_p_filter(logits, p_val)
    probs_filtered = softmax(filtered)
    print(f"\nTop-p = {p_val}  (nucleus size: {len(kept)} tokens)")
    print(f"{'Token':<10} {'Prob':>8}")
    print("-" * 20)
    for idx in kept:
        print(f"{tokens[idx]:<10} {probs_filtered[idx]:>8.4f}")

Output:

Top-p = 0.8  (nucleus size: 4 tokens)
Token          Prob
--------------------
the         0.4551
cat         0.2760
a           0.1674
dog         0.1015

Top-p = 0.95  (nucleus size: 7 tokens)
Token          Prob
--------------------
the         0.3886
cat         0.2357
a           0.1430
dog         0.0867
on          0.0710
happy       0.0431
sat         0.0319

With \(p=0.8\), four tokens make the cut. Cumulative: "the"(0.38) + "cat"(0.23) + "a"(0.14) = 0.75. That's under 0.80. Adding "dog"(0.09) pushes it to 0.84 — past the threshold.

At \(p=0.95\), the nucleus grows to 7 tokens: the top 6 cover only 0.9497 of the mass, so "sat" is needed to cross 95%.
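You can reproduce that running total directly with np.cumsum (same toy logits as above):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

probs = softmax(logits)
order = np.argsort(probs)[::-1]        # token indices, most probable first
cumulative = np.cumsum(probs[order])   # running total down the ranking

# The first four ranks are enough to cross p = 0.8
for rank in range(4):
    idx = order[rank]
    print(f"{rank + 1}. {tokens[idx]:<6} prob={probs[idx]:.4f}  cumulative={cumulative[rank]:.4f}")
```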

KEY INSIGHT: Top-p adapts to the model's confidence. When one token dominates, the nucleus shrinks. When the model is uncertain, it grows. This is why top-p generally produces more natural text than a fixed top-k.

Top-k vs Top-p: When Does Each Win?

  • Model is confident (one token dominates): top-k keeps k tokens even though most are unwanted; top-p shrinks to 2-3 tokens automatically. Winner: top-p.
  • Model is uncertain (flat distribution): top-k cuts valid options if k is too small; top-p grows to include all reasonable tokens. Winner: top-p.
  • You need a hard ceiling on candidates: top-k guarantees at most k tokens; top-p's nucleus size varies per step. Winner: top-k.
  • Computational simplicity: top-k just sorts and slices; top-p requires a cumulative sum and a threshold check. Winner: top-k.

In practice, top-p is the better default for most tasks. Use top-k as a safety ceiling on top of it.

Predict this: If you applied temperature=0.5 first (which sharpens the distribution), then top-p=0.9 — would the nucleus be bigger or smaller than at temperature=1.0?

Smaller. A sharper distribution means fewer tokens cover 90% of the mass.

Exercise 2: Build a Nucleus Size Calculator

{
  type: 'exercise',
  id: 'nucleus-size-ex2',
  title: 'Exercise 2: Nucleus Size at Different Temperatures',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `nucleus_size(logits, temperature, p)` that returns how many tokens are in the nucleus (top-p set) for a given temperature and p threshold.\n\nApply temperature scaling first, compute softmax probabilities, sort descending, compute cumulative sum, and count how many tokens it takes to reach the cumulative threshold p.',
  starterCode: 'def nucleus_size(logits, temperature, p):\n    """Count tokens in the nucleus after temperature scaling."""\n    scaled = logits / temperature\n    probs = softmax(scaled)\n    sorted_probs = np.sort(probs)[::-1]\n    cumulative = 0.0\n    count = 0\n    # YOUR CODE HERE: loop through sorted_probs\n    return count\n\n# Test\ntest_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))\nprint(nucleus_size(test_logits, 2.0, 0.9))',
  testCases: [
    { id: 'tc1', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))', expectedOutput: '3', description: 'Low temperature — nucleus should be small' },
    { id: 'tc2', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 2.0, 0.9))', expectedOutput: '7', description: 'High temperature — nucleus should be larger' },
  ],
  hints: [
    'Use a for loop: for prob in sorted_probs: cumulative += prob; count += 1; if cumulative >= p: break',
    'Full solution body: for prob in sorted_probs:\n    cumulative += prob\n    count += 1\n    if cumulative >= p:\n        break\nreturn count',
  ],
  solution: 'def nucleus_size(logits, temperature, p):\n    scaled = logits / temperature\n    probs = softmax(scaled)\n    sorted_probs = np.sort(probs)[::-1]\n    cumulative = 0.0\n    count = 0\n    for prob in sorted_probs:\n        cumulative += prob\n        count += 1\n        if cumulative >= p:\n            break\n    return count',
  solutionExplanation: 'Temperature scaling changes how peaked the distribution is. At low temperature (0.5), the top tokens grab most of the mass, so fewer tokens reach 90%. At high temperature (2.0), the distribution flattens, requiring more tokens to accumulate 90% of probability.',
  xpReward: 15,
}

Combining Parameters: Temperature + Top-k + Top-p

In most API calls, you can set all three at once. The order of operations matters — and varies slightly by provider. The most common convention is:

  1. Temperature is applied first — reshapes logits before softmax
  2. Top-k is applied next — removes low-ranking tokens
  3. Top-p is applied last — trims to a cumulative threshold

The sample_with_params() function chains all three steps, then picks one token. We'll run 1000 samples with different combos to see how diversity changes.

def sample_with_params(logits, tokens, temperature=1.0, top_k=None, top_p=None):
    """Full sampling pipeline: temperature -> top-k -> top-p -> sample."""
    if temperature == 0:
        # Special-case T=0 as greedy decoding, mirroring API behavior
        best_idx = np.argmax(logits)
        one_hot = np.zeros_like(logits)
        one_hot[best_idx] = 1.0
        return tokens[best_idx], one_hot
    scaled_logits = logits / temperature

    if top_k is not None:
        scaled_logits = top_k_filter(scaled_logits, top_k)
    if top_p is not None:
        scaled_logits, _ = top_p_filter(scaled_logits, top_p)

    final_probs = softmax(scaled_logits)
    chosen_idx = np.random.choice(len(tokens), p=final_probs)
    return tokens[chosen_idx], final_probs

np.random.seed(42)
configs = [
    {"temperature": 0.3, "top_k": None, "top_p": None},
    {"temperature": 1.0, "top_k": 5,    "top_p": None},
    {"temperature": 0.7, "top_k": None, "top_p": 0.9},
    {"temperature": 0.7, "top_k": 5,    "top_p": 0.9},
]

for config in configs:
    counts = {}
    for _ in range(1000):
        token, _ = sample_with_params(logits, tokens, **config)
        counts[token] = counts.get(token, 0) + 1

    label = f"T={config['temperature']}"
    if config['top_k']:
        label += f", k={config['top_k']}"
    if config['top_p']:
        label += f", p={config['top_p']}"

    sorted_counts = sorted(counts.items(), key=lambda x: -x[1])
    top3 = ", ".join(f"{t}: {c/10:.1f}%" for t, c in sorted_counts[:3])
    unique = len(counts)
    print(f"{label:<20} -> {unique} unique | Top 3: {top3}")

The pattern is clear. Low temperature concentrates mass on the top token. Adding top-k or top-p narrows the pool further. Combining all three gives you the most controlled output.

Visualizing the Full Picture

A side-by-side bar chart makes the differences jump out. Each subplot shows each token's probability under a different configuration. Blue bars are active tokens; gray bars were eliminated.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("How Sampling Parameters Reshape Probabilities", fontsize=14)

configs_viz = [
    ("Default (T=1.0)", {"temperature": 1.0}),
    ("Low temp (T=0.3)", {"temperature": 0.3}),
    ("Top-k=3", {"temperature": 1.0, "top_k": 3}),
    ("Top-p=0.8", {"temperature": 1.0, "top_p": 0.8}),
]

for ax, (title, params) in zip(axes.flat, configs_viz):
    temp = params.get("temperature", 1.0)
    scaled = logits / temp
    if "top_k" in params:
        scaled = top_k_filter(scaled, params["top_k"])
    if "top_p" in params:
        scaled, _ = top_p_filter(scaled, params["top_p"])
    probs_viz = softmax(scaled)

    colors = ["steelblue" if p > 0.001 else "lightgray" for p in probs_viz]
    ax.bar(tokens, probs_viz, color=colors)
    ax.set_title(title)
    ax.set_ylim(0, 0.7)
    ax.set_ylabel("Probability")

plt.tight_layout()
plt.show()

Default softmax spreads probability across all 8 tokens. Low temperature creates a sharp spike on "the". Top-k=3 keeps only three bars. Top-p=0.8 keeps four bars, adapting to the actual distribution shape.

TIP: Top-p and top-k behave differently when distributions change. With a flat distribution (uncertain model), top-p keeps more tokens. With a peaked one (confident model), top-p keeps fewer. That's its advantage.
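That adaptivity is easy to demonstrate with two synthetic logit vectors — one peaked, one perfectly flat (both made up purely for contrast; nucleus_count() here is a stripped-down version of the top-p logic above):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def nucleus_count(logits, p):
    """How many tokens survive top-p filtering at threshold p."""
    sorted_probs = np.sort(softmax(logits))[::-1]
    return int(min(np.searchsorted(np.cumsum(sorted_probs), p) + 1, len(logits)))

peaked = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # confident model
flat = np.ones(8)                                            # maximally uncertain

print(nucleus_count(peaked, 0.9))  # 2 — the top token alone holds ~89%
print(nucleus_count(flat, 0.9))    # 8 — every token is needed to reach 90%
```

Same threshold, wildly different nucleus sizes. A fixed top-k can't make that distinction.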

The Sampling Playground: Experiment Yourself

Here's the fun part. The function below runs a full simulation. Set temperature, top-k, and top-p. It shows surviving tokens with an ASCII bar chart, then draws 20 random samples.

The sampling_playground() chains all three operations, prints the probability table, then samples tokens so you can see the variety at each setting.

def sampling_playground(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=20):
    """Interactive playground: see distribution + sample tokens."""
    print(f"=== Settings: T={temperature}", end="")
    if top_k: print(f", k={top_k}", end="")
    if top_p: print(f", p={top_p}", end="")
    print(" ===\n")

    scaled = logits / temperature
    if top_k:
        scaled = top_k_filter(scaled, top_k)
    if top_p:
        scaled, _ = top_p_filter(scaled, top_p)

    probs = softmax(scaled)
    active = [(tokens[i], probs[i]) for i in range(len(tokens)) if probs[i] > 0.001]

    print(f"Active tokens: {len(active)}")
    for t, p in sorted(active, key=lambda x: -x[1]):
        bar = "█" * int(p * 50)
        print(f"  {t:<10} {p:.4f}  {bar}")

    samples = [tokens[np.random.choice(len(tokens), p=probs)] for _ in range(n_samples)]
    print(f"\n{n_samples} samples: {', '.join(samples)}\n")

np.random.seed(42)
sampling_playground(logits, tokens, temperature=0.5, top_k=5, top_p=0.9)
sampling_playground(logits, tokens, temperature=1.5, top_k=None, top_p=0.95)

The contrast is dramatic. The conservative setting produces repetitive output dominated by "the". The liberal setting keeps many tokens active and produces varied sequences.

This is exactly the kind of experimentation you should do before tuning API parameters. Build a toy simulation first. See the diversity. Then decide what fits.

Exercise 3: Complete Sampling Pipeline

{
  type: 'exercise',
  id: 'full-pipeline-ex3',
  title: 'Exercise 3: Sample With Full Pipeline',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function `count_unique_tokens(logits, tokens, temperature, top_k, top_p, n_samples)` that runs the full sampling pipeline n_samples times and returns the number of unique tokens sampled.\n\nUse the helper functions already defined: `softmax()`, `top_k_filter()`, and `top_p_filter()`.',
  starterCode: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n    """Run sampling n_samples times, return count of unique tokens."""\n    np.random.seed(0)\n    seen = set()\n    for _ in range(n_samples):\n        scaled = logits / temperature\n        # Apply top-k if specified\n        # Apply top-p if specified\n        # Compute final probs and sample\n        # Add sampled token to seen\n        pass\n    return len(seen)\n\ntokens_test = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]\nlogits_test = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))',
  testCases: [
    { id: 'tc1', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))', expectedOutput: '3', description: 'With top_k=3, at most 3 unique tokens can appear' },
    { id: 'tc2', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=2.0, top_p=0.99))', expectedOutput: '8', description: 'High temp + high top_p should produce all 8 tokens' },
  ],
  hints: [
    'After scaling, use: if top_k: scaled = top_k_filter(scaled, top_k). Same pattern for top_p.',
    'After filtering: probs = softmax(scaled); idx = np.random.choice(len(tokens), p=probs); seen.add(tokens[idx])',
  ],
  solution: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n    np.random.seed(0)\n    seen = set()\n    for _ in range(n_samples):\n        scaled = logits / temperature\n        if top_k:\n            scaled = top_k_filter(scaled, top_k)\n        if top_p:\n            scaled, _ = top_p_filter(scaled, top_p)\n        probs = softmax(scaled)\n        idx = np.random.choice(len(tokens), p=probs)\n        seen.add(tokens[idx])\n    return len(seen)',
  solutionExplanation: 'The pipeline chains temperature -> top-k -> top-p -> sample. With aggressive filtering (low temp + small k), only a handful of tokens can ever be sampled. With permissive settings (high temp + high p), the full vocabulary opens up.',
  xpReward: 20,
}

When to Use Each Parameter: A Practical Guide

I've seen developers default to temperature=0.7 for everything. That works surprisingly often, but you're leaving control on the table.

Temperature is your primary dial. It controls overall creativity vs. determinism.

Top-k is a safety net. It stops the model from sampling extreme tail tokens. I use \(k=40\) as a default and rarely change it.

Top-p is the smart filter. It adapts to the model's confidence automatically. For most tasks, \(p=0.9\) is a solid starting point.

Task              Temperature  Top-k   Top-p  Why
-------------------------------------------------------------------------------
Code generation   0.0-0.2      10-20   0.8    You want correctness, not creativity
Factual Q&A       0.0-0.3      20-40   0.85   Accuracy matters most
Summarization     0.3-0.5      40      0.9    Slight variation is fine
Creative writing  0.7-1.0      50-100  0.95   Diverse word choices improve quality
Brainstorming     1.0-1.5      100+    0.98   You want wild, unexpected ideas

WARNING: Don't set temperature high AND top-p close to 1.0 simultaneously. That produces chaotic output. Raise one, not both.

Common Mistakes and How to Fix Them

Mistake 1: Setting temperature to 0 and expecting variety

Wrong:

# "Why does the model always give the same answer?"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku"}],
    temperature=0.0  # greedy decoding — ZERO randomness
)

Why it's wrong: Temperature 0 means greedy decoding. Same input, same output. Always.

Fix: Use temperature=0.8 or higher for creative tasks.


Mistake 2: Maxing out both temperature and top-p

Wrong:

# Double randomness — chaotic output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this"}],
    temperature=1.5,
    top_p=0.99
)

Why it's wrong: High temperature flattens the distribution. High top-p keeps nearly all tokens. The model samples almost uniformly. Output reads like word salad.

Fix: Raise one, not both. Try temperature=0.9, top_p=0.9.


Mistake 3: Adding parameters you haven't tested

Some providers set top-p to 1.0 by default. That means it has no effect. If temperature alone gives good results, don't pile on top-k and top-p "just because." Every parameter adds complexity. Add it only when testing confirms it helps.
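You can confirm that top_p=1.0 is a no-op with the same nucleus logic used earlier (a sketch, reusing our toy logits):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def nucleus_count(logits, p):
    """How many tokens survive top-p filtering at threshold p."""
    sorted_probs = np.sort(softmax(logits))[::-1]
    return int(min(np.searchsorted(np.cumsum(sorted_probs), p) + 1, len(logits)))

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

print(nucleus_count(logits, 1.0))  # 8 — p=1.0 keeps the whole vocabulary
print(nucleus_count(logits, 0.9))  # 5 — lowering p actually filters
```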

Frequently Asked Questions

Can I use temperature and top-p together?

Yes. Most APIs accept both. Temperature reshapes the distribution first, then top-p trims it. I recommend temperature as your primary control. Only tweak top-p when you need finer tail behavior.

What's the difference between top-p and min-p?

Min-p is a newer method. It sets a minimum probability threshold relative to the top token. Say the top token has probability 0.8 and you set min-p=0.1. Only tokens above 0.08 (10% of 0.8) survive. It doesn't need sorting, so it's faster. Several open-source engines support it.
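A minimal sketch of that rule on our toy distribution (this follows the common min-p definition; exact behavior varies by engine, and min_p_filter() is our own illustrative helper):

```python
import numpy as np

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * top_prob, then renormalize."""
    threshold = min_p * probs.max()   # cutoff scales with model confidence
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])
probs = softmax(logits)

filtered = min_p_filter(probs, 0.1)
print(f"Threshold: {0.1 * probs.max():.4f}")
print(f"Survivors: {np.count_nonzero(filtered)} of {len(probs)}")
```

Note there's no sort: one max, one comparison. That's where the speed advantage comes from.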

Do these parameters affect reasoning quality?

Indirectly, yes. For chain-of-thought reasoning, lower temperature (0.0--0.3) produces more consistent logic. Higher temperatures can cause the model to lose its reasoning thread. For math, code, or structured outputs, keep temperature low.

What happens if I set top-k=1?

That's greedy decoding. You always pick the single most probable token. Same effect as temperature=0, though the mechanism differs. Output is completely deterministic.

Summary

Sampling parameters control the tradeoff between predictability and diversity. Here's the core mental model:

  • Temperature reshapes logits before softmax. Low sharpens, high flattens.
  • Top-k keeps the $k$ highest-probability tokens. Simple but rigid.
  • Top-p keeps the smallest set covering cumulative probability $p$. It adapts to the model's confidence.
  • In practice, use temperature as your primary knob. Add top-p for fine-tuning. Use top-k as a safety net.

Practice exercise: Build a function that finds the "crossover temperature" — the $T$ where the second-most-likely token's probability first exceeds 20%. This tells you how much temperature it takes to make the model genuinely uncertain.

Click to see solution
def find_crossover_temperature(logits, threshold=0.2):
    """Find the temperature where the 2nd-ranked token exceeds threshold."""
    second_idx = np.argsort(logits)[-2]
    for temp in np.arange(0.1, 5.0, 0.1):
        probs = softmax(logits / temp)
        if probs[second_idx] >= threshold:
            return round(temp, 1)
    return None  # threshold never reached in the sweep range

logits_test = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])
crossover = find_crossover_temperature(logits_test)
print(f"Crossover temperature: {crossover}")

The function sweeps from 0.1 upward, checking the second-ranked token at each step. For our logits, "cat" crosses 20% at \(T=0.4\). A threshold of 25% would return None here: "cat" peaks near 24% around \(T \approx 0.7\), then flattens back toward the uniform 1/8 as temperature grows.

Complete Code

Click to expand the full script (copy-paste and run)
# Complete code from: LLM Sampling — Temperature, Top-k, and Top-p
# Requires: pip install numpy matplotlib
# Python 3.9+

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Mock Vocabulary ---
tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

# --- Core Functions ---
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def softmax_with_temperature(logits, temperature):
    return softmax(logits / temperature)

def top_k_filter(logits, k):
    top_k_indices = np.argsort(logits)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[top_k_indices] = logits[top_k_indices]
    return filtered

def top_p_filter(logits, p):
    probs = softmax(logits)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)
    cutoff_idx = min(np.searchsorted(cumulative, p) + 1, len(probs))
    kept_indices = sorted_indices[:cutoff_idx]
    filtered = np.full_like(logits, -np.inf)
    filtered[kept_indices] = logits[kept_indices]
    return filtered, kept_indices

def sample_with_params(logits, tokens, temperature=1.0, top_k=None, top_p=None):
    if temperature == 0:
        # T=0 is greedy decoding: return the argmax token deterministically
        best_idx = np.argmax(logits)
        one_hot = np.zeros_like(logits)
        one_hot[best_idx] = 1.0
        return tokens[best_idx], one_hot
    scaled_logits = logits / temperature
    if top_k is not None:
        scaled_logits = top_k_filter(scaled_logits, top_k)
    if top_p is not None:
        scaled_logits, _ = top_p_filter(scaled_logits, top_p)
    final_probs = softmax(scaled_logits)
    chosen_idx = np.random.choice(len(tokens), p=final_probs)
    return tokens[chosen_idx], final_probs

# --- Section 1: Softmax ---
probs = softmax(logits)
print("Token        Logit    Probability")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:<12} {logit:>6.1f}    {prob:.4f}")

# --- Section 2: Temperature ---
print("\n--- Temperature Comparison ---")
for t in [0.5, 1.0, 2.0]:
    p = softmax_with_temperature(logits, t)
    top_token = tokens[np.argmax(p)]
    print(f"T={t}: top='{top_token}' at {p.max():.4f}")

# --- Section 3: Top-k ---
print("\n--- Top-k Filtering ---")
for k in [3, 5]:
    filtered = top_k_filter(logits, k)
    p = softmax(filtered)
    active = sum(1 for x in p if x > 0.001)
    print(f"k={k}: {active} active tokens")

# --- Section 4: Top-p ---
print("\n--- Top-p Filtering ---")
for p_val in [0.8, 0.95]:
    filtered, kept = top_p_filter(logits, p_val)
    print(f"p={p_val}: nucleus size = {len(kept)}")

# --- Section 5: Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("How Sampling Parameters Reshape Probabilities", fontsize=14)
configs_viz = [
    ("Default (T=1.0)", {"temperature": 1.0}),
    ("Low temp (T=0.3)", {"temperature": 0.3}),
    ("Top-k=3", {"temperature": 1.0, "top_k": 3}),
    ("Top-p=0.8", {"temperature": 1.0, "top_p": 0.8}),
]
for ax, (title, params) in zip(axes.flat, configs_viz):
    temp = params.get("temperature", 1.0)
    scaled = logits / temp
    if "top_k" in params:
        scaled = top_k_filter(scaled, params["top_k"])
    if "top_p" in params:
        scaled, _ = top_p_filter(scaled, params["top_p"])
    p = softmax(scaled)
    colors = ["steelblue" if x > 0.001 else "lightgray" for x in p]
    ax.bar(tokens, p, color=colors)
    ax.set_title(title)
    ax.set_ylim(0, 0.7)
    ax.set_ylabel("Probability")
plt.tight_layout()
plt.show()

print("\nScript completed successfully.")
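One gap worth flagging: the script above defines `sample_with_params()` but never calls it. As a minimal, self-contained sketch of actually drawing tokens (it restates the mock vocabulary and uses a simplified helper named `sample`, which is not part of the script above), you can repeat the draw under a few settings and watch the output tighten:

```python
import numpy as np

np.random.seed(0)

tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None):
    """Draw one token after temperature scaling and optional top-k masking."""
    scaled = logits / temperature
    if top_k is not None:
        # Mask everything outside the k largest scaled logits
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)
    probs = softmax(scaled)
    return tokens[np.random.choice(len(tokens), p=probs)]

for label, kwargs in [("T=1.0", {}),
                      ("T=0.3", {"temperature": 0.3}),
                      ("T=1.0, k=2", {"top_k": 2})]:
    draws = [sample(logits, **kwargs) for _ in range(10)]
    print(f"{label:>12}: {' '.join(draws)}")
```

Run it a few times (or change the seed): at T=1.0 the draws wander across the vocabulary, at T=0.3 they collapse toward "the" and "cat", and with k=2 only those two tokens can ever appear.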

References

  1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020.
  2. OpenAI API Reference — Chat Completions.
  3. Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Introduced top-k sampling.
  4. Google Cloud — Generative AI: Configure model parameters.
  5. Anthropic Documentation — Sampling parameters.
  6. Hugging Face — How to generate text: using different decoding methods.
  7. Chip Huyen (2024). "Generation configurations: temperature, top-k, top-p, and test time compute."
  8. Let's Data Science — "LLM Sampling Parameters Explained: Intuition to Math."