Temperature, Top-p & Top-k in LLMs Explained (Python)
Master LLM temperature, top-k, and top-p with interactive Python simulations. Runnable code, exercises, and a sampling playground to build real intuition.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Build interactive NumPy simulations that show how temperature, top-k, and nucleus sampling reshape a language model’s word choices.
Every time you call an LLM API, you pass parameters like temperature=0.7 or top_p=0.9. You’ve probably tweaked them, noticed vaguely different outputs, and moved on. But do you actually know what they do to the probability distribution?
Most people don’t. They’re tuning knobs blindfolded.
This article fixes that. We’ll build NumPy simulations that let you see — with real numbers — how each parameter reshapes next-token probabilities. No API keys needed. Everything runs in pure Python.
What Happens Before Sampling: Logits and Softmax
Before we touch any sampling parameter, you need to know where probabilities come from. I find this is the step most tutorials skip, and it causes confusion later.
An LLM doesn’t output probabilities directly. It outputs logits — raw scores, one per token in its vocabulary. A positive logit means “likely.” A negative logit means “unlikely.”
But logits aren’t probabilities. They don’t sum to 1.
The softmax function fixes that. It exponentiates each logit, then divides by the total. Larger logits get bigger probabilities. Smaller ones get squeezed toward zero.
Let’s build this from scratch. We’ll create a mock vocabulary of 8 tokens with hand-picked logits, then apply softmax. The softmax() function exponentiates each logit and normalizes so everything sums to 1.
import numpy as np
np.random.seed(42)
# Mock vocabulary: 8 tokens with raw logit scores
tokens = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]
logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])
def softmax(logits):
"""Convert raw logits to probabilities."""
exp_logits = np.exp(logits - np.max(logits))
return exp_logits / exp_logits.sum()
probs = softmax(logits)
print("Token Logit Probability")
print("-" * 38)
for token, logit, prob in zip(tokens, logits, probs):
print(f"{token:<12} {logit:>6.1f} {prob:.4f}")
print(f"\nSum of probabilities: {probs.sum():.4f}")
Output:
Token Logit Probability
--------------------------------------
the 2.0 0.3813
a 1.0 0.1403
cat 1.5 0.2313
dog 0.5 0.0851
sat -0.5 0.0313
ran -1.0 0.0190
on 0.3 0.0697
happy -0.2 0.0423
Sum of probabilities: 1.0000
“the” (logit 2.0) grabbed 38% of the probability. “ran” (logit -1.0) got just 1.9%. Softmax preserves the ranking but creates an exponential gap between high and low scorers.
KEY INSIGHT: Every sampling parameter works by modifying either the logits before softmax or the probabilities after softmax. Understanding softmax is the foundation for everything else in this article.
Temperature: The Confidence Dial
Here’s a question that trips up most people: what does temperature actually do?
It divides every logit by a single number before softmax runs. That’s it. The formula is:
\[p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}\]
Where:
- \(p_i\) = probability of token \(i\)
- \(z_i\) = the raw logit for token \(i\)
- \(T\) = temperature (a positive number)
- The sum runs over all tokens in the vocabulary
If math isn’t your thing, skip to the code below — it tells the same story with real numbers.
When \(T = 1.0\), nothing changes. You get the default softmax. When \(T < 1.0\), you divide by a small number. That amplifies the gaps between logits. The top token dominates even more.
When \(T > 1.0\), you divide by a large number. That compresses logits toward zero. The distribution flattens, giving low-probability tokens a better chance.
Quick check: Before you run the next block, predict this. If we set \(T=0.5\), will "the" get more or less than its original 38.1%?
We'll apply softmax at three temperatures — 0.5 (sharp), 1.0 (default), and 2.0 (flat). The softmax_with_temperature() function divides logits by $T$ before calling softmax.
def softmax_with_temperature(logits, temperature):
"""Apply temperature scaling before softmax."""
scaled = logits / temperature
return softmax(scaled)
temperatures = [0.5, 1.0, 2.0]
print(f"{'Token':<10}", end="")
for t in temperatures:
print(f" T={t:<4}", end="")
print()
print("-" * 44)
for i, token in enumerate(tokens):
print(f"{token:<10}", end="")
for t in temperatures:
p = softmax_with_temperature(logits, t)
print(f" {p[i]:.4f}", end="")
print()
Output:
Token T=0.5 T=1.0 T=2.0
--------------------------------------------
the 0.6220 0.3813 0.2423
a 0.0842 0.1403 0.1470
cat 0.2287 0.2313 0.1887
dog 0.0310 0.0851 0.1145
sat 0.0042 0.0313 0.0694
ran 0.0015 0.0190 0.0541
on 0.0208 0.0697 0.1036
happy 0.0076 0.0423 0.0807
If you predicted "more", you're right. At \(T=0.5\), "the" jumps from 38.1% to 62.2%. Meanwhile, "ran" shrinks from 1.9% to 0.15%. Compare the extremes for that one token: 5.4% at \(T=2.0\) versus 0.15% at \(T=0.5\), a roughly 35x difference between high and low temperature.
That's the tradeoff. Low temperature sharpens. High temperature flattens.
TIP: Use temperature=0.0 to 0.3 for factual Q&A and code generation. Use 0.7 to 1.0 for creative writing. Going above 1.5 is rarely useful; the output gets incoherent.
What Does Temperature = 0 Mean?
At \(T = 0\), the model always picks the highest-logit token. No randomness at all. This is called greedy decoding. APIs handle \(T=0\) as a special case — they skip sampling and return the argmax.
def greedy_decode(logits, tokens):
"""Temperature = 0: always pick the highest logit."""
best_idx = np.argmax(logits)
return tokens[best_idx], logits[best_idx]
winner, score = greedy_decode(logits, tokens)
print(f"Greedy pick: '{winner}' (logit: {score:.1f})")
Output:
Greedy pick: 'the' (logit: 2.0)
No surprises. "the" had the highest logit. Greedy decoding always picks it. Every single time. Zero variety.
Now that you understand how temperature reshapes the distribution, let's practice.
Exercise 1: Build a Temperature Sweep
{
type: 'exercise',
id: 'temp-sweep-ex1',
title: 'Exercise 1: Build a Temperature Sweep',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function `temperature_sweep(logits, temps)` that takes a logit array and a list of temperatures. For each temperature, compute the softmax probabilities and return the probability of the *highest-logit* token at each temperature. Return a list of floats.\n\nFor example, with logits `[2.0, 1.0, 0.5]` and temps `[0.5, 1.0, 2.0]`, the highest-logit token is index 0. Return its probability at each temperature.',
starterCode: 'def temperature_sweep(logits, temps):\n """Return the probability of the top token at each temperature."""\n top_idx = np.argmax(logits)\n result = []\n for t in temps:\n # Apply temperature scaling and softmax\n # Append probability of top_idx to result\n pass\n return result\n\n# Test\nlogits_test = np.array([2.0, 1.0, 0.5])\ntemps_test = [0.5, 1.0, 2.0]\nprint(temperature_sweep(logits_test, temps_test))',
testCases: [
{ id: 'tc1', input: 'logits_test = np.array([2.0, 1.0, 0.5])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0, 2.0])])', expectedOutput: '[0.8438, 0.6285, 0.481]', description: 'Standard test with three temperatures' },
{ id: 'tc2', input: 'logits_test = np.array([3.0, 3.0, 3.0])\nprint([round(x, 4) for x in temperature_sweep(logits_test, [0.5, 1.0])])', expectedOutput: '[0.3333, 0.3333]', description: 'Equal logits — temperature should not matter' },
],
hints: [
'Inside the loop, compute scaled_logits = logits / t, then apply softmax to get probabilities.',
'Full line: probs = softmax(logits / t); result.append(probs[top_idx])',
],
solution: 'def temperature_sweep(logits, temps):\n top_idx = np.argmax(logits)\n result = []\n for t in temps:\n probs = softmax(logits / t)\n result.append(probs[top_idx])\n return result',
solutionExplanation: 'For each temperature, divide the logits by T, apply softmax, and grab the probability at the index of the highest original logit. When logits are equal, softmax returns uniform probabilities regardless of temperature — the top token always gets 1/n.',
xpReward: 15,
}
Top-k Filtering: Limiting the Candidate Pool
Temperature adjusts how spread out the probabilities are. But even at low temperature, the model still considers every token in its vocabulary.
Why is that a problem? An LLM's vocabulary is typically 32,000 to 128,000 tokens. Even with low temperature, there's a long tail with tiny probabilities. On rare occasions, the model samples from that tail and produces nonsense.
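To get a feel for that tail, here's a small, hedged simulation: a synthetic vocabulary of 50,000 tokens with randomly drawn logits. The Gaussian logits are an assumption for illustration only (real model logits don't follow this distribution), so the exact numbers mean nothing, but the print shows how much aggregate probability can hide outside the top 50 tokens.
import numpy as np
np.random.seed(0)
# Synthetic stand-in for a large vocabulary: 50,000 random logits.
# (Assumption for illustration only -- real model logits are not Gaussian.)
vocab_size = 50_000
big_logits = np.random.normal(loc=0.0, scale=3.0, size=vocab_size)
# Softmax over the synthetic vocabulary
exp_big = np.exp(big_logits - big_logits.max())
big_probs = exp_big / exp_big.sum()
sorted_big = np.sort(big_probs)[::-1]
print(f"Largest single-token probability: {sorted_big[0]:.2%}")
print(f"Combined mass outside the top 50: {sorted_big[50:].sum():.2%}")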
Top-k filtering takes a different approach. It keeps only the $k$ most probable tokens and zeroes out everything else. Then it renormalizes so the survivors sum to 1.
The top_k_filter() function finds the $k$ largest logits. Everything else gets set to negative infinity, which softmax converts to zero probability.
def top_k_filter(logits, k):
"""Keep only the top-k logits; set the rest to -inf."""
top_k_indices = np.argsort(logits)[-k:]
filtered = np.full_like(logits, -np.inf)
filtered[top_k_indices] = logits[top_k_indices]
return filtered
for k in [3, 5]:
filtered = top_k_filter(logits, k)
probs = softmax(filtered)
print(f"\nTop-{k} filtering:")
print(f"{'Token':<10} {'Logit':>8} {'Prob':>8}")
print("-" * 28)
for token, logit, prob in zip(tokens, filtered, probs):
if prob > 0:
print(f"{token:<10} {logit:>8.1f} {prob:>8.4f}")
Output:
Top-3 filtering:
Token Logit Prob
----------------------------
the 2.0 0.5066
a 1.0 0.1863
cat 1.5 0.3072
Top-5 filtering:
Token Logit Prob
----------------------------
the 2.0 0.4202
a 1.0 0.1546
cat 1.5 0.2549
dog 0.5 0.0937
on 0.3 0.0768
With \(k=3\), only "the", "cat", and "a" survive. "the" jumps from 38.1% to 50.7% because it no longer shares mass with five eliminated tokens.
WARNING: Setting \(k\) too low (like 1 or 2) makes output extremely repetitive. Setting it too high (like 1000) has almost no effect. A typical value is \(k=40\) to \(50\).
The Problem with Top-k: One Size Doesn't Fit All
Here's a subtlety most tutorials skip. Top-k uses a fixed cutoff regardless of context.
Sometimes the model is confident — one token has 90% probability. Even \(k=5\) is too generous, including tokens the model doesn't want.
Other times, the model is uncertain. The top 20 tokens each have about 5%. With \(k=5\), you'd cut out 15 perfectly valid options.
Top-k can't adapt. That's exactly what top-p solves.
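Before moving on, here's a quick sketch of that mismatch. It reuses the softmax() and top_k_filter() helpers defined above on two made-up 10-token distributions, one peaked and one nearly flat; the logit values are hand-picked purely for illustration.
# Two hypothetical 10-token situations (hand-picked logits for illustration)
confident_logits = np.array([8.0, 2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0])
uncertain_logits = np.array([1.00, 0.96, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88, 0.87])
k = 5
for name, lg in [("confident", confident_logits), ("uncertain", uncertain_logits)]:
    probs = softmax(lg)                      # full distribution
    kept = softmax(top_k_filter(lg, k))      # after top-k filtering
    discarded = probs[kept == 0].sum()       # original mass of the cut tokens
    print(f"{name:<10} top token: {probs.max():.1%} | top-{k} discards {discarded:.1%} of the mass")
On the peaked distribution, the fixed cutoff keeps four extra tokens the model barely wants; on the flat one, it throws away close to half of the probability mass.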
Top-p (Nucleus Sampling): The Adaptive Cutoff
Top-p, also called nucleus sampling, was introduced by Holtzman et al. (2019). It doesn't keep a fixed count of tokens. Instead, it keeps the smallest set whose cumulative probability reaches a threshold $p$.
Here's how it works. Sort tokens by probability, highest to lowest. Walk down the list, adding up probabilities. When the running total crosses $p$, stop. Everything below gets eliminated.
This adapts automatically. When the model is confident, the nucleus is tiny — maybe 2-3 tokens. When uncertain, it grows to include many options.
The top_p_filter() function sorts tokens by probability and computes a running total. np.searchsorted finds where the total first reaches $p$. Everything past that point becomes negative infinity.
def top_p_filter(logits, p):
"""Keep the smallest set of tokens with cumulative prob >= p."""
probs = softmax(logits)
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]
cumulative = np.cumsum(sorted_probs)
cutoff_idx = np.searchsorted(cumulative, p) + 1
cutoff_idx = min(cutoff_idx, len(probs))
kept_indices = sorted_indices[:cutoff_idx]
filtered = np.full_like(logits, -np.inf)
filtered[kept_indices] = logits[kept_indices]
return filtered, kept_indices
for p_val in [0.8, 0.95]:
filtered, kept = top_p_filter(logits, p_val)
probs_filtered = softmax(filtered)
print(f"\nTop-p = {p_val} (nucleus size: {len(kept)} tokens)")
print(f"{'Token':<10} {'Prob':>8}")
print("-" * 20)
for idx in kept:
print(f"{tokens[idx]:<10} {probs_filtered[idx]:>8.4f}")
Output:
Top-p = 0.8 (nucleus size: 4 tokens)
Token Prob
--------------------
the 0.4551
cat 0.2760
a 0.1674
dog 0.1015
Top-p = 0.95 (nucleus size: 7 tokens)
Token Prob
--------------------
the 0.3886
cat 0.2357
a 0.1430
dog 0.0867
on 0.0710
happy 0.0431
sat 0.0319
With \(p=0.8\), four tokens make the cut. Cumulative: "the"(0.38) + "cat"(0.23) + "a"(0.14) = 0.75. That's under 0.80. Adding "dog"(0.09) pushes it to 0.84 — past the threshold.
At \(p=0.95\), the nucleus grows to 7 tokens. The distribution needs more options to cover 95% of the mass.
KEY INSIGHT: Top-p adapts to the model's confidence. When one token dominates, the nucleus shrinks. When the model is uncertain, it grows. This is why top-p generally produces more natural text than a fixed top-k.
Top-k vs Top-p: When Does Each Win?
| Scenario | Top-k | Top-p | Winner |
|---|---|---|---|
| Model is confident (one token dominates) | Keeps k tokens even though most are unwanted | Shrinks to 2-3 tokens automatically | Top-p |
| Model is uncertain (flat distribution) | Cuts valid options if k is too small | Grows to include all reasonable tokens | Top-p |
| You need a hard ceiling on candidates | Guarantees at most k tokens | Nucleus size varies per step | Top-k |
| Computational simplicity | Just sort and slice | Requires cumulative sum + threshold | Top-k |
In practice, top-p is the better default for most tasks. Use top-k as a safety ceiling on top of it.
Predict this: If you applied temperature=0.5 first (which sharpens the distribution), then top-p=0.9 — would the nucleus be bigger or smaller than at temperature=1.0?
Smaller. A sharper distribution means fewer tokens cover 90% of the mass.
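You can verify this with the top_p_filter() helper from above. On our 8-token example, sharpening with \(T=0.5\) shrinks the \(p=0.9\) nucleus from 5 tokens to 3.
for t in [1.0, 0.5]:
    _, kept = top_p_filter(logits / t, 0.9)   # scale by temperature, then apply top-p
    print(f"T={t}: nucleus at p=0.9 contains {len(kept)} tokens")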
Exercise 2: Build a Nucleus Size Calculator
{
type: 'exercise',
id: 'nucleus-size-ex2',
title: 'Exercise 2: Nucleus Size at Different Temperatures',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function `nucleus_size(logits, temperature, p)` that returns how many tokens are in the nucleus (top-p set) for a given temperature and p threshold.\n\nApply temperature scaling first, compute softmax probabilities, sort descending, compute cumulative sum, and count how many tokens it takes to reach the cumulative threshold p.',
starterCode: 'def nucleus_size(logits, temperature, p):\n """Count tokens in the nucleus after temperature scaling."""\n scaled = logits / temperature\n probs = softmax(scaled)\n sorted_probs = np.sort(probs)[::-1]\n cumulative = 0.0\n count = 0\n # YOUR CODE HERE: loop through sorted_probs\n return count\n\n# Test\ntest_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))\nprint(nucleus_size(test_logits, 2.0, 0.9))',
testCases: [
{ id: 'tc1', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 0.5, 0.9))', expectedOutput: '3', description: 'Low temperature — nucleus should be small' },
{ id: 'tc2', input: 'test_logits = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(nucleus_size(test_logits, 2.0, 0.9))', expectedOutput: '7', description: 'High temperature — nucleus should be larger' },
],
hints: [
'Use a for loop: for prob in sorted_probs: cumulative += prob; count += 1; if cumulative >= p: break',
'Full solution body: for prob in sorted_probs:\n cumulative += prob\n count += 1\n if cumulative >= p:\n break\nreturn count',
],
solution: 'def nucleus_size(logits, temperature, p):\n scaled = logits / temperature\n probs = softmax(scaled)\n sorted_probs = np.sort(probs)[::-1]\n cumulative = 0.0\n count = 0\n for prob in sorted_probs:\n cumulative += prob\n count += 1\n if cumulative >= p:\n break\n return count',
solutionExplanation: 'Temperature scaling changes how peaked the distribution is. At low temperature (0.5), the top tokens grab most of the mass, so fewer tokens reach 90%. At high temperature (2.0), the distribution flattens, requiring more tokens to accumulate 90% of probability.',
xpReward: 15,
}
Combining Parameters: Temperature + Top-k + Top-p
Many APIs let you combine these parameters in a single call (OpenAI exposes temperature and top_p; Anthropic and most open-source inference servers also expose top_k). The order of operations matters and varies slightly by provider. The most common convention is:
- Temperature is applied first — reshapes logits before softmax
- Top-k is applied next — removes low-ranking tokens
- Top-p is applied last — trims to a cumulative threshold
The sample_with_params() function chains all three steps, then picks one token. We'll run 1000 samples with different combos to see how diversity changes.
def sample_with_params(logits, tokens, temperature=1.0, top_k=None, top_p=None):
"""Full sampling pipeline: temperature -> top-k -> top-p -> sample."""
scaled_logits = logits / temperature if temperature > 0 else logits
if top_k is not None:
scaled_logits = top_k_filter(scaled_logits, top_k)
if top_p is not None:
scaled_logits, _ = top_p_filter(scaled_logits, top_p)
final_probs = softmax(scaled_logits)
chosen_idx = np.random.choice(len(tokens), p=final_probs)
return tokens[chosen_idx], final_probs
np.random.seed(42)
configs = [
{"temperature": 0.3, "top_k": None, "top_p": None},
{"temperature": 1.0, "top_k": 5, "top_p": None},
{"temperature": 0.7, "top_k": None, "top_p": 0.9},
{"temperature": 0.7, "top_k": 5, "top_p": 0.9},
]
for config in configs:
counts = {}
for _ in range(1000):
token, _ = sample_with_params(logits, tokens, **config)
counts[token] = counts.get(token, 0) + 1
label = f"T={config['temperature']}"
if config['top_k']:
label += f", k={config['top_k']}"
if config['top_p']:
label += f", p={config['top_p']}"
sorted_counts = sorted(counts.items(), key=lambda x: -x[1])
top3 = ", ".join(f"{t}: {c/10:.1f}%" for t, c in sorted_counts[:3])
unique = len(counts)
print(f"{label:<20} -> {unique} unique | Top 3: {top3}")
The pattern is clear. Low temperature concentrates mass on the top token. Adding top-k or top-p narrows the pool further. Combining all three gives you the most controlled output.
Visualizing the Full Picture
A side-by-side bar chart makes the differences jump out. Each subplot shows each token's probability under a different configuration. Blue bars are active tokens; gray bars were eliminated.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("How Sampling Parameters Reshape Probabilities", fontsize=14)
configs_viz = [
("Default (T=1.0)", {"temperature": 1.0}),
("Low temp (T=0.3)", {"temperature": 0.3}),
("Top-k=3", {"temperature": 1.0, "top_k": 3}),
("Top-p=0.8", {"temperature": 1.0, "top_p": 0.8}),
]
for ax, (title, params) in zip(axes.flat, configs_viz):
temp = params.get("temperature", 1.0)
scaled = logits / temp
if "top_k" in params:
scaled = top_k_filter(scaled, params["top_k"])
if "top_p" in params:
scaled, _ = top_p_filter(scaled, params["top_p"])
probs_viz = softmax(scaled)
colors = ["steelblue" if p > 0.001 else "lightgray" for p in probs_viz]
ax.bar(tokens, probs_viz, color=colors)
ax.set_title(title)
ax.set_ylim(0, 0.7)
ax.set_ylabel("Probability")
plt.tight_layout()
plt.show()
Default softmax spreads probability across all 8 tokens. Low temperature creates a sharp spike on "the". Top-k=3 keeps only three bars. Top-p=0.8 keeps four bars, adapting to the actual distribution shape.
TIP: Top-p and top-k behave differently when distributions change. With a flat distribution (uncertain model), top-p keeps more tokens. With a peaked one (confident model), top-p keeps fewer. That's its advantage.
The Sampling Playground: Experiment Yourself
Here's the fun part. The function below runs a full simulation. Set temperature, top-k, and top-p. It shows surviving tokens with an ASCII bar chart, then draws 20 random samples.
The sampling_playground() chains all three operations, prints the probability table, then samples tokens so you can see the variety at each setting.
def sampling_playground(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=20):
"""Interactive playground: see distribution + sample tokens."""
print(f"=== Settings: T={temperature}", end="")
if top_k: print(f", k={top_k}", end="")
if top_p: print(f", p={top_p}", end="")
print(" ===\n")
scaled = logits / temperature
if top_k:
scaled = top_k_filter(scaled, top_k)
if top_p:
scaled, _ = top_p_filter(scaled, top_p)
probs = softmax(scaled)
active = [(tokens[i], probs[i]) for i in range(len(tokens)) if probs[i] > 0.001]
print(f"Active tokens: {len(active)}")
for t, p in sorted(active, key=lambda x: -x[1]):
bar = "█" * int(p * 50)
print(f" {t:<10} {p:.4f} {bar}")
samples = [tokens[np.random.choice(len(tokens), p=probs)] for _ in range(n_samples)]
print(f"\n{n_samples} samples: {', '.join(samples)}\n")
np.random.seed(42)
sampling_playground(logits, tokens, temperature=0.5, top_k=5, top_p=0.9)
sampling_playground(logits, tokens, temperature=1.5, top_k=None, top_p=0.95)
The contrast is dramatic. The conservative setting produces repetitive output dominated by "the". The liberal setting keeps many tokens active and produces varied sequences.
This is exactly the kind of experimentation you should do before tuning API parameters. Build a toy simulation first. See the diversity. Then decide what fits.
Exercise 3: Complete Sampling Pipeline
{
type: 'exercise',
id: 'full-pipeline-ex3',
title: 'Exercise 3: Sample With Full Pipeline',
difficulty: 'intermediate',
exerciseType: 'write',
instructions: 'Write a function `count_unique_tokens(logits, tokens, temperature, top_k, top_p, n_samples)` that runs the full sampling pipeline n_samples times and returns the number of unique tokens sampled.\n\nUse the helper functions already defined: `softmax()`, `top_k_filter()`, and `top_p_filter()`.',
starterCode: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n """Run sampling n_samples times, return count of unique tokens."""\n np.random.seed(0)\n seen = set()\n for _ in range(n_samples):\n scaled = logits / temperature\n # Apply top-k if specified\n # Apply top-p if specified\n # Compute final probs and sample\n # Add sampled token to seen\n pass\n return len(seen)\n\ntokens_test = ["the", "a", "cat", "dog", "sat", "ran", "on", "happy"]\nlogits_test = np.array([2.0, 1.0, 1.5, 0.5, -0.5, -1.0, 0.3, -0.2])\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))',
testCases: [
{ id: 'tc1', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=0.3, top_k=3))', expectedOutput: '3', description: 'With top_k=3, at most 3 unique tokens can appear' },
{ id: 'tc2', input: 'np.random.seed(0)\nprint(count_unique_tokens(logits_test, tokens_test, temperature=2.0, top_p=0.99))', expectedOutput: '8', description: 'High temp + high top_p should produce all 8 tokens' },
],
hints: [
'After scaling, use: if top_k: scaled = top_k_filter(scaled, top_k). Same pattern for top_p.',
'After filtering: probs = softmax(scaled); idx = np.random.choice(len(tokens), p=probs); seen.add(tokens[idx])',
],
solution: 'def count_unique_tokens(logits, tokens, temperature=1.0, top_k=None, top_p=None, n_samples=500):\n np.random.seed(0)\n seen = set()\n for _ in range(n_samples):\n scaled = logits / temperature\n if top_k:\n scaled = top_k_filter(scaled, top_k)\n if top_p:\n scaled, _ = top_p_filter(scaled, top_p)\n probs = softmax(scaled)\n idx = np.random.choice(len(tokens), p=probs)\n seen.add(tokens[idx])\n return len(seen)',
solutionExplanation: 'The pipeline chains temperature -> top-k -> top-p -> sample. With aggressive filtering (low temp + small k), only a handful of tokens can ever be sampled. With permissive settings (high temp + high p), the full vocabulary opens up.',
xpReward: 20,
}
When to Use Each Parameter: A Practical Guide
I've seen developers default to temperature=0.7 for everything. That works surprisingly often, but you're leaving control on the table.
Temperature is your primary dial. It controls overall creativity vs. determinism.
Top-k is a safety net. It stops the model from sampling extreme tail tokens. I use \(k=40\) as a default and rarely change it.
Top-p is the smart filter. It adapts to the model's confidence automatically. For most tasks, \(p=0.9\) is a solid starting point.
| Task | Temperature | Top-k | Top-p | Why |
|---|---|---|---|---|
| Code generation | 0.0 -- 0.2 | 10--20 | 0.8 | You want correctness, not creativity |
| Factual Q&A | 0.0 -- 0.3 | 20--40 | 0.85 | Accuracy matters most |
| Summarization | 0.3 -- 0.5 | 40 | 0.9 | Slight variation is fine |
| Creative writing | 0.7 -- 1.0 | 50--100 | 0.95 | Diverse word choices improve quality |
| Brainstorming | 1.0 -- 1.5 | 100+ | 0.98 | You want wild, unexpected ideas |
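If you want these defaults as code, one option is a small preset dictionary distilled from the table. The names and the specific mid-range values below are my own picks, so treat them as starting points rather than rules.
# Hypothetical presets distilled from the table above -- starting points, not rules
SAMPLING_PRESETS = {
    "code_generation":  {"temperature": 0.1, "top_k": 20,  "top_p": 0.8},
    "factual_qa":       {"temperature": 0.2, "top_k": 40,  "top_p": 0.85},
    "summarization":    {"temperature": 0.4, "top_k": 40,  "top_p": 0.9},
    "creative_writing": {"temperature": 0.9, "top_k": 50,  "top_p": 0.95},
    "brainstorming":    {"temperature": 1.2, "top_k": 100, "top_p": 0.98},
}
# Example: feed a preset into the sampling_playground() defined earlier
sampling_playground(logits, tokens, n_samples=10, **SAMPLING_PRESETS["creative_writing"])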
WARNING: Don't set temperature high AND top-p close to 1.0 simultaneously. That produces chaotic output. Raise one, not both.
Common Mistakes and How to Fix Them
Mistake 1: Setting temperature to 0 and expecting variety
Wrong:
# "Why does the model always give the same answer?"
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a haiku"}],
temperature=0.0 # greedy decoding — ZERO randomness
)
Why it's wrong: Temperature 0 means greedy decoding. Same input, same output. Always.
Fix: Use temperature=0.8 or higher for creative tasks.
Mistake 2: Maxing out both temperature and top-p
Wrong:
# Double randomness — chaotic output
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Summarize this"}],
temperature=1.5,
top_p=0.99
)
Why it's wrong: High temperature flattens the distribution. High top-p keeps nearly all tokens. The model samples almost uniformly. Output reads like word salad.
Fix: Raise one, not both. Try temperature=0.9, top_p=0.9.
Mistake 3: Adding parameters you haven't tested
Some providers set top-p to 1.0 by default. That means it has no effect. If temperature alone gives good results, don't pile on top-k and top-p "just because." Every parameter adds complexity. Add it only when testing confirms it helps.
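You can see the "no effect" case directly with the top_p_filter() helper from earlier: at p=1.0 the nucleus is the entire vocabulary.
_, kept = top_p_filter(logits, 1.0)
print(f"Tokens kept at top_p=1.0: {len(kept)} of {len(tokens)}")   # keeps all 8 -- filters nothing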
Frequently Asked Questions
Can I use temperature and top-p together?
Yes. Most APIs accept both. Temperature reshapes the distribution first, then top-p trims it. I recommend temperature as your primary control. Only tweak top-p when you need finer tail behavior.
What's the difference between top-p and min-p?
Min-p is a newer method. It sets a minimum probability threshold relative to the top token. Say the top token has probability 0.8 and you set min-p=0.1. Only tokens above 0.08 (10% of 0.8) survive. It doesn't need sorting, so it's faster. Several open-source engines support it.
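For intuition, here's a minimal sketch of the idea, reusing the softmax() helper from earlier. The function name min_p_filter and the cutoff rule follow the common description of min-p, not any particular engine's implementation.
def min_p_filter(logits, min_p):
    """Keep tokens whose probability is at least min_p times the top token's probability."""
    probs = softmax(logits)
    threshold = min_p * probs.max()
    return np.where(probs >= threshold, logits, -np.inf)
kept_probs = softmax(min_p_filter(logits, 0.1))
print([t for t, p in zip(tokens, kept_probs) if p > 0])   # 6 of our 8 tokens survive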
Do these parameters affect reasoning quality?
Indirectly, yes. For chain-of-thought reasoning, lower temperature (0.0--0.3) produces more consistent logic. Higher temperatures can cause the model to lose its reasoning thread. For math, code, or structured outputs, keep temperature low.
What happens if I set top-k=1?
That's greedy decoding. You always pick the single most probable token. Same effect as temperature=0, though the mechanism differs. Output is completely deterministic.
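A quick check with the helpers defined earlier confirms it: after top-k=1 filtering, the surviving token carries all of the probability.
probs_k1 = softmax(top_k_filter(logits, 1))
print(tokens[int(np.argmax(probs_k1))], f"{probs_k1.max():.2f}")   # 'the' 1.00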
Summary
Sampling parameters control the tradeoff between predictability and diversity. Here's the core mental model:
- Temperature reshapes logits before softmax. Low sharpens, high flattens.
- Top-k keeps the $k$ highest-probability tokens. Simple but rigid.
- Top-p keeps the smallest set covering cumulative probability $p$. It adapts to the model's confidence.
- In practice, use temperature as your primary knob. Add top-p for fine-tuning. Use top-k as a safety net.
Practice exercise: Build a function that finds the "crossover temperature" — the $T$ where the second-most-likely token's probability exceeds 25%. This tells you how much temperature it takes to make the model genuinely uncertain.
References
- Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. Link
- OpenAI API Reference — Chat Completions. Link
- Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. Link — Introduced top-k sampling.
- Google Cloud — Generative AI: Configure model parameters. Link
- Anthropic Documentation — Sampling parameters. Link
- Hugging Face — How to generate text: using different decoding methods. Link
- Chip Huyen (2024). "Generation configurations: temperature, top-k, top-p, and test time compute." Link
- Let's Data Science — "LLM Sampling Parameters Explained: Intuition to Math." Link