Mixture of Experts (MoE): Build a Router in Python
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser.
Build a working MoE layer with top-k gating, expert networks, and load balancing — all in pure NumPy.
GPT-4 reportedly has over a trillion parameters. Mixtral 8x7B has 47 billion. Yet Mixtral runs nearly as fast as a 13-billion-parameter model. How?
The secret is Mixture of Experts. Instead of running every token through all 47 billion parameters, a tiny router picks just 2 out of 8 experts per token. Each expert is a full feed-forward network, but only the selected ones activate. You get the capacity of a huge model at the cost of a small one.
That’s what you’ll build here. By the end, you’ll have a working MoE layer in NumPy — router, experts, top-k gating, and load balancing loss.
Here’s how the pieces fit together. Token embeddings enter the router — a small linear layer that scores each token against each expert. Softmax converts those scores into gating weights. Top-k selection picks the best 2 experts per token. Each selected expert processes its assigned tokens through a feed-forward network. The outputs get multiplied by gating weights and summed. A load balancing loss nudges the router to spread tokens evenly — without it, one expert hogs all the traffic.
Prerequisites
- Python version: 3.9+
- Required libraries: NumPy (1.24+)
- Install: pip install numpy
- Time to complete: 25-30 minutes
What Is Mixture of Experts?
Mixture of Experts is an architecture where multiple specialist sub-networks share a single model. A gating network decides which experts handle each input.
Think of it like a hospital. You don’t send every patient to every doctor. A triage nurse (the router) evaluates each patient and sends them to the right specialist. The hospital has the combined knowledge of all doctors, but each patient only uses one or two.
In neural networks, each “expert” is a feed-forward network. The router is a learned linear layer that maps each token to a probability distribution over experts.
| Component | Role | Analogy |
|---|---|---|
| Router/Gate | Decides which experts handle each token | Triage nurse |
| Expert | Processes tokens through a feed-forward network | Specialist doctor |
| Top-k Selection | Picks the best k experts per token | “See these 2 doctors” |
| Load Balancing | Ensures all experts get used evenly | Fair patient distribution |
We’ll create the simplest possible MoE — one router, four experts, and top-2 gating. The first code block sets up the configuration and creates random token embeddings. In a real transformer, these tokens would come from an attention layer. Here we use random vectors.
import numpy as np
np.random.seed(42)
# MoE configuration
d_model = 8 # embedding dimension
num_experts = 4 # number of expert networks
d_ff = 16 # feed-forward hidden dimension
top_k = 2 # experts selected per token
batch_size = 6 # number of tokens
# Random token embeddings (in real models, these come from attention layers)
tokens = np.random.randn(batch_size, d_model)
print(f"Tokens shape: {tokens.shape}")
print(f"Config: {num_experts} experts, top-{top_k} gating, d_model={d_model}")
Output:
Tokens shape: (6, 8)
Config: 4 experts, top-2 gating, d_model=8
Six tokens, each with 8 dimensions. Four experts are waiting. The router will decide which 2 experts each token visits.
How the Router Works
The router is the brain of MoE. It’s surprisingly simple — a single linear layer followed by a softmax.
For each token, the router produces a score for every expert. Softmax converts those scores into probabilities. The highest probabilities determine which experts get activated.
Here’s the math. Given a token embedding \(x\) and a router weight matrix \(W_g\) of shape \((d_{model}, n_{experts})\):
\[g(x) = \text{softmax}(x \cdot W_g)\]

Where:
- \(g(x)\) = gating probabilities (one per expert, sums to 1)
- \(x\) = token embedding vector
- \(W_g\) = learnable router weights
If you’re not interested in the math, skip ahead — the practical code below is all you need.
In code, this is one matrix multiply and a softmax. The W_gate matrix maps each token from d_model dimensions to num_experts scores. Softmax normalizes those into probabilities.
def softmax(x, axis=-1):
"""Numerically stable softmax."""
x_max = np.max(x, axis=axis, keepdims=True)
exp_x = np.exp(x - x_max)
return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
# Router: single linear layer
W_gate = np.random.randn(d_model, num_experts) * 0.1
# Compute gating scores for all tokens
gate_logits = tokens @ W_gate # (batch_size, num_experts)
gate_probs = softmax(gate_logits) # (batch_size, num_experts)
print("Gate probabilities (each row sums to 1):")
print(np.round(gate_probs, 3))
print(f"\nRow sums: {np.round(gate_probs.sum(axis=1), 4)}")
You’ll see a (6, 4) matrix where each row sums to 1.0. The probabilities are close to uniform because the router isn’t trained — it’s random projections. The next step picks hard winners from these soft preferences.
Key Insight: The router is just a learned linear projection. A single matrix multiply followed by softmax decides the entire expert assignment. That’s what makes MoE routing cheap.
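To put numbers on that, here is a rough per-token cost comparison using this tutorial’s toy dimensions. It is a simplified count (one multiply-add counted as one FLOP, softmax and activations ignored), so treat the ratio as illustrative:

```python
# Back-of-the-envelope cost of routing vs expert compute,
# using this tutorial's toy config.
d_model = 8
d_ff = 16
num_experts = 4
top_k = 2

router_flops = d_model * num_experts            # one small matmul per token
expert_flops = d_model * d_ff + d_ff * d_model  # Linear -> ReLU -> Linear
active_flops = top_k * expert_flops             # only selected experts run

print(f"Router cost per token: {router_flops} FLOPs")
print(f"One expert per token:  {expert_flops} FLOPs")
print(f"Routing overhead: {router_flops / active_flops:.1%} of active expert compute")
```

Even in this tiny config the router is a few percent of the expert compute, and the gap widens as d_ff grows.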
Top-k Gating: Picking the Winners
Why pick only k experts instead of using all of them? If you send every token to every expert, you’ve built a standard dense model. No savings at all. Top-k selection is what gives MoE its speed advantage.
Here’s how top-k gating works. For each token, find the indices of the k highest probabilities. Zero out everything else. Then renormalize the remaining weights so they sum to 1. Without renormalization, the output would be scaled down.
The code uses np.argsort to sort expert indices by probability. We take the last top_k indices (highest scores), build a mask, then divide by the new sum.
def top_k_gating(gate_probs, k):
"""Select top-k experts per token and renormalize weights."""
batch_size, num_experts = gate_probs.shape
# Find top-k expert indices per token
top_k_indices = np.argsort(gate_probs, axis=1)[:, -k:]
# Build a mask: 1 for selected experts, 0 for the rest
mask = np.zeros_like(gate_probs)
for i in range(batch_size):
mask[i, top_k_indices[i]] = 1.0
# Zero out non-selected experts and renormalize
gated_probs = gate_probs * mask
gated_probs = gated_probs / gated_probs.sum(axis=1, keepdims=True)
return gated_probs, top_k_indices
gated_probs, selected_experts = top_k_gating(gate_probs, top_k)
print("Selected experts per token:")
for i in range(batch_size):
experts_sel = selected_experts[i]
weights = [gated_probs[i, e] for e in experts_sel]
print(f" Token {i}: experts {experts_sel} "
f"with weights {[f'{w:.3f}' for w in weights]}")
Each token now has exactly 2 experts assigned with renormalized weights summing to 1. The gating weights control how much each expert contributes to the final output.
Quick check — can you spot a potential problem? Some experts might appear frequently while others sit idle. We’ll fix this with load balancing later.
Building Expert Networks
Each expert is a standard feed-forward network: linear layer, ReLU, second linear layer. This is the same structure inside a transformer block. The only difference is that MoE has multiple copies, and each token visits a subset.
We’ll initialize four experts. Each has two weight matrices: W1 maps from d_model to d_ff (expanding the representation), and W2 maps back from d_ff to d_model. ReLU sits between them. We use He initialization (* np.sqrt(2.0 / fan_in)) because it prevents gradient issues in ReLU networks.
def relu(x):
"""ReLU activation: max(0, x)."""
return np.maximum(0, x)
def init_expert(d_model, d_ff):
"""Initialize one expert's weights (two linear layers)."""
return {
'W1': np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model),
'W2': np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff),
}
def expert_forward(x, expert_params):
"""Forward pass: Linear -> ReLU -> Linear."""
hidden = relu(x @ expert_params['W1'])
output = hidden @ expert_params['W2']
return output
# Create all experts
experts = [init_expert(d_model, d_ff) for _ in range(num_experts)]
# Test one expert with a single token
test_output = expert_forward(tokens[0:1], experts[0])
print(f"Input shape: {tokens[0:1].shape}")
print(f"Output shape: {test_output.shape}")
Output:
Input shape: (1, 8)
Output shape: (1, 8)
The expert takes in a (1, 8) token and returns a (1, 8) output. Dimensionality stays the same — d_model in, d_model out. Internally it expands to d_ff=16 dimensions before compressing back. This expansion-compression pattern adds representational capacity.
Tip: For GELU or SiLU activations (used in Llama, Mistral), switch to Xavier initialization. He initialization is specifically designed for ReLU. Using the wrong scheme causes training instability.
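If you do swap in GELU or SiLU, the change is confined to the init function. A sketch using Xavier/Glorot scaling: exact variants differ across papers, so treat the constants here as a reasonable default rather than the one true recipe:

```python
import numpy as np

np.random.seed(0)
d_model, d_ff = 8, 16

def init_expert_xavier(d_model, d_ff):
    """Xavier/Glorot-style init: variance scaled by fan_in + fan_out.
    Suited to roughly symmetric activations like GELU, SiLU, or tanh."""
    return {
        'W1': np.random.randn(d_model, d_ff) * np.sqrt(2.0 / (d_model + d_ff)),
        'W2': np.random.randn(d_ff, d_model) * np.sqrt(2.0 / (d_ff + d_model)),
    }

expert = init_expert_xavier(d_model, d_ff)
print(f"W1 std: {expert['W1'].std():.3f} "
      f"(target ~{np.sqrt(2.0 / (d_model + d_ff)):.3f})")
```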
The MoE Forward Pass
This is where everything connects. The forward pass combines the router, top-k gating, and expert networks into one function.
Here’s the data flow:
- Router scores each token against all experts
- Top-k gating picks the best 2 experts per token
- Each selected expert processes its assigned tokens
- Outputs are scaled by gating weights and summed
The implementation loops over experts. For each one, it finds which tokens selected that expert, runs them through the network, scales by gating weight, and accumulates. This approach is cleaner than looping over tokens because each expert processes a batch at once.
def moe_forward(tokens, W_gate, experts, top_k):
"""Full MoE forward pass: route, gate, compute, combine."""
batch_size, d_model = tokens.shape
# Step 1: Router produces gating probabilities
gate_logits = tokens @ W_gate
gate_probs = softmax(gate_logits)
# Step 2: Top-k selection
gated_probs, top_k_indices = top_k_gating(gate_probs, top_k)
# Step 3 & 4: Run selected experts and combine
output = np.zeros_like(tokens)
for expert_idx in range(len(experts)):
token_mask = gated_probs[:, expert_idx] > 0
if not token_mask.any():
continue
expert_tokens = tokens[token_mask]
expert_weights = gated_probs[token_mask, expert_idx]
expert_out = expert_forward(expert_tokens, experts[expert_idx])
weighted_out = expert_out * expert_weights[:, np.newaxis]
output[token_mask] += weighted_out
return output, gate_probs, gated_probs
# Run the MoE layer
moe_output, raw_probs, final_probs = moe_forward(
tokens, W_gate, experts, top_k
)
print(f"Input shape: {tokens.shape}")
print(f"Output shape: {moe_output.shape}")
Output:
Input shape: (6, 8)
Output shape: (6, 8)
The shape is preserved — (6, 8) in, (6, 8) out. But every token was processed by its own pair of experts. Each got a personalized computation through a different expert combination.
What I find elegant about this: the router learns (during training) which expert handles which “type” of token. In a language model, one expert might specialize in syntax, another in facts, another in code. The model discovers this on its own.
The Load Balancing Problem
Without load balancing, MoE training collapses. Here’s why.
Early in training, the router assigns tokens randomly. Some experts get lucky — they receive slightly better tokens and improve faster. The router notices and sends more tokens their way. This creates a vicious cycle. Within a few hundred steps, one expert handles 90% of tokens while the others sit idle.
This is called expert collapse. It’s the biggest failure mode in MoE training.
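You can watch the feedback loop in a toy simulation. Here the “training update” is caricatured as simply boosting the logit of whichever expert wins each round; real training is gradient-based, but the rich-get-richer dynamic is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts = 4
logits = np.zeros(num_experts)   # router starts with no preference
counts = np.zeros(num_experts)   # tokens handled by each expert so far

for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    winner = rng.choice(num_experts, p=probs)  # route one token
    logits[winner] += 0.05   # winning expert "improves" and gets favored more
    counts[winner] += 1

collapsed_probs = np.exp(logits) / np.exp(logits).sum()
print("Final routing probabilities:", np.round(collapsed_probs, 3))
print("Tokens handled per expert:  ", counts.astype(int))
```

Run it a few times with different seeds: the distribution almost always concentrates on one expert, which is exactly the collapse the balancing loss exists to prevent.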
The fix is a load balancing loss. Google’s Switch Transformer paper defined the standard formula [2]. For each expert, track two things: what fraction of tokens it receives (\(f_i\)), and what fraction of the total gating probability it gets (\(P_i\)).
\[L_{balance} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i\]

Where:
- \(N\) = number of experts
- \(f_i\) = fraction of tokens routed to expert \(i\)
- \(P_i\) = mean gating probability assigned to expert \(i\)
- \(\alpha\) = balancing coefficient (typically 0.01-0.1)
Why multiply \(f_i\) and \(P_i\)? An expert with many tokens AND high probability is hogging resources. The loss pushes the router to spread the load.
The code below computes both fractions. alpha=0.01 is a safe starting value — typical production values range from 0.01 to 0.1.
def load_balancing_loss(gate_probs, gated_probs, num_experts, alpha=0.01):
"""Compute auxiliary load balancing loss."""
batch_size = gate_probs.shape[0]
# f_i: fraction of tokens routed to each expert
tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
f = tokens_per_expert / batch_size
# P_i: mean gating probability per expert
P = gate_probs.mean(axis=0)
# Auxiliary loss
loss = alpha * num_experts * np.sum(f * P)
return loss, f, P
loss, f_dist, p_dist = load_balancing_loss(
raw_probs, final_probs, num_experts
)
print("Expert load distribution:")
print(f" Token fraction (f): {np.round(f_dist, 3)}")
print(f" Prob fraction (P): {np.round(p_dist, 3)}")
print(f"\nLoad balancing loss: {loss:.6f}")
print(f"Perfect balance: f = [{1/num_experts:.3f}] for each expert")
You’ll see uneven token fractions — some experts get more traffic than others. In a real training loop, the load balancing loss would gradually push the router toward even distribution. Perfect balance gives f = 0.250 for each of 4 experts.
Warning: Setting alpha too high kills model quality. The balance loss fights the routing loss. If alpha is too large, the router spreads tokens perfectly but ignores which expert is best for each token. Start with alpha=0.01. Increase only if you see expert collapse.
How Real Models Use MoE
Our mini MoE captures the core pattern. Production models scale this up, but the principles are identical.
Mixtral 8x7B brought MoE to mainstream attention [3]. Eight experts, each 7B parameters, top-2 routing. Total: about 47B parameters. Active per token: roughly 13B. You get 47B capacity at 13B compute.
| Model | Total Params | Active Params | Experts | Top-k | Key Innovation |
|---|---|---|---|---|---|
| Mixtral 8x7B | ~47B | ~13B | 8 | 2 | Sparse MoE in every transformer layer |
| DeepSeek V3 | 671B | ~37B | 256 | 8 | Fine-grained experts + loss-free balancing |
| Switch Transformer | up to 1.6T | varies | 64-2048 | 1 | Top-1 for maximum sparsity |
| GPT-4 (rumored) | ~1.8T | ~280B | 16 | 2 | Not officially confirmed |
DeepSeek V3 uses 256 fine-grained experts with top-8 routing [4]. Instead of our auxiliary loss, they add bias terms to the router. If an expert gets fewer tokens, its bias gradually increases. No separate loss term. No alpha tuning.
The parameter efficiency math is compelling. This code prints the comparison:
models = [
("Dense 13B", 13, 13, 1.0),
("Mixtral 8x7B", 47, 13, 47/13),
("DeepSeek V3", 671, 37, 671/37),
("Switch-Base 128E", 7, 0.1, 7/0.1),
]
print(f"{'Model':<20} {'Total':>8} {'Active':>8} {'Ratio':>8}")
print("-" * 48)
for name, total, active, ratio in models:
print(f"{name:<20} {total:>7.0f}B {active:>7.1f}B {ratio:>7.1f}x")
Output:
Model Total Active Ratio
------------------------------------------------
Dense 13B 13B 13.0B 1.0x
Mixtral 8x7B 47B 13.0B 3.6x
DeepSeek V3 671B 37.0B 18.1x
Switch-Base 128E 7B 0.1B 70.0x
Key Insight: MoE models trade memory for compute. You need RAM for all expert parameters, but you only pay the FLOP cost for active experts. MoE models are fast to run but expensive to host.
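A quick back-of-the-envelope makes that tradeoff concrete. The numbers below are the approximate Mixtral figures from the table, assuming fp16 weights and roughly 2 FLOPs per active parameter per token (both assumptions, not measured values):

```python
# Memory vs compute for a sparse MoE: store everything, run only top-k.
total_params = 47e9      # ~Mixtral 8x7B, all experts
active_params = 13e9     # ~parameters actually touched per token (top-2 of 8)
bytes_per_param = 2      # fp16

memory_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params  # rough: one multiply + one add per weight

print(f"Weights held in memory: ~{memory_gb:.0f} GB")
print(f"Compute per token:      ~{flops_per_token / 1e9:.0f} GFLOPs (like a dense 13B)")
```

You pay for ~94 GB of weights sitting in memory, but each token only costs dense-13B compute.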
Visualizing Expert Assignments
Let’s trace the routing pattern. This text-based heatmap shows which experts each token selected and the corresponding weights. Selected experts show their weight value. Non-selected experts show dashes.
def visualize_routing(gated_probs, tokens_labels=None):
"""Print a text-based routing heatmap."""
batch_size, num_experts = gated_probs.shape
if tokens_labels is None:
tokens_labels = [f"Token {i}" for i in range(batch_size)]
print(f"\n{'':>10}", end="")
for e in range(num_experts):
print(f" Expert {e}", end="")
print()
print("-" * (12 + num_experts * 10))
for i in range(batch_size):
print(f"{tokens_labels[i]:>10}", end="")
for e in range(num_experts):
w = gated_probs[i, e]
if w > 0:
bar = "#" * int(w * 20)
print(f" {w:.3f}{bar:>5}", end="")
else:
print(f" {'---':>8}", end="")
print()
visualize_routing(final_probs)
tokens_per_expert = (final_probs > 0).sum(axis=0)
print(f"\nTokens per expert: {tokens_per_expert}")
print(f"Ideal (uniform): {batch_size * top_k / num_experts:.1f} each")
Some experts are popular, others underused. In a trained model, this specialization is intentional. But during early training, you want balance. That’s what the load balancing loss ensures.
Adding Noise for Better Exploration
Production MoE systems inject noise into router logits during training. Without noise, the router gets stuck. If Expert 0 scores slightly higher than Expert 1 for a token, Expert 0 always wins. Expert 1 never learns from that token type.
Gaussian noise before softmax introduces exploration. Sometimes the underdog wins even when it has a lower base score. At inference time, noise is turned off for deterministic routing.
The code adds noise scaled by a learnable projection. The W_noise matrix controls how much randomness each token-expert pair receives. Larger noise means more exploration but also more variance in routing decisions.
def noisy_top_k_gating(tokens, W_gate, W_noise, top_k, training=True):
"""Router with optional noise injection for training."""
gate_logits = tokens @ W_gate
if training:
noise_scale = np.abs(tokens @ W_noise)
noise = np.random.randn(*gate_logits.shape) * noise_scale
gate_logits = gate_logits + noise
gate_probs = softmax(gate_logits)
gated_probs, top_k_idx = top_k_gating(gate_probs, top_k)
return gated_probs, gate_probs, top_k_idx
W_noise = np.random.randn(d_model, num_experts) * 0.1
gated_clean, _, experts_clean = noisy_top_k_gating(
tokens, W_gate, W_noise, top_k, training=False
)
gated_noisy, _, experts_noisy = noisy_top_k_gating(
tokens, W_gate, W_noise, top_k, training=True
)
changed = np.sum(np.any(experts_clean != experts_noisy, axis=1))
print(f"Routing changed by noise: {changed}/{batch_size} tokens")
print(f"\nClean routing: {experts_clean.tolist()}")
print(f"Noisy routing: {experts_noisy.tolist()}")
Noise flips routing for some tokens. During training, this prevents expert collapse. At inference, you turn it off.
Tip: DeepSeek V3 replaced auxiliary loss with a bias trick [4]. Each expert gets a learnable bias term in the router. Underused experts see their bias increase, attracting more traffic. No loss term. No alpha tuning. Simpler and more stable.
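A minimal sketch of that bias mechanism, heavily simplified: the bias shifts which experts get picked, and a fixed-size nudge moves it toward underloaded experts after each batch. The update rule and constants here are my illustration, not DeepSeek’s exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, num_experts, top_k, batch_size = 8, 4, 2, 6

demo_tokens = rng.standard_normal((batch_size, d_model))
demo_gate = rng.standard_normal((d_model, num_experts)) * 0.1
expert_bias = np.zeros(num_experts)  # adjusted by rule, not by gradient
update_rate = 0.01

for step in range(200):
    logits = demo_tokens @ demo_gate
    # The bias only influences WHICH experts are selected;
    # gating weights would still come from the unbiased logits.
    top_idx = np.argsort(logits + expert_bias, axis=1)[:, -top_k:]
    load = np.zeros(num_experts)
    for i in range(batch_size):
        load[top_idx[i]] += 1
    # Nudge underloaded experts up, overloaded experts down.
    target = batch_size * top_k / num_experts
    expert_bias += update_rate * np.sign(target - load)

print("Final per-expert load:", load.astype(int))
print("Final bias terms:     ", np.round(expert_bias, 3))
```

No auxiliary loss term ever enters the objective; balance emerges purely from the bias adjustment.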
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to renormalize after top-k selection
# WRONG
mask = np.zeros_like(gate_probs)
for i in range(batch_size):
mask[i, top_k_indices[i]] = 1.0
gated_probs = gate_probs * mask # Weights sum to < 1!
Why it breaks: The selected weights summed to 1 across ALL experts. After masking, they sum to less than 1. Your MoE output is scaled down by roughly half.
# CORRECT — add renormalization
gated_probs = gate_probs * mask
gated_probs = gated_probs / gated_probs.sum(axis=1, keepdims=True)
Mistake 2: Sharing initialization across experts
# WRONG — all experts are the SAME object
shared_params = init_expert(d_model, d_ff)
experts = [shared_params for _ in range(num_experts)]
Why it breaks: All experts produce identical outputs. The router can’t differentiate them. MoE degenerates into one expert repeated four times.
# CORRECT — each expert gets independent random weights
experts = [init_expert(d_model, d_ff) for _ in range(num_experts)]
Mistake 3: Setting load balancing alpha too high
# WRONG — alpha = 1.0 dominates the loss
loss = 1.0 * num_experts * np.sum(f * P)
Why it breaks: The router optimizes for perfect balance, not routing quality. Every expert gets equal traffic regardless of token content. Start with alpha=0.01. Go up to 0.1 only if you see expert collapse.
Exercise 1: Implement Expert Capacity

Real MoE systems cap the number of tokens each expert can handle per batch. Tokens that exceed capacity get dropped (output = zeros). Complete the function to implement capacity-limited routing, where capacity = int(top_k * batch_size / num_experts * capacity_factor).

def capacity_limited_routing(gated_probs, top_k, capacity_factor=1.5):
    batch_size, num_experts = gated_probs.shape
    capacity = int(top_k * batch_size / num_experts * capacity_factor)

    limited_probs = np.zeros_like(gated_probs)

    for expert_idx in range(num_experts):
        # Find tokens assigned to this expert
        assigned = np.where(gated_probs[:, expert_idx] > 0)[0]

        # TODO: Keep only the top `capacity` tokens by weight
        # Sort assigned tokens by weight, keep the highest ones
        pass

    # Renormalize rows
    row_sums = limited_probs.sum(axis=1, keepdims=True)
    row_sums = np.where(row_sums == 0, 1, row_sums)
    limited_probs = limited_probs / row_sums

    return limited_probs, capacity

result, cap = capacity_limited_routing(final_probs, top_k)
print(f"Capacity per expert: {cap}")
print(f"Tokens per expert (limited): {(result > 0).sum(axis=0).tolist()}")

Checks:
- Capacity should be 4 with batch=6, experts=4, top_k=2, factor=1.5
- Renormalized weights should sum to at most 1 per token

Hints:
- Sort the assigned indices by their gating weight for this expert. Use np.argsort and take the last capacity elements.
- Full inner loop: sorted_idx = assigned[np.argsort(gated_probs[assigned, expert_idx])][-capacity:], then limited_probs[sorted_idx, expert_idx] = gated_probs[sorted_idx, expert_idx]

Solution:

def capacity_limited_routing(gated_probs, top_k, capacity_factor=1.5):
    batch_size, num_experts = gated_probs.shape
    capacity = int(top_k * batch_size / num_experts * capacity_factor)
    limited_probs = np.zeros_like(gated_probs)
    for expert_idx in range(num_experts):
        assigned = np.where(gated_probs[:, expert_idx] > 0)[0]
        sorted_idx = assigned[np.argsort(gated_probs[assigned, expert_idx])]
        kept = sorted_idx[-capacity:]
        limited_probs[kept, expert_idx] = gated_probs[kept, expert_idx]
    row_sums = limited_probs.sum(axis=1, keepdims=True)
    row_sums = np.where(row_sums == 0, 1, row_sums)
    limited_probs = limited_probs / row_sums
    return limited_probs, capacity

For each expert, sort its assigned tokens by weight and keep only the top capacity ones. Dropped tokens get zero output — a tradeoff for computational efficiency. The capacity factor of 1.5 gives 50% headroom above perfect balance.
Exercise 2: Compute Expert Diversity Score

Write a function that measures routing diversity using Shannon entropy. Higher entropy means a more balanced distribution. Return the entropy and whether it exceeds 90% of maximum entropy (which is log(N) for N experts).

def routing_diversity(gated_probs):
    """Measure routing balance via Shannon entropy."""
    num_experts = gated_probs.shape[1]

    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    fractions = tokens_per_expert / tokens_per_expert.sum()

    # TODO: Compute entropy = -sum(p * log(p))
    # Handle p=0 case (0*log(0) should be 0)
    entropy = 0.0  # Replace this

    max_entropy = np.log(num_experts)
    is_balanced = entropy > 0.9 * max_entropy

    return round(float(entropy), 4), bool(is_balanced)

entropy, balanced = routing_diversity(final_probs)
print(f"Entropy: {entropy}")
print(f"Max entropy: {round(float(np.log(4)), 4)}")
print(f"Balanced (>90% max): {balanced}")

Checks:
- Perfectly balanced routing passes the 90% threshold
- Collapsed routing (one expert gets everything) fails the balance check

Hints:
- Handle p=0 with np.where: safe_log = np.where(fractions > 0, np.log(fractions), 0).
- Full solution: safe_log = np.where(fractions > 0, np.log(fractions), 0), then entropy = float(-np.sum(fractions * safe_log))

Solution:

def routing_diversity(gated_probs):
    num_experts = gated_probs.shape[1]
    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    fractions = tokens_per_expert / tokens_per_expert.sum()
    safe_log = np.where(fractions > 0, np.log(fractions), 0)
    entropy = float(-np.sum(fractions * safe_log))
    max_entropy = np.log(num_experts)
    is_balanced = entropy > 0.9 * max_entropy
    return round(entropy, 4), bool(is_balanced)

Shannon entropy measures uncertainty. Max entropy corresponds to a perfectly uniform distribution; collapsed routing (one expert gets everything) gives entropy = 0. We use 90% of max as a practical threshold.
When NOT to Use MoE
MoE isn’t always the right choice. Here are the real tradeoffs.
Don’t use MoE when:
- Your model is small. MoE overhead only pays off at scale. Below ~1B parameters, a dense model trains faster and deploys simpler.
- You’re memory-constrained. MoE stores ALL expert parameters. Mixtral needs ~47B in memory but only uses ~13B per forward pass. If 13B is your limit, go dense.
- You need exact reproducibility. With noise, the router can assign different experts to the same token across batches. That complicates deterministic inference.
- Single-GPU training. MoE shines when experts live on different GPUs. On one GPU, you don’t get the parallelism advantage.
Use MoE when:
- You need model capacity beyond your compute budget
- You have multi-GPU infrastructure for expert parallelism
- You’re building a general-purpose model (experts can specialize)
- Inference latency matters more than memory footprint
Under the Hood

In production, experts run on different GPUs. A typical Mixtral deployment places 2 experts per GPU across 4 GPUs. The router sends token indices to the right GPU via all-to-all communication. This is called expert parallelism — the most complex part of MoE deployment. Our NumPy version loops over experts sequentially; the production version runs them in parallel across devices, paying a communication cost instead.
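The dispatch step can be sketched in NumPy. Grouping token indices by expert is exactly what the all-to-all prepares; on a real cluster, each group would be shipped to the GPU that owns that expert. This is a toy stand-in, not a real communication layer:

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, num_experts, top_k = 6, 4, 2

# Stand-in routing result: top-2 expert indices per token.
scores = rng.random((batch_size, num_experts))
top_idx = np.argsort(scores, axis=1)[:, -top_k:]

# Dispatch: per expert, collect the token indices it must process.
dispatch = {e: [] for e in range(num_experts)}
for token, chosen in enumerate(top_idx):
    for e in chosen:
        dispatch[int(e)].append(token)

for e, toks in dispatch.items():
    print(f"Expert {e} (its own GPU) receives tokens: {toks}")
```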
Summary
You built a working Mixture of Experts layer from scratch. Here’s what each piece does:
- Router: Linear layer + softmax scores every token against every expert
- Top-k gating: Selects the best k experts and renormalizes weights
- Expert networks: Feed-forward networks (Linear-ReLU-Linear) for assigned tokens
- Load balancing loss: Prevents expert collapse by penalizing uneven routing
- Noisy gating: Noise injected during training for exploration
The key numbers: Mixtral 8x7B has 47B total but only 13B active per token (3.6x). DeepSeek V3 has 671B total with 37B active (18x).
Practice exercise: Extend the MoE to support top-1 gating like Switch Transformer. How does routing change? What happens to load balancing loss?
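A starting point for that exercise: the top_k_gating logic from earlier already handles k=1. With a single winner per token, renormalization makes every kept weight exactly 1.0. The helpers are redefined here so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(x, axis=-1):
    # Numerically stable softmax, same as in the article.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_gating(gate_probs, k):
    """Same logic as in the article: mask to top-k, then renormalize."""
    n, _ = gate_probs.shape
    top_idx = np.argsort(gate_probs, axis=1)[:, -k:]
    mask = np.zeros_like(gate_probs)
    for i in range(n):
        mask[i, top_idx[i]] = 1.0
    gated = gate_probs * mask
    return gated / gated.sum(axis=1, keepdims=True), top_idx

probs = softmax(rng.standard_normal((6, 4)))
gated, idx = top_k_gating(probs, k=1)  # Switch Transformer-style top-1

print("Chosen expert per token:", idx.ravel().tolist())
print("Kept weights:", gated.max(axis=1))  # each is exactly 1.0 after renorm
```

From there, watch how the load balancing loss behaves: with only one expert per token, f becomes spikier, so imbalance shows up faster than with top-2.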
Frequently Asked Questions
Why not make one big expert instead of many small ones?
A single huge feed-forward network processes every token the same way. MoE lets different tokens take different paths. Experts specialize automatically — one handles syntax, another handles facts, another handles reasoning. This emerges during training without supervision.
How many experts should I use?
Mixtral uses 8. DeepSeek V3 uses 256. Google tested up to 2,048 in Switch Transformer. More experts means more specialization potential but higher memory costs and harder routing. Start with 4-8 and scale from there.
What’s the difference between dense MoE and sparse MoE?
Dense MoE sends every token to every expert. Sparse MoE uses top-k to activate only a few. Nearly all modern models are sparse — dense MoE offers no compute savings. When people say “MoE” today, they mean sparse.
Can I run MoE on a single GPU?
Yes, but you won’t get the speed advantage. The full model must fit in memory. Expert parallelism — spreading experts across GPUs — is what makes MoE practical at scale. For learning (like this tutorial), single-GPU NumPy is fine.
Does MoE work with top-1 routing?
Yes. Google’s Switch Transformer uses top-1 exclusively. It’s faster but harder to train stably. Top-2 provides a fallback — if the primary expert is wrong, the secondary compensates. Most production systems use top-2.
References
- Shazeer, N. et al. — “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR 2017. arXiv:1701.06538
- Fedus, W. et al. — “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR 2022. arXiv:2101.03961
- Jiang, A.Q. et al. — “Mixtral of Experts.” Mistral AI, 2024. arXiv:2401.04088
- DeepSeek-AI — “DeepSeek-V3 Technical Report.” 2024. arXiv:2412.19437
- Lepikhin, D. et al. — “GShard: Scaling Giant Models with Conditional Computation.” ICLR 2021. arXiv:2006.16668
- Riquelme, C. et al. — “Scaling Vision with Sparse Mixture of Experts.” NeurIPS 2021. arXiv:2106.05974
- Zoph, B. et al. — “ST-MoE: Designing Stable and Transferable Sparse Expert Models.” 2022. arXiv:2202.08906
Article reviewed: March 2026. NumPy 1.26, Python 3.11.