Mixture of Experts (MoE): Build a Router in Python

Written by Selva Prabhakaran | 23 min read


Build a working MoE layer with top-k gating, expert networks, and load balancing — all in pure NumPy.

GPT-4 is rumored to have over a trillion parameters. Mixtral 8x7B has 47 billion. Yet Mixtral runs nearly as fast as a 13-billion-parameter model. How?

The secret is Mixture of Experts. Instead of running every token through all 47 billion parameters, a tiny router picks just 2 out of 8 experts per token. Each expert is a full feed-forward network, but only the selected ones activate. You get the capacity of a huge model at the cost of a small one.

That’s what you’ll build here. By the end, you’ll have a working MoE layer in NumPy — router, experts, top-k gating, and load balancing loss.

Here’s how the pieces fit together. Token embeddings enter the router — a small linear layer that scores each token against each expert. Softmax converts those scores into gating weights. Top-k selection picks the best 2 experts per token. Each selected expert processes its assigned tokens through a feed-forward network. The outputs get multiplied by gating weights and summed. A load balancing loss nudges the router to spread tokens evenly — without it, one expert hogs all the traffic.

Prerequisites

  • Python version: 3.9+
  • Required libraries: NumPy (1.24+)
  • Install: pip install numpy
  • Time to complete: 25-30 minutes

What Is Mixture of Experts?

Mixture of Experts is an architecture where multiple specialist sub-networks share a single model. A gating network decides which experts handle each input.

Think of it like a hospital. You don’t send every patient to every doctor. A triage nurse (the router) evaluates each patient and sends them to the right specialist. The hospital has the combined knowledge of all doctors, but each patient only uses one or two.

In neural networks, each “expert” is a feed-forward network. The router is a learned linear layer that maps each token to a probability distribution over experts.

Component         Role                                             Analogy
Router/Gate       Decides which experts handle each token          Triage nurse
Expert            Processes tokens through a feed-forward network  Specialist doctor
Top-k Selection   Picks the best k experts per token               "See these 2 doctors"
Load Balancing    Ensures all experts get used evenly              Fair patient distribution

We’ll create the simplest possible MoE — one router, four experts, and top-2 gating. The first code block sets up the configuration and creates random token embeddings. In a real transformer, these tokens would come from an attention layer. Here we use random vectors.

import numpy as np

np.random.seed(42)

# MoE configuration
d_model = 8       # embedding dimension
num_experts = 4   # number of expert networks
d_ff = 16         # feed-forward hidden dimension
top_k = 2         # experts selected per token
batch_size = 6    # number of tokens

# Random token embeddings (in real models, these come from attention layers)
tokens = np.random.randn(batch_size, d_model)

print(f"Tokens shape: {tokens.shape}")
print(f"Config: {num_experts} experts, top-{top_k} gating, d_model={d_model}")
Tokens shape: (6, 8)
Config: 4 experts, top-2 gating, d_model=8

Six tokens, each with 8 dimensions. Four experts are waiting. The router will decide which 2 experts each token visits.

How the Router Works

The router is the brain of MoE. It’s surprisingly simple — a single linear layer followed by a softmax.

For each token, the router produces a score for every expert. Softmax converts those scores into probabilities. The highest probabilities determine which experts get activated.

Here’s the math. Given a token embedding \(x\) and a router weight matrix \(W_g\) of shape \((d_{model}, n_{experts})\):

\[g(x) = \text{softmax}(x \cdot W_g)\]

Where:
– \(g(x)\) = gating probabilities (one per expert, sums to 1)
– \(x\) = token embedding vector
– \(W_g\) = learnable router weights

If you’re not interested in the math, skip ahead — the practical code below is all you need.

In code, this is one matrix multiply and a softmax. The W_gate matrix maps each token from d_model dimensions to num_experts scores. Softmax normalizes those into probabilities.

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


# Router: single linear layer
W_gate = np.random.randn(d_model, num_experts) * 0.1

# Compute gating scores for all tokens
gate_logits = tokens @ W_gate       # (batch_size, num_experts)
gate_probs = softmax(gate_logits)   # (batch_size, num_experts)

print("Gate probabilities (each row sums to 1):")
print(np.round(gate_probs, 3))
print(f"\nRow sums: {np.round(gate_probs.sum(axis=1), 4)}")

You’ll see a (6, 4) matrix where each row sums to 1.0. The probabilities are close to uniform because the router isn’t trained — it’s random projections. The next step picks hard winners from these soft preferences.

Key Insight: The router is just a learned linear projection. A single matrix multiply followed by softmax decides the entire expert assignment. That’s what makes MoE routing cheap.
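To put numbers on that, here is a back-of-the-envelope FLOP count for this tutorial’s configuration (a multiply-add counted as 2 FLOPs; a rough sketch, not a profiler measurement):

```python
d_model, num_experts, d_ff, top_k = 8, 4, 16, 2

# Router: one (d_model x num_experts) matmul per token
router_flops = 2 * d_model * num_experts

# One expert: d_model -> d_ff -> d_model, two matmuls
expert_flops = 2 * (d_model * d_ff) + 2 * (d_ff * d_model)

# Per token, only the top-k selected experts actually run
moe_flops = top_k * expert_flops

print(f"Router FLOPs/token:        {router_flops}")
print(f"Expert FLOPs/token (x{top_k}):  {moe_flops}")
print(f"Router share of the total: {router_flops / (router_flops + moe_flops):.1%}")
```

Even at this toy scale the router is a small fraction of the expert compute, and the gap widens in real models where d_ff is typically 4 × d_model or more.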

Top-k Gating: Picking the Winners

Why pick only k experts instead of using all of them? If you send every token to every expert, you’ve built a standard dense model. No savings at all. Top-k selection is what gives MoE its speed advantage.

Here’s how top-k gating works. For each token, find the indices of the k highest probabilities. Zero out everything else. Then renormalize the remaining weights so they sum to 1. Without renormalization, the output would be scaled down.

The code uses np.argsort to sort expert indices by probability. We take the last top_k indices (highest scores), build a mask, then divide by the new sum.

def top_k_gating(gate_probs, k):
    """Select top-k experts per token and renormalize weights."""
    batch_size, num_experts = gate_probs.shape

    # Find top-k expert indices per token
    top_k_indices = np.argsort(gate_probs, axis=1)[:, -k:]

    # Build a mask: 1 for selected experts, 0 for the rest
    mask = np.zeros_like(gate_probs)
    for i in range(batch_size):
        mask[i, top_k_indices[i]] = 1.0

    # Zero out non-selected experts and renormalize
    gated_probs = gate_probs * mask
    gated_probs = gated_probs / gated_probs.sum(axis=1, keepdims=True)

    return gated_probs, top_k_indices


gated_probs, selected_experts = top_k_gating(gate_probs, top_k)

print("Selected experts per token:")
for i in range(batch_size):
    experts_sel = selected_experts[i]
    weights = [gated_probs[i, e] for e in experts_sel]
    print(f"  Token {i}: experts {experts_sel} "
          f"with weights {[f'{w:.3f}' for w in weights]}")

Each token now has exactly 2 experts assigned with renormalized weights summing to 1. The gating weights control how much each expert contributes to the final output.

Quick check — can you spot a potential problem? Some experts might appear frequently while others sit idle. We’ll fix this with load balancing later.

Building Expert Networks

Each expert is a standard feed-forward network: linear layer, ReLU, second linear layer. This is the same structure inside a transformer block. The only difference is that MoE has multiple copies, and each token visits a subset.

We’ll initialize four experts. Each has two weight matrices: W1 maps from d_model to d_ff (expanding the representation), and W2 maps back from d_ff to d_model. ReLU sits between them. We use He initialization (* np.sqrt(2.0 / fan_in)) because it prevents gradient issues in ReLU networks.

def relu(x):
    """ReLU activation: max(0, x)."""
    return np.maximum(0, x)


def init_expert(d_model, d_ff):
    """Initialize one expert's weights (two linear layers)."""
    return {
        'W1': np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model),
        'W2': np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff),
    }


def expert_forward(x, expert_params):
    """Forward pass: Linear -> ReLU -> Linear."""
    hidden = relu(x @ expert_params['W1'])
    output = hidden @ expert_params['W2']
    return output


# Create all experts
experts = [init_expert(d_model, d_ff) for _ in range(num_experts)]

# Test one expert with a single token
test_output = expert_forward(tokens[0:1], experts[0])
print(f"Input shape:  {tokens[0:1].shape}")
print(f"Output shape: {test_output.shape}")
Input shape:  (1, 8)
Output shape: (1, 8)

The expert takes in a (1, 8) token and returns a (1, 8) output. Dimensionality stays the same — d_model in, d_model out. Internally it expands to d_ff=16 dimensions before compressing back. This expansion-compression pattern adds representational capacity.

Tip: For GELU or SiLU activations (used in Llama and Mistral), switch to Xavier initialization. He initialization is designed specifically for ReLU; using a mismatched scheme can cause training instability.
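As a sketch of the difference between the two schemes — init_expert_xavier below is a hypothetical variant, not used elsewhere in this tutorial:

```python
import numpy as np

def init_expert_xavier(d_model, d_ff, rng=None):
    """Hypothetical Xavier-initialized expert, for GELU/SiLU-style activations."""
    rng = rng or np.random.default_rng(0)
    return {
        # Xavier/Glorot: std = sqrt(2 / (fan_in + fan_out))
        'W1': rng.normal(0.0, np.sqrt(2.0 / (d_model + d_ff)), (d_model, d_ff)),
        'W2': rng.normal(0.0, np.sqrt(2.0 / (d_ff + d_model)), (d_ff, d_model)),
    }

# Compare weight scales at this tutorial's sizes (d_model=8, d_ff=16)
he_std = np.sqrt(2.0 / 8)             # He: sqrt(2 / fan_in)
xavier_std = np.sqrt(2.0 / (8 + 16))  # Xavier: sqrt(2 / (fan_in + fan_out))
print(f"He std (W1):     {he_std:.3f}")
print(f"Xavier std (W1): {xavier_std:.3f}")
```

Xavier accounts for fan-out as well as fan-in, so its weights come out smaller — which suits activations that pass negative values through.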

The MoE Forward Pass

This is where everything connects. The forward pass combines the router, top-k gating, and expert networks into one function.

Here’s the data flow:

  1. Router scores each token against all experts
  2. Top-k gating picks the best 2 experts per token
  3. Each selected expert processes its assigned tokens
  4. Outputs are scaled by gating weights and summed

The implementation loops over experts. For each one, it finds which tokens selected that expert, runs them through the network, scales by gating weight, and accumulates. This approach is cleaner than looping over tokens because each expert processes a batch at once.

def moe_forward(tokens, W_gate, experts, top_k):
    """Full MoE forward pass: route, gate, compute, combine."""
    batch_size, d_model = tokens.shape

    # Step 1: Router produces gating probabilities
    gate_logits = tokens @ W_gate
    gate_probs = softmax(gate_logits)

    # Step 2: Top-k selection
    gated_probs, top_k_indices = top_k_gating(gate_probs, top_k)

    # Step 3 & 4: Run selected experts and combine
    output = np.zeros_like(tokens)

    for expert_idx in range(len(experts)):
        token_mask = gated_probs[:, expert_idx] > 0
        if not token_mask.any():
            continue

        expert_tokens = tokens[token_mask]
        expert_weights = gated_probs[token_mask, expert_idx]

        expert_out = expert_forward(expert_tokens, experts[expert_idx])
        weighted_out = expert_out * expert_weights[:, np.newaxis]

        output[token_mask] += weighted_out

    return output, gate_probs, gated_probs


# Run the MoE layer
moe_output, raw_probs, final_probs = moe_forward(
    tokens, W_gate, experts, top_k
)

print(f"Input shape:  {tokens.shape}")
print(f"Output shape: {moe_output.shape}")
Input shape:  (6, 8)
Output shape: (6, 8)

The shape is preserved — (6, 8) in, (6, 8) out. But every token was processed by its own pair of experts. Each got a personalized computation through a different expert combination.

What I find elegant about this: the router learns (during training) which expert handles which “type” of token. In a language model, one expert might specialize in syntax, another in facts, another in code. The model discovers this on its own.

The Load Balancing Problem

Without load balancing, MoE training collapses. Here’s why.

Early in training, the router assigns tokens randomly. Some experts get lucky — they receive slightly better tokens and improve faster. The router notices and sends more tokens their way. This creates a vicious cycle. Within a few hundred steps, one expert handles 90% of tokens while the others sit idle.

This is called expert collapse. It’s the biggest failure mode in MoE training.
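You can watch the vicious cycle in a toy simulation. The sketch below is a deliberately crude stand-in for training: whichever expert wins a token gets its router logit nudged up, mimicking the rich-get-richer feedback (assumed dynamics, not a real gradient update):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, steps, lr = 4, 500, 0.1

scores = np.zeros(num_experts)   # toy router logits, shared by all tokens
counts = np.zeros(num_experts)   # how many tokens each expert has won

for _ in range(steps):
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    winner = rng.choice(num_experts, p=probs)  # route one token
    scores[winner] += lr   # winner "improves", so the router favors it more
    counts[winner] += 1

print("Token share per expert:", np.round(counts / steps, 2))
print("Final routing probs:   ", np.round(probs, 3))
```

Run it with a few different seeds: one expert almost always ends up with the bulk of the traffic, even though all four started identical.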

The fix is a load balancing loss. Google’s Switch Transformer paper defined the standard formula [2]. For each expert, track two things: what fraction of tokens it receives (\(f_i\)), and what fraction of the total gating probability it gets (\(P_i\)).

\[L_{balance} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i\]

Where:
– \(N\) = number of experts
– \(f_i\) = fraction of tokens routed to expert \(i\)
– \(P_i\) = mean gating probability assigned to expert \(i\)
– \(\alpha\) = balancing coefficient (typically 0.01–0.1)

Why multiply \(f_i\) and \(P_i\)? An expert with many tokens AND high probability is hogging resources. The loss pushes the router to spread the load.

The code below computes both fractions. alpha=0.01 is a safe starting value — typical production values range from 0.01 to 0.1.

def load_balancing_loss(gate_probs, gated_probs, num_experts, alpha=0.01):
    """Compute auxiliary load balancing loss."""
    batch_size = gate_probs.shape[0]

    # f_i: fraction of tokens routed to each expert
    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    f = tokens_per_expert / batch_size

    # P_i: mean gating probability per expert
    P = gate_probs.mean(axis=0)

    # Auxiliary loss
    loss = alpha * num_experts * np.sum(f * P)

    return loss, f, P


loss, f_dist, p_dist = load_balancing_loss(
    raw_probs, final_probs, num_experts
)

print("Expert load distribution:")
print(f"  Token fraction (f):  {np.round(f_dist, 3)}")
print(f"  Prob fraction (P):   {np.round(p_dist, 3)}")
print(f"\nLoad balancing loss: {loss:.6f}")
print(f"Perfect balance: f = {top_k/num_experts:.3f} for each expert")

You’ll see uneven token fractions — some experts get more traffic than others. In a real training loop, the load balancing loss would gradually push the router toward an even distribution. With top-2 gating and 4 experts, perfect balance gives f = 0.500 for each expert, since every token is routed to 2 of them.
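To see the loss doing its job, compare it on two hand-built routing patterns (the formula is restated as a small helper so this block runs standalone):

```python
import numpy as np

def aux_loss(f, P, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum(f_i * P_i)."""
    return alpha * len(f) * np.sum(f * P)

N = 4
# Perfect balance under top-2 gating: each expert sees half the tokens,
# and the mean gate probability is uniform
f_bal = np.full(N, 0.5)
P_bal = np.full(N, 0.25)

# Expert collapse: one expert takes all the traffic and probability mass
f_col = np.array([1.0, 0.0, 0.0, 0.0])
P_col = np.array([0.97, 0.01, 0.01, 0.01])

print(f"Balanced loss:  {aux_loss(f_bal, P_bal):.4f}")
print(f"Collapsed loss: {aux_loss(f_col, P_col):.4f}")
```

The collapsed pattern scores roughly twice the balanced one, so gradient descent on this term pushes the router back toward balance.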

Warning: Setting alpha too high kills model quality. The balance loss fights the main task loss: if alpha is too large, the router spreads tokens perfectly but ignores which expert is best for each token. Start with alpha=0.01 and increase only if you see expert collapse.

How Real Models Use MoE

Our mini MoE captures the core pattern. Production models scale this up, but the principles are identical.

Mixtral 8x7B brought MoE to mainstream attention [3]. Eight experts per layer with top-2 routing. The experts replace only the feed-forward blocks and share the attention layers, so the total is about 47B parameters rather than 8 × 7B. Active per token: roughly 13B. You get 47B capacity at 13B compute.

Model                Total Params   Active Params   Experts    Top-k   Key Innovation
Mixtral 8x7B         ~47B           ~13B            8          2       Sparse MoE in every transformer layer
DeepSeek V3          671B           ~37B            256        8       Fine-grained experts + loss-free balancing
Switch Transformer   up to 1.6T     varies          64-2048    1       Top-1 for maximum sparsity
GPT-4 (rumored)      ~1.8T          ~280B           16         2       Not officially confirmed

DeepSeek V3 uses 256 fine-grained experts with top-8 routing [4]. Instead of our auxiliary loss, they add bias terms to the router. If an expert gets fewer tokens, its bias gradually increases. No separate loss term. No alpha tuning.

The parameter efficiency math is compelling. This code prints the comparison:

models = [
    ("Dense 13B", 13, 13, 1.0),
    ("Mixtral 8x7B", 47, 13, 47/13),
    ("DeepSeek V3", 671, 37, 671/37),
    ("Switch-Base 128E", 7, 0.1, 7/0.1),
]

print(f"{'Model':<20} {'Total':>8} {'Active':>8} {'Ratio':>8}")
print("-" * 48)
for name, total, active, ratio in models:
    print(f"{name:<20} {total:>7.0f}B {active:>7.1f}B {ratio:>7.1f}x")
Model                   Total   Active    Ratio
------------------------------------------------
Dense 13B                  13B    13.0B     1.0x
Mixtral 8x7B               47B    13.0B     3.6x
DeepSeek V3               671B    37.0B    18.1x
Switch-Base 128E            7B     0.1B    70.0x
Key Insight: MoE models trade memory for compute. You need RAM for all expert parameters, but you only pay the FLOP cost for active experts. MoE models are fast to run but expensive to host.

Visualizing Expert Assignments

Let’s trace the routing pattern. This text-based heatmap shows which experts each token selected and the corresponding weights. Selected experts show their weight value. Non-selected experts show dashes.

def visualize_routing(gated_probs, tokens_labels=None):
    """Print a text-based routing heatmap."""
    batch_size, num_experts = gated_probs.shape

    if tokens_labels is None:
        tokens_labels = [f"Token {i}" for i in range(batch_size)]

    print(f"\n{'':>10}", end="")
    for e in range(num_experts):
        print(f"  Expert {e}", end="")
    print()
    print("-" * (12 + num_experts * 10))

    for i in range(batch_size):
        print(f"{tokens_labels[i]:>10}", end="")
        for e in range(num_experts):
            w = gated_probs[i, e]
            if w > 0:
                bar = "#" * int(w * 20)
                print(f"  {w:.3f}{bar:>5}", end="")
            else:
                print(f"  {'---':>8}", end="")
        print()


visualize_routing(final_probs)

tokens_per_expert = (final_probs > 0).sum(axis=0)
print(f"\nTokens per expert: {tokens_per_expert}")
print(f"Ideal (uniform): {batch_size * top_k / num_experts:.1f} each")

Some experts are popular, others underused. In a trained model, this specialization is intentional. But during early training, you want balance. That’s what the load balancing loss ensures.

Adding Noise for Better Exploration

Production MoE systems inject noise into router logits during training. Without noise, the router gets stuck. If Expert 0 scores slightly higher than Expert 1 for a token, Expert 0 always wins. Expert 1 never learns from that token type.

Gaussian noise before softmax introduces exploration. Sometimes the underdog wins even when it has a lower base score. At inference time, noise is turned off for deterministic routing.

The code adds noise scaled by a learnable projection. The W_noise matrix controls how much randomness each token-expert pair receives. Larger noise means more exploration but also more variance in routing decisions.

def noisy_top_k_gating(tokens, W_gate, W_noise, top_k, training=True):
    """Router with optional noise injection for training."""
    gate_logits = tokens @ W_gate

    if training:
        noise_scale = np.abs(tokens @ W_noise)
        noise = np.random.randn(*gate_logits.shape) * noise_scale
        gate_logits = gate_logits + noise

    gate_probs = softmax(gate_logits)
    gated_probs, top_k_idx = top_k_gating(gate_probs, top_k)

    return gated_probs, gate_probs, top_k_idx


W_noise = np.random.randn(d_model, num_experts) * 0.1

gated_clean, _, experts_clean = noisy_top_k_gating(
    tokens, W_gate, W_noise, top_k, training=False
)
gated_noisy, _, experts_noisy = noisy_top_k_gating(
    tokens, W_gate, W_noise, top_k, training=True
)

changed = np.sum(np.any(experts_clean != experts_noisy, axis=1))
print(f"Routing changed by noise: {changed}/{batch_size} tokens")
print(f"\nClean routing:  {experts_clean.tolist()}")
print(f"Noisy routing:  {experts_noisy.tolist()}")

Noise flips routing for some tokens. During training, this prevents expert collapse. At inference, you turn it off.

Tip: DeepSeek V3 replaced auxiliary loss with a bias trick [4]. Each expert gets a learnable bias term in the router. Underused experts see their bias increase, attracting more traffic. No loss term. No alpha tuning. Simpler and more stable.
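Here is a minimal sketch of that mechanism, with assumed update rules — a fixed-size bias nudge per step based on whether each expert is over- or under-loaded. The real paper’s update details differ; as in DeepSeek V3, the bias here affects only expert selection, not the gating weights:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, bias_lr = 4, 2, 0.01
expert_bias = np.zeros(num_experts)

def route_with_bias(gate_logits, bias, k):
    """Pick top-k experts by biased score (gating weights would use raw logits)."""
    return np.argsort(gate_logits + bias, axis=1)[:, -k:]

# Simulated routing steps: the router unfairly favors expert 0 at the start
for _ in range(300):
    logits = rng.normal(size=(6, num_experts))
    logits[:, 0] += 1.0                       # built-in routing skew
    selected = route_with_bias(logits, expert_bias, top_k)
    load = np.bincount(selected.ravel(), minlength=num_experts)
    target = selected.size / num_experts
    # Fixed-size nudge: underloaded experts gain bias, overloaded lose it
    expert_bias += bias_lr * np.sign(target - load)

print("Learned bias per expert:", np.round(expert_bias, 3))
```

Expert 0 starts with an unfair advantage; after a few hundred steps its learned bias turns negative, cancelling the advantage without any auxiliary loss term.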

Common Mistakes and How to Fix Them

Mistake 1: Forgetting to renormalize after top-k selection

# WRONG
mask = np.zeros_like(gate_probs)
for i in range(batch_size):
    mask[i, top_k_indices[i]] = 1.0
gated_probs = gate_probs * mask  # Weights sum to < 1!

Why it breaks: The selected weights summed to 1 across ALL experts. After masking, they sum to less than 1. Your MoE output is scaled down by roughly half.

# CORRECT — add renormalization
gated_probs = gate_probs * mask
gated_probs = gated_probs / gated_probs.sum(axis=1, keepdims=True)

Mistake 2: Sharing initialization across experts

# WRONG — all experts are the SAME object
shared_params = init_expert(d_model, d_ff)
experts = [shared_params for _ in range(num_experts)]

Why it breaks: All experts produce identical outputs. The router can’t differentiate them. MoE degenerates into one expert repeated four times.

# CORRECT — each expert gets independent random weights
experts = [init_expert(d_model, d_ff) for _ in range(num_experts)]

Mistake 3: Setting load balancing alpha too high

# WRONG — alpha = 1.0 dominates the loss
loss = 1.0 * num_experts * np.sum(f * P)

Why it breaks: The router optimizes for perfect balance, not routing quality. Every expert gets equal traffic regardless of token content. Start with alpha=0.01. Go up to 0.1 only if you see expert collapse.

Exercise 1: Implement Expert Capacity

Real MoE systems cap the number of tokens each expert can handle per batch. Tokens that exceed capacity get dropped (output = zeros). Complete the function to implement capacity-limited routing, where capacity = int(top_k * batch_size / num_experts * capacity_factor).

def capacity_limited_routing(gated_probs, top_k, capacity_factor=1.5):
    batch_size, num_experts = gated_probs.shape
    capacity = int(top_k * batch_size / num_experts * capacity_factor)

    limited_probs = np.zeros_like(gated_probs)

    for expert_idx in range(num_experts):
        # Find tokens assigned to this expert
        assigned = np.where(gated_probs[:, expert_idx] > 0)[0]

        # TODO: Keep only the top `capacity` tokens by weight
        # Sort assigned tokens by weight, keep the highest ones
        pass

    # Renormalize rows
    row_sums = limited_probs.sum(axis=1, keepdims=True)
    row_sums = np.where(row_sums == 0, 1, row_sums)
    limited_probs = limited_probs / row_sums

    return limited_probs, capacity

result, cap = capacity_limited_routing(final_probs, top_k)
print(f"Capacity per expert: {cap}")
print(f"Tokens per expert (limited): {(result > 0).sum(axis=0).tolist()}")

Hints:

  • Sort the assigned indices by their gating weight for this expert. Use np.argsort and take the last capacity elements.
  • Full inner loop:
    sorted_idx = assigned[np.argsort(gated_probs[assigned, expert_idx])][-capacity:]
    limited_probs[sorted_idx, expert_idx] = gated_probs[sorted_idx, expert_idx]

Solution

def capacity_limited_routing(gated_probs, top_k, capacity_factor=1.5):
    batch_size, num_experts = gated_probs.shape
    capacity = int(top_k * batch_size / num_experts * capacity_factor)
    limited_probs = np.zeros_like(gated_probs)
    for expert_idx in range(num_experts):
        assigned = np.where(gated_probs[:, expert_idx] > 0)[0]
        sorted_idx = assigned[np.argsort(gated_probs[assigned, expert_idx])]
        kept = sorted_idx[-capacity:]
        limited_probs[kept, expert_idx] = gated_probs[kept, expert_idx]
    row_sums = limited_probs.sum(axis=1, keepdims=True)
    row_sums = np.where(row_sums == 0, 1, row_sums)
    limited_probs = limited_probs / row_sums
    return limited_probs, capacity

For each expert, sort its assigned tokens by weight and keep only the top capacity ones. Dropped tokens get zero output — a tradeoff for computational efficiency. The capacity factor of 1.5 gives 50% headroom above perfect balance. As a check: with batch_size=6, num_experts=4, top_k=2, and factor 1.5, capacity = int(4.5) = 4, and every row of the result should sum to at most 1.

Exercise 2: Compute Expert Diversity Score

Write a function that measures routing diversity using Shannon entropy. Higher entropy means a more balanced distribution. Return the entropy and whether it exceeds 90% of maximum entropy (which is log(N) for N experts).

def routing_diversity(gated_probs):
    """Measure routing balance via Shannon entropy."""
    num_experts = gated_probs.shape[1]

    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    fractions = tokens_per_expert / tokens_per_expert.sum()

    # TODO: Compute entropy = -sum(p * log(p))
    # Handle p=0 case (0*log(0) should be 0)
    entropy = 0.0  # Replace this

    max_entropy = np.log(num_experts)
    is_balanced = entropy > 0.9 * max_entropy

    return round(float(entropy), 4), bool(is_balanced)

entropy, balanced = routing_diversity(final_probs)
print(f"Entropy: {entropy}")
print(f"Max entropy: {round(float(np.log(4)), 4)}")
print(f"Balanced (>90% max): {balanced}")

Hints:

  • Handle p=0 with np.where: safe_log = np.where(fractions > 0, np.log(fractions), 0).
  • Full solution:
    safe_log = np.where(fractions > 0, np.log(fractions), 0)
    entropy = float(-np.sum(fractions * safe_log))

Solution

def routing_diversity(gated_probs):
    num_experts = gated_probs.shape[1]
    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    fractions = tokens_per_expert / tokens_per_expert.sum()
    safe_log = np.where(fractions > 0, np.log(fractions), 0)
    entropy = float(-np.sum(fractions * safe_log))
    max_entropy = np.log(num_experts)
    is_balanced = entropy > 0.9 * max_entropy
    return round(entropy, 4), bool(is_balanced)

Shannon entropy measures uncertainty. Max entropy corresponds to a perfectly uniform distribution; collapsed routing (one expert gets everything) gives entropy = 0. We use 90% of max as a practical threshold. As a check: perfectly balanced routing (every expert used equally) should return True, while collapsed routing should return False.

When NOT to Use MoE

MoE isn’t always the right choice. Here are the real tradeoffs.

Don’t use MoE when:

  • Your model is small. MoE overhead only pays off at scale. Below ~1B parameters, a dense model trains faster and deploys simpler.
  • You’re memory-constrained. MoE stores ALL expert parameters. Mixtral needs ~47B in memory but only uses ~13B per forward pass. If 13B is your limit, go dense.
  • You need exact reproducibility. With noise, the router can assign different experts to the same token across batches. That complicates deterministic inference.
  • Single-GPU training. MoE shines when experts live on different GPUs. On one GPU, you don’t get the parallelism advantage.

Use MoE when:

  • You need model capacity beyond your compute budget
  • You have multi-GPU infrastructure for expert parallelism
  • You’re building a general-purpose model (experts can specialize)
  • Inference latency matters more than memory footprint

[UNDER THE HOOD]
In production, experts run on different GPUs. A typical Mixtral deployment places 2 experts per GPU across 4 GPUs. The router sends token indices to the right GPU via all-to-all communication. This is called expert parallelism — the most complex part of MoE deployment. Our NumPy version loops over experts sequentially; a production version runs them in parallel across devices.
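A toy version of the dispatch step can be sketched in a few lines (an assumed simplification; real systems batch this into fused all-to-all kernels, and here the routing itself is random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, batch_size, top_k = 4, 6, 2

# Stand-in routing: each token picks top_k distinct experts at random
assignments = np.stack([
    rng.choice(num_experts, size=top_k, replace=False)
    for _ in range(batch_size)
])

# Dispatch: group token indices by expert, the way an all-to-all step
# would before shipping each group to that expert's GPU
dispatch = {e: [] for e in range(num_experts)}
for token_idx, chosen in enumerate(assignments):
    for e in chosen:
        dispatch[int(e)].append(token_idx)

for e, toks in dispatch.items():
    print(f"expert {e} (GPU {e}): tokens {toks}")
```

Each GPU then runs an ordinary batched feed-forward over its group, and a second all-to-all returns the outputs to their original token positions.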

Complete Code

Full script (copy-paste and run):
# Complete code from: Mixture of Experts (MoE) in Python
# Requires: pip install numpy
# Python 3.9+

import numpy as np

np.random.seed(42)

# --- Configuration ---
d_model = 8
num_experts = 4
d_ff = 16
top_k = 2
batch_size = 6

# --- Token embeddings ---
tokens = np.random.randn(batch_size, d_model)

# --- Helper functions ---
def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def relu(x):
    return np.maximum(0, x)

def init_expert(d_model, d_ff):
    return {
        'W1': np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model),
        'W2': np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff),
    }

def expert_forward(x, expert_params):
    hidden = relu(x @ expert_params['W1'])
    output = hidden @ expert_params['W2']
    return output

def top_k_gating(gate_probs, k):
    batch_size, num_experts = gate_probs.shape
    top_k_indices = np.argsort(gate_probs, axis=1)[:, -k:]
    mask = np.zeros_like(gate_probs)
    for i in range(batch_size):
        mask[i, top_k_indices[i]] = 1.0
    gated_probs = gate_probs * mask
    gated_probs = gated_probs / gated_probs.sum(axis=1, keepdims=True)
    return gated_probs, top_k_indices

def load_balancing_loss(gate_probs, gated_probs, num_experts, alpha=0.01):
    batch_size = gate_probs.shape[0]
    tokens_per_expert = (gated_probs > 0).astype(float).sum(axis=0)
    f = tokens_per_expert / batch_size
    P = gate_probs.mean(axis=0)
    loss = alpha * num_experts * np.sum(f * P)
    return loss, f, P

def moe_forward(tokens, W_gate, experts, top_k):
    batch_size, d_model = tokens.shape
    gate_logits = tokens @ W_gate
    gate_probs = softmax(gate_logits)
    gated_probs, top_k_indices = top_k_gating(gate_probs, top_k)
    output = np.zeros_like(tokens)
    for expert_idx in range(len(experts)):
        token_mask = gated_probs[:, expert_idx] > 0
        if not token_mask.any():
            continue
        expert_tokens = tokens[token_mask]
        expert_weights = gated_probs[token_mask, expert_idx]
        expert_out = expert_forward(expert_tokens, experts[expert_idx])
        weighted_out = expert_out * expert_weights[:, np.newaxis]
        output[token_mask] += weighted_out
    return output, gate_probs, gated_probs

# --- Run MoE ---
W_gate = np.random.randn(d_model, num_experts) * 0.1
experts = [init_expert(d_model, d_ff) for _ in range(num_experts)]

moe_output, raw_probs, final_probs = moe_forward(tokens, W_gate, experts, top_k)

print(f"Input shape:  {tokens.shape}")
print(f"Output shape: {moe_output.shape}")

# --- Load balancing ---
loss, f_dist, p_dist = load_balancing_loss(raw_probs, final_probs, num_experts)
print(f"\nLoad balancing loss: {loss:.6f}")
print(f"Token distribution:  {np.round(f_dist, 3)}")
print(f"Perfect balance:     {[round(top_k/num_experts, 3)] * num_experts}")

print("\nScript completed successfully.")

Summary

You built a working Mixture of Experts layer from scratch. Here’s what each piece does:

  • Router: Linear layer + softmax scores every token against every expert
  • Top-k gating: Selects the best k experts and renormalizes weights
  • Expert networks: Feed-forward networks (Linear-ReLU-Linear) for assigned tokens
  • Load balancing loss: Prevents expert collapse by penalizing uneven routing
  • Noisy gating: Noise injected during training for exploration

The key numbers: Mixtral 8x7B has 47B total but only 13B active per token (3.6x). DeepSeek V3 has 671B total with 37B active (18x).

Practice exercise: Extend the MoE to support top-1 gating like Switch Transformer. How does routing change? What happens to load balancing loss?

Solution
moe_out_top1, probs_top1, gated_top1 = moe_forward(
    tokens, W_gate, experts, top_k=1
)

loss_top1, f1, p1 = load_balancing_loss(probs_top1, gated_top1, num_experts)
print(f"Top-1 token distribution: {np.round(f1, 3)}")
print(f"Top-1 balance loss: {loss_top1:.6f}")
print(f"Top-2 balance loss: {loss:.6f}")

# Top-1 makes balance harder — each token picks only ONE expert
# The router has less flexibility to spread load

Top-1 is more sparse (faster) but harder to balance. Switch Transformer pairs top-1 with a capacity factor of 1.25 and aggressive load balancing.

Frequently Asked Questions

Why not make one big expert instead of many small ones?

A single huge feed-forward network processes every token the same way. MoE lets different tokens take different paths. Experts specialize automatically — one handles syntax, another handles facts, another handles reasoning. This emerges during training without supervision.

How many experts should I use?

Mixtral uses 8. DeepSeek V3 uses 256. Google tested up to 2,048 in Switch Transformer. More experts means more specialization potential but higher memory costs and harder routing. Start with 4-8 and scale from there.

What’s the difference between dense MoE and sparse MoE?

Dense MoE sends every token to every expert. Sparse MoE uses top-k to activate only a few. Nearly all modern models are sparse — dense MoE offers no compute savings. When people say “MoE” today, they mean sparse.
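The compute gap is easy to quantify. A rough per-token FLOP count for the feed-forward path, using this tutorial’s toy configuration (multiply-add counted as 2 FLOPs):

```python
d_model, d_ff, num_experts, top_k = 8, 16, 4, 2

# One expert = two matmuls (d_model -> d_ff -> d_model)
expert_flops = 2 * (d_model * d_ff) + 2 * (d_ff * d_model)

dense_moe = num_experts * expert_flops   # every token visits every expert
sparse_moe = top_k * expert_flops        # only the top-k experts run

print(f"Dense MoE FLOPs/token:  {dense_moe}")
print(f"Sparse MoE FLOPs/token: {sparse_moe}")
print(f"Compute savings: {num_experts / top_k:.1f}x")
```

The savings ratio is simply num_experts / top_k: 2x here, 4x for Mixtral (8 experts, top-2), and higher still for top-1 routing over many experts.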

Can I run MoE on a single GPU?

Yes, but you won’t get the speed advantage. The full model must fit in memory. Expert parallelism — spreading experts across GPUs — is what makes MoE practical at scale. For learning (like this tutorial), single-GPU NumPy is fine.

Does MoE work with top-1 routing?

Yes. Google’s Switch Transformer uses top-1 exclusively. It’s faster but harder to train stably. Top-2 provides a fallback — if the primary expert is wrong, the secondary compensates. Most production systems use top-2.

References

  1. Shazeer, N. et al. — “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR 2017. arXiv:1701.06538
  2. Fedus, W. et al. — “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR 2022. arXiv:2101.03961
  3. Jiang, A.Q. et al. — “Mixtral of Experts.” Mistral AI, 2024. arXiv:2401.04088
  4. DeepSeek-AI — “DeepSeek-V3 Technical Report.” 2024. arXiv:2412.19437
  5. Lepikhin, D. et al. — “GShard: Scaling Giant Models with Conditional Computation.” ICLR 2021. arXiv:2006.16668
  6. Riquelme, C. et al. — “Scaling Vision with Sparse Mixture of Experts.” NeurIPS 2021. arXiv:2106.05974
  7. Zoph, B. et al. — “ST-MoE: Designing Stable and Transferable Sparse Expert Models.” 2022. arXiv:2202.08906

Article reviewed: March 2026. NumPy 1.26, Python 3.11.
