Lesson 1059 of 1570
Mixture of Experts — Why GPT-4 Is Smarter Than It Looks
MoE models route each token to a few 'specialist' sub-networks: the same total size as a dense model, but far cheaper to run per token.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. The big idea
- 2. Why Mixture-of-Experts Models Like Mixtral Run Cheaper Than They Look
- 3. The big idea
Section 1
The big idea
Most modern frontier models (GPT-4, Mixtral, DeepSeek, Llama 4) are 'mixture of experts' (MoE) models. Instead of one giant brain, they have many smaller 'expert' brains. For each token (roughly, each word) generated, only a few experts activate: a model might hold a trillion parameters in total but only 'use' a few tens of billions per token. Nearly the same quality, at a fraction of the cost per token.
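To make the routing concrete, here is a minimal toy sketch in plain NumPy (hypothetical names and sizes, nothing like a real model's implementation): a small gating layer scores every expert for the incoming token, the top two run, and their outputs are blended by the gate weights.

```python
import numpy as np

def top2_route(token_vec, gate_weights, experts):
    """Send one token through its top-2 experts (toy sketch of MoE routing)."""
    scores = gate_weights @ token_vec              # one routing score per expert
    top2 = np.argsort(scores)[-2:]                 # indices of the 2 highest-scoring experts
    gates = np.exp(scores[top2] - scores[top2].max())
    gates /= gates.sum()                           # softmax over just the chosen experts
    # Only these 2 experts run; the others stay idle for this token.
    return sum(g * experts[i](token_vec) for g, i in zip(gates, top2))

# Toy setup: 8 "experts", each just a random linear map.
rng = np.random.default_rng(0)
d = 16
experts = [(lambda x, W=rng.normal(size=(d, d)): W @ x) for _ in range(8)]
gate_weights = rng.normal(size=(8, d))

print(top2_route(rng.normal(size=d), gate_weights, experts).shape)  # (16,)
```

The gate is just another learned layer: it decides, per token, which specialists are worth waking up.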
Some examples
- Mixtral 8x7B: 8 experts, picks 2 per token. Looks like a 47B model, runs like a 13B.
- GPT-4 is widely believed to be MoE based on inference behavior (OpenAI hasn't confirmed).
- DeepSeek V3 has 671B total parameters but activates only 37B per token.
- Llama 4 went MoE in 2025 to keep up with this efficiency trick.
Try it!
Read the Mixtral paper's intro on Mistral's website. Notice the math: total params vs active params.
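As a back-of-the-envelope version of that math, using the approximate figures quoted in this lesson (not official spec sheets):

```python
# Mixtral 8x7B, rough numbers from this lesson (approximate).
total_params  = 47e9   # all 8 experts plus the shared layers
active_params = 13e9   # what one token actually touches: 2 experts plus the shared layers

print(f"Active fraction per token: {active_params / total_params:.0%}")  # ~28%
# Per-token compute tracks the *active* parameters, so it runs roughly like a 13B dense
# model while the full 47B of stored knowledge is still available to the router.
```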
Section 2
Why Mixture-of-Experts Models Like Mixtral Run Cheaper Than They Look
Section 3
The big idea
Mixture-of-Experts models like Mixtral or DeepSeek have huge total parameter counts but only activate a handful of 'expert' subnetworks per token. The result: capability close to a much larger model at the inference cost of a smaller one.
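One way to put a number on "cheaper": a common rule of thumb is that forward-pass compute is roughly two FLOPs per parameter that actually fires for a token. The comparison below against a dense ~70B model is an assumption for illustration, not a benchmark:

```python
def flops_per_token(params_used):
    # Rule of thumb: forward pass costs ~2 FLOPs per parameter actually used for this token.
    return 2 * params_used

dense_70b = flops_per_token(70e9)   # a dense ~70B model touches all 70B params every token
mixtral   = flops_per_token(13e9)   # Mixtral touches only its ~13B active params (of 47B total)

print(f"~{dense_70b / mixtral:.1f}x less compute per token")  # ~5.4x
```

Real serving cost also depends on memory, bandwidth, and batching, but the active-parameter count is the main lever.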
Some examples
- Mixtral 8x7B has 47B total params but uses only ~13B per token at inference.
- DeepSeek V3's MoE design lets it match GPT-4-class quality at a fraction of the serving cost.
- Mistral's MoE variants run faster than dense models of similar quality on the same GPU.
- Some Llama variants now use MoE to compete with closed models cost-effectively.
Try it!
Read the spec page for Mixtral or DeepSeek V3. Note the gap between total and active parameters. That's the trick.
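To see that gap written out as a ratio, here are the two examples from this lesson again (approximate figures, not official datasheets):

```python
models = {
    "Mixtral 8x7B": (47e9, 13e9),    # (total params, active params per token), approximate
    "DeepSeek V3":  (671e9, 37e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {total/1e9:.0f}B total, {active/1e9:.0f}B active -> {total/active:.1f}x gap")
```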
Related lessons
Keep going
Creators · 11 min
Mixture-of-Experts Models: What MoE Means for Your Latency and Cost
How MoE architecture (Mixtral, DeepSeek, GPT-MoE) changes pricing and behavior.
Creators · 40 min
Mixture-of-Experts Models: Mixtral, DeepSeek, Qwen MoE
How MoE models work and when they're the right choice for your stack.
Creators · 20 min
Mixtral and MoE: Many Experts, Fewer Active Weights
Mixtral-style mixture-of-experts models teach an important local-model idea: total parameters and active parameters are not the same thing.
