Lesson 1059 of 1570
Mixture of Experts — Why GPT-4 Is Smarter Than It Looks
MoE models route each token to a few 'specialist' sub-networks: the same total size as a dense model, but far cheaper to run per token.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1. The big idea
- 2. Why Mixture-of-Experts Models Like Mixtral Run Cheaper Than They Look
- 3. The big idea
Section 1
The big idea
Most modern frontier models (GPT-4, Mixtral, DeepSeek, Llama 4) are 'mixture of experts' (MoE) models. Instead of one giant brain, they have many smaller 'expert' brains. For each token (roughly, each word) generated, only a few experts activate: a model might hold a trillion parameters in total but only 'use' a few tens of billions per token. Nearly the same quality, at a fraction of the cost per token.
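To make the routing concrete, here is a minimal toy sketch in plain NumPy (hypothetical names and sizes, nothing like a real model's implementation): a small gating layer scores every expert for the incoming token, the top two run, and their outputs are blended by the gate weights.

```python
import numpy as np

def top2_route(token_vec, gate_weights, experts):
    """Send one token through its top-2 experts (toy sketch of MoE routing)."""
    scores = gate_weights @ token_vec              # one routing score per expert
    top2 = np.argsort(scores)[-2:]                 # indices of the 2 highest-scoring experts
    gates = np.exp(scores[top2] - scores[top2].max())
    gates /= gates.sum()                           # softmax over just the chosen experts
    # Only these 2 experts run; the others stay idle for this token.
    return sum(g * experts[i](token_vec) for g, i in zip(gates, top2))

# Toy setup: 8 "experts", each just a random linear map.
rng = np.random.default_rng(0)
d = 16
experts = [(lambda x, W=rng.normal(size=(d, d)): W @ x) for _ in range(8)]
gate_weights = rng.normal(size=(8, d))

print(top2_route(rng.normal(size=d), gate_weights, experts).shape)  # (16,)
```

The gate is just another learned layer: it decides, per token, which specialists are worth waking up.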
Some examples
- Mixtral 8x7B: 8 experts, picks 2 per token. Looks like a 47B model, runs like a 13B.
- GPT-4 is widely believed to be MoE based on inference behavior (OpenAI hasn't confirmed).
- DeepSeek V3 has 671B total parameters but activates only 37B per token.
- Llama 4 went MoE in 2025 to keep up with this efficiency trick.
Try it!
Read the Mixtral paper's intro on Mistral's website. Notice the math: total params vs active params.
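As a back-of-the-envelope version of that math, using the approximate figures quoted in this lesson (not official spec sheets):

```python
# Mixtral 8x7B, rough numbers from this lesson (approximate).
total_params  = 47e9   # all 8 experts plus the shared layers
active_params = 13e9   # what one token actually touches: 2 experts plus the shared layers

print(f"Active fraction per token: {active_params / total_params:.0%}")  # ~28%
# Per-token compute tracks the *active* parameters, so it runs roughly like a 13B dense
# model while the full 47B of stored knowledge is still available to the router.
```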
Section 2
Why Mixture-of-Experts Models Like Mixtral Run Cheaper Than They Look
Section 3
The big idea
Mixture-of-Experts models like Mixtral or DeepSeek have huge total parameter counts but only activate a handful of 'expert' subnetworks per token. The result: capability close to a much larger model at the inference cost of a smaller one.
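One way to put a number on "cheaper": a common rule of thumb is that forward-pass compute is roughly two FLOPs per parameter that actually fires for a token. The comparison below against a dense ~70B model is an assumption for illustration, not a benchmark:

```python
def flops_per_token(params_used):
    # Rule of thumb: forward pass costs ~2 FLOPs per parameter actually used for this token.
    return 2 * params_used

dense_70b = flops_per_token(70e9)   # a dense ~70B model touches all 70B params every token
mixtral   = flops_per_token(13e9)   # Mixtral touches only its ~13B active params (of 47B total)

print(f"~{dense_70b / mixtral:.1f}x less compute per token")  # ~5.4x
```

Real serving cost also depends on memory, bandwidth, and batching, but the active-parameter count is the main lever.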
Some examples
- Mixtral 8x7B has 47B total params but uses only ~13B per token at inference.
- DeepSeek V3's MoE design lets it match GPT-4-class quality at a fraction of the serving cost.
- Mistral's MoE variants run faster than dense models of similar quality on the same GPU.
- Some Llama variants now use MoE to compete with closed models cost-effectively.
Try it!
Read the spec page for Mixtral or DeepSeek V3. Note the gap between total and active parameters. That's the trick.
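To see that gap written out as a ratio, here are the two examples from this lesson again (approximate figures, not official datasheets):

```python
models = {
    "Mixtral 8x7B": (47e9, 13e9),    # (total params, active params per token), approximate
    "DeepSeek V3":  (671e9, 37e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {total/1e9:.0f}B total, {active/1e9:.0f}B active -> {total/active:.1f}x gap")
```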
Related lessons
Keep going
Creators · 11 min
Mixture-of-Experts Models: What MoE Means for Your Latency and Cost
How MoE architecture (Mixtral, DeepSeek, GPT-MoE) changes pricing and behavior.
Creators · 40 min
Mixture-of-Experts Models: Mixtral, DeepSeek, Qwen MoE
How MoE models work and when they're the right choice for your stack.
Creators · 20 min
Mixtral and MoE: Many Experts, Fewer Active Weights
Mixtral-style mixture-of-experts models teach an important local-model idea: total parameters and active parameters are not the same thing.
