Mixture of Experts — Why GPT-4 Is Smarter Than It Looks
MoE models route each token to a few 'specialist' sub-networks, so only a fraction of the model's parameters do work at any moment.
Builders · Model Families · ~16 min read
The big idea
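Many recent frontier models (Mixtral, DeepSeek V3, Llama 4, and reportedly GPT-4) use a 'mixture of experts' (MoE) design. Instead of one giant brain, each MoE layer holds many smaller 'expert' networks, and a small router decides which of them handle each token (a token is a word or a piece of a word). Only a few experts activate per token: Mixtral, for example, picks 2 of 8. So a model can hold hundreds of billions of parameters in total while using only a small share of them for any one token; DeepSeek V3 has 671B in total but activates about 37B. The payoff is quality closer to a big model at a per-token compute cost closer to a small one.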
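To make the routing idea concrete, here is a minimal sketch of top-2 routing in plain Python. It is a toy, not any real model's implementation: the eight "experts" are stand-in functions that just multiply their input, the router scores are made-up numbers, and the names `route_token` and `moe_layer` are hypothetical. In a real model the experts are neural network layers and the scores come from a learned router.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k=2):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route_token(router_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    used = [i for i, _ in chosen]
    return output, used

# Eight toy experts (assumption: each just scales its input by 1..8).
experts = [lambda x, s=s: s * x for s in range(1, 9)]

# Made-up router scores for one token; a real router computes these.
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

output, used = moe_layer(10.0, experts, router_scores, k=2)
print(used)    # [1, 4]  -> only 2 of the 8 experts ran
print(output)  # a weighted blend of the two experts' outputs
```

The other six experts are never called for this token, which is where the compute savings come from. A different token would get different router scores and might land on a different pair.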
Some examples
- Mixtral 8x7B: 8 experts per layer, picks 2 per token. About 47B parameters in total, but it runs like a 13B model.
- GPT-4 is widely believed to be MoE, based on unconfirmed reports (OpenAI hasn't published its architecture).
- DeepSeek V3 has 671B total parameters but activates only 37B per token.
- Llama 4 (2025) moved to MoE: Meta's Scout and Maverick models both use it.
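The total-versus-active gap in the list above is easy to check with a little arithmetic. This short sketch uses the rounded figures quoted in this lesson (in billions of parameters); the helper name `active_fraction` is just for illustration.

```python
# (total, active) parameter counts in billions, as quoted in this lesson.
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek V3": (671, 37),
}

def active_fraction(total_b, active_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active per token ({share:.0%})")

# Mixtral 8x7B: 13B of 47B active per token (28%)
# DeepSeek V3: 37B of 671B active per token (6%)
```

Note that Mixtral's total is about 47B rather than 8 x 7 = 56B: only the feed-forward blocks are split into experts, while the attention layers are shared. All of the parameters still have to sit in memory; the savings are in how much computation each token needs.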
Try it!
Read the introduction of the Mixtral paper ("Mixtral of Experts") or the announcement post on Mistral's website. Notice the math: total params vs. active params.
Practice this safely
Try this with a school, hobby, or family example where the stakes are low. Use the AI output as a draft you can question, not as the final answer.
1. Ask AI to explain mixture of experts in plain language, then underline anything that sounds uncertain or too broad.
2. Give it one detail from this lesson and ask for two possible next steps plus one reason each step might be wrong.
3. Check any claim about a model's architecture against a trusted source, teacher, adult, expert, or original document before you rely on it.
