MoE models route each token to a few 'specialist' sub-networks: the full model stays huge, but only a fraction of it runs per token, so inference is way more efficient.
Most modern frontier models (Mixtral, DeepSeek, Llama 4, and, it is widely believed, GPT-4) are 'mixture of experts' (MoE). Instead of one giant brain, they have many smaller 'expert' brains, and for each word generated only 2–4 experts activate. A model might have 1 trillion parameters total but only 'use' 30 billion per word. Same quality, way faster.
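To make the routing concrete, here's a toy sketch (a minimal illustration, not any real model's code): a fake router scores 8 experts per token and keeps the top 2, the same 2-of-8 pattern Mixtral uses. The random scores stand in for the small learned gating network a real MoE layer would use.

```python
import random

NUM_EXPERTS = 8  # total experts, as in Mixtral 8x7B
TOP_K = 2        # experts activated per token

def route(token):
    # A real router scores experts with a tiny learned network;
    # random scores stand in for that here.
    scores = [(random.random(), e) for e in range(NUM_EXPERTS)]
    scores.sort(reverse=True)
    return sorted(expert for _, expert in scores[:TOP_K])

for token in ["The", "cat", "sat"]:
    print(f"{token!r} -> experts {route(token)}")
```

Only the two chosen experts run their weights for that token; the other six sit idle, which is where the speedup comes from.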
Read the Mixtral paper's intro on Mistral's website. Notice the math: total params vs active params.
Mixture-of-Experts models like Mixtral or DeepSeek have huge total parameter counts but activate only a handful of 'expert' sub-networks per token: Mixtral 8x7B looks like a 47B model but runs like a 13B. The result is capability close to a much larger model at the inference cost of a smaller one, and it's a big reason models keep getting smarter without inference getting slower.
Read the spec page for Mixtral or DeepSeek V3. Note the gap between total and active parameters. That's the trick.
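To see that gap in numbers, here's a quick back-of-the-envelope check using the approximate figures quoted in this lesson (exact counts vary by source):

```python
# (total, active) parameters in billions; approximate figures from the lesson
models = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek V3": (671, 37),
}

for name, (total, active) in models.items():
    print(f"{name}: {active}B of {total}B active ({active / total:.1%} per token)")
```

DeepSeek V3 pays for roughly a 37B model at inference time while carrying 671B parameters' worth of capacity.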
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-models-mixture-of-experts-r7a8-teen
In a Mixture of Experts AI model, what is an 'expert'?
In Mixtral 8x7B, only 2 experts activate for each token processed. If the model has 8 experts total, what fraction activates per token?
Why is a Mixture of Experts model faster at inference than a traditional 'dense' model of similar quality?
DeepSeek V3 has 671 billion total parameters but activates only 37 billion per token. What is the approximate ratio of active to total parameters?
What does it mean that GPT-4 is 'widely believed' to be a Mixture of Experts model?
A model is described as having '1 trillion parameters total but only using 30 billion per word.' What benefit does this provide?
In the Mixtral example, the model 'looks like a 47B model but runs like a 13B.' What does this comparison mean?
What is the 'efficiency trick' that Mixture of Experts models use?
If an MoE model has many experts but only uses 2–4 per token, what happens to the unused experts?
What would likely happen to inference speed if an MoE model activated ALL its experts for every token?
The lesson states that MoE is the reason models keep getting smarter without inference getting slower. What does 'inference' mean here?
Why do AI companies use Mixture of Experts architecture even though it adds complexity?
Based on the lesson, what pricing trend in 2026 does MoE architecture help explain?
What distinguishes Mixtral 8x7B from a model that is simply '8 times bigger' than a 7B parameter model?
If you were building a new frontier model in 2025 and wanted it to be very capable while keeping inference costs reasonable, what architecture would you likely choose?