Mixture of Depths: How AI Models Spend Compute Per Token
Mixture-of-depths lets models skip transformer layers per token, spending compute where it matters; understanding it helps you evaluate efficiency claims honestly.
33 min · Reviewed 2026
The premise
Mixture-of-depths trains a router to skip transformer layers on easy tokens, spending compute where input difficulty actually demands it.
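The routing idea can be sketched in a few lines. This is a minimal toy, not the published algorithm: hidden states are scalars, the "layer" is a plain function, and `mod_layer` and `capacity` are hypothetical names. The core mechanism shown is real, though: a learned per-token score ranks tokens, only the top fraction pass through the expensive layer, and the rest skip via the residual path.

```python
def mod_layer(tokens, scores, layer_fn, capacity=0.5):
    """Mixture-of-depths sketch (hypothetical helper, not from any library):
    the top-`capacity` fraction of tokens, ranked by router score, pass
    through `layer_fn`; the rest skip the layer via the residual path."""
    k = max(1, int(capacity * len(tokens)))
    # indices of the k highest-scoring ("difficult") tokens
    chosen = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    out = list(tokens)
    for i in chosen:
        out[i] = tokens[i] + layer_fn(tokens[i])  # residual update for routed tokens
    return out, chosen

# toy usage: scalar "hidden states", a doubling "layer", fake router scores
hidden = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]  # router thinks tokens 1 and 3 are hard
out, routed = mod_layer(hidden, scores, lambda h: 2 * h, capacity=0.5)
# tokens 0 and 2 pass through unchanged; only half the layer calls are made
```

With a capacity of 0.5, average compute per token drops by roughly half at this layer, which is exactly the efficiency lever the technique targets.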
What AI does well here
Reduce average compute per token while preserving downstream quality
Concentrate compute on tokens with high routing-uncertainty
Compose with mixture-of-experts for additional efficiency gains
What AI cannot do
Reliably match dense-model quality on adversarial, tail-of-distribution inputs
Avoid additional engineering complexity in serving systems
Replace the need for careful router-training data
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-foundations-ai-mixture-of-depths-r8a4-creators
In a mixture-of-depths architecture, what determines whether a particular token traverses all transformer layers or skips some?
A dedicated router network that learns to predict which layers are unnecessary for each token
A fixed pattern that alternates between active and skipped layers
The total compute budget allocated before inference begins
The length of the input sequence measured in tokens
Why do researchers recommend including p95 and p99 latency metrics when evaluating mixture-of-depths efficiency claims?
They measure the total energy consumption of the model
Average latency can hide cases where certain tokens require significantly more compute, creating tail-latency surprises
These metrics are required by academic publication standards
Average latency alone already captures all relevant performance characteristics
What is a degenerate mixture-of-depths model and how does it occur?
A model that only processes the first token in each sequence
A model that fails to generate coherent output
A model where the router consistently skips either almost no layers or almost every layer, losing the efficiency benefits
A model with fewer parameters than its dense counterpart
Mixture-of-depths can be combined with mixture-of-experts (MoE) to achieve additional efficiency gains. What characteristic do both techniques share?
Both can only be applied to the attention mechanism, not feed-forward networks
Both require doubling the number of parameters in the model
Both eliminate the need for any routing mechanisms
Both selectively activate only a portion of the model's total capacity for each input
On which types of tokens does a mixture-of-depths router typically concentrate compute?
Tokens that are identical to previously processed tokens
Tokens that appear at the beginning of every input sequence
Tokens where the router has high uncertainty about which layers to apply
Tokens containing numerical data only
What limitation of mixture-of-depths is most relevant when deploying models in real-time applications with strict latency deadlines?
The inability to process more than 512 tokens in a single sequence
The need for larger memory buffers to store routing decisions
The requirement for specialized hardware that most servers lack
The difficulty in guaranteeing consistent latency due to variable layer skipping per token
What training signal is recommended as a first-class metric when training a mixture-of-depths router?
The temperature setting used during sampling
The number of tokens processed per second
The total loss of the model during backpropagation
Routing entropy, which measures the unpredictability of the router's layer selections
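Routing entropy is easy to monitor in practice. The sketch below (a hypothetical monitoring helper, not from any framework) computes the Shannon entropy of observed routing decisions: near-zero entropy means the router has collapsed to a single choice, i.e. a degenerate router, while higher entropy indicates diverse layer-skipping decisions.

```python
import math
from collections import Counter

def routing_entropy(decisions):
    """Shannon entropy (in bits) of a stream of routing decisions.
    0.0 means the router always picks the same path (degenerate);
    log2(num_paths) is the maximum, reached by a uniform mix of paths."""
    counts = Counter(decisions)
    total = len(decisions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# a healthy router mixes paths; a degenerate one collapses to a single path
healthy = routing_entropy(["skip", "full", "skip", "full"])   # 1.0 bit
degenerate = routing_entropy(["skip"] * 100)                  # 0.0 bits
```

Tracking this value during training makes the failure mode named in the question observable before it shows up as a quality regression.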
Which statement accurately describes what mixture-of-depths cannot reliably achieve, even when working correctly?
Replacing the need for any model training data
Matching dense-model quality on adversarial inputs drawn from the tail of the distribution
Eliminating all engineering complexity from the serving infrastructure
Processing all tokens with exactly the same number of FLOPs
What engineering challenge is introduced specifically by adding mixture-of-depths to a serving system?
Reduced compatibility with existing model serialization formats
Increased difficulty in batching multiple requests together
Additional complexity in managing variable-length per-token computation paths
The need to store routing decisions for regulatory compliance
A researcher claims their mixture-of-depths model uses 40% less compute on average while maintaining equal quality to a dense model. What additional information would you need to critically evaluate this claim?
The programming language used to implement the router
The number of tokens in their test vocabulary
The exact color scheme used in their demo videos
Latency distributions including p95 and p99 values to check for tail-latency issues
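Why p95/p99 matter is easiest to see on synthetic data. Below is a minimal nearest-rank percentile sketch (the `percentile` helper and the numbers are illustrative, not from any benchmark): a workload where 5% of requests hit the slow path looks fine on average but terrible at the tail.

```python
def percentile(samples, p):
    """Nearest-rank percentile -- a minimal sketch for checking tail
    latency, not a production statistics routine."""
    ranked = sorted(samples)
    idx = int(round(p / 100 * len(ranked))) - 1
    return ranked[max(0, min(len(ranked) - 1, idx))]

# per-request latencies (ms): mostly fast, with a heavy tail from tokens
# the router sends through every layer
latencies = [10] * 95 + [200] * 5
mean = sum(latencies) / len(latencies)   # 19.5 ms -- looks healthy
p99 = percentile(latencies, 99)          # 200 ms -- the tail tells the truth
```

A "40% less compute on average" claim is consistent with both a flat latency profile and this spiky one; only the percentiles distinguish them.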
What would monitoring a router's routing entropy reveal about the model's behavior?
The exact energy consumption of each inference request
The speed of token generation during inference
Whether the router is making diverse layer-skipping decisions or has become degenerate
The total number of parameters in the model
In mixture-of-depths, what distinguishes 'easy' tokens from 'difficult' tokens from the router's perspective?
The alphabetical position of the first letter in the token
The absolute length of the word or subword in the token
The presence or absence of punctuation in the token
The router's learned ability to predict which tokens can be processed with fewer layers without quality loss
What is the primary efficiency metric that mixture-of-depths aims to reduce?
The latency of the very first token in any sequence
The size of the training dataset required
The total number of model parameters
Average compute per token while preserving downstream output quality
What happens when a mixture-of-depths router encounters inputs from the tail of the distribution (rare or adversarial cases)?
The model automatically switches to a different architecture
The router automatically routes all such tokens through every layer
The router may make poor layer-skipping decisions, potentially reducing quality compared to a dense model
The router crashes and returns an error message
Which architectural concept is most closely related to mixture-of-depths?
Convolutional neural networks that use sliding window operations
Conditional compute, where computation is dynamically allocated based on input characteristics
Long short-term memory networks with gating mechanisms
Recurrent neural networks that process sequences one element at a time