Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Kaplan et al. (2020) showed that LLM loss follows smooth power laws in parameters, data, and compute. Hoffmann et al. (2022, the Chinchilla paper) showed that under a fixed compute budget, you want to balance parameter count N and training tokens D rather than scaling parameters alone. These two papers shaped over a hundred billion dollars of capex.
Loss L as a function of compute C behaves approximately like L(C) ≈ a · C^(−α) for some constants a and α. Analogous power laws hold separately for parameters N and data D, each with its own exponent. The compute exponent α is small (around 0.05 to 0.1), meaning you need large increases in compute to get modest drops in loss.
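A quick illustration of how punishing a small exponent is: under L(C) ≈ a · C^(−α), halving the reducible loss requires multiplying compute by 2^(1/α). The α value below is illustrative, picked from the middle of the range above:

```python
# Sketch: if L(C) = a * C**(-alpha), halving L requires compute
# to grow by 2**(1/alpha). (alpha here is an assumed, illustrative value.)
alpha = 0.07  # within the 0.05-0.1 range cited above
factor = 2 ** (1 / alpha)
print(f"Compute multiplier needed to halve loss: {factor:.1e}")
```

With α = 0.07 that multiplier is on the order of 10^4, which is why each visible quality jump has historically cost orders of magnitude more compute.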
```python
# Rough Chinchilla-optimal split of a compute budget C (in FLOPs).
# Uses the standard approximation C ≈ 6 * N * D plus the paper's
# ~20-tokens-per-parameter rule of thumb (Hoffmann et al., 2022).
def compute_optimal(flops, tokens_per_param=20):
    # From C = 6 * N * D with D = r * N:  N = sqrt(C / (6 * r))
    N = (flops / (6 * tokens_per_param)) ** 0.5
    D = flops / (6 * N)
    return N, D

N, D = compute_optimal(1e24)  # 1e24 FLOPs budget
print(f"Params: {N:.2e}, Tokens: {D:.2e}")
# Around 20 tokens per parameter
```

The rule of thumb that corrected the over-parameterized era:

| Era | Model | Parameters | Tokens per param |
|---|---|---|---|
| Pre-Chinchilla | GPT-3 | 175B | ~1.7 |
| Chinchilla paper | Chinchilla 70B | 70B | ~20 |
| Llama 2 era | Llama 2 70B | 70B | ~30 |
| Llama 3 / frontier | Llama 3 70B | 70B | ~210 |
| Over-training era | Small dense models | 1-8B | hundreds+ |
Many modern models deliberately over-train on data well past the Chinchilla-optimal ratio. The reason: inference cost dominates total cost once a model is deployed to millions of users. A smaller, over-trained model is cheaper to serve and still captures most of the quality.
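The over-training trade-off can be sketched with the standard back-of-envelope approximations (~6·N·D FLOPs to train, ~2·N FLOPs per generated token at inference). The serving volume below is an assumption for illustration, not a real deployment figure:

```python
# Lifetime compute of a deployed model: training plus serving.
# Approximations: train ≈ 6*N*D FLOPs, inference ≈ 2*N FLOPs/token.
# Serving volume (1e13 tokens) is an illustrative assumption.
def lifetime_flops(n_params, train_tokens, served_tokens):
    train = 6 * n_params * train_tokens
    serve = 2 * n_params * served_tokens
    return train + serve

big = lifetime_flops(70e9, train_tokens=1.4e12, served_tokens=1e13)   # Chinchilla-style 70B
small = lifetime_flops(8e9, train_tokens=15e12, served_tokens=1e13)   # over-trained 8B
print(f"70B lifetime: {big:.2e} FLOPs, 8B lifetime: {small:.2e} FLOPs")
```

At high serving volumes the over-trained 8B comes out cheaper in total FLOPs despite its far larger training corpus, which is exactly the economics the paragraph above describes.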
By 2024-2025, pure pretraining scaling showed diminishing returns. Spending the same compute at inference — having the model think longer, sample many answers, verify, backtrack — unlocked further gains. OpenAI's o1 and o3 series, Anthropic's extended thinking, and DeepSeek's R1 all embody this inference-compute scaling law.
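One of the simplest inference-compute recipes is self-consistency: sample many answers and take a majority vote. A minimal sketch, where `model` and `noisy_model` are hypothetical stand-ins for a real sampler:

```python
import random
from collections import Counter

# Minimal sketch of one inference-scaling recipe (self-consistency):
# spend more compute by sampling n answers, then majority-vote.
# `model` is any callable returning a candidate answer string.
def majority_vote(model, prompt, n_samples=16):
    answers = [model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in: returns the right answer 60% of the time per sample.
def noisy_model(prompt):
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))

random.seed(0)
print(majority_vote(noisy_model, "What is 6 * 7?"))
```

Each extra sample costs a full forward pass, so accuracy is bought with inference FLOPs rather than parameters; methods like o1-style extended reasoning spend the extra compute inside a single long chain of thought instead.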
> The exponent is small. The money spent chasing it is not.
>
> — An infra lead at a frontier lab
The big idea: scaling laws made AI progress predictable. The current frontier is learning where the curves bend, and whether new algorithms can steepen them again.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-scaling-and-compute-optimal-training