Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Kaplan et al. (2020) showed that LLM loss follows smooth power laws in parameters, data, and compute. Hoffmann et al. (2022, the Chinchilla paper) showed that under a fixed compute budget, you want to balance parameter count N and training tokens D rather than scaling parameters alone. These two papers shaped over a hundred billion dollars of capex.
Loss L as a function of compute C behaves approximately like L(C) ≈ a · C^(−α) for some constants a and α. Analogous power laws hold separately for parameters N and data D, each with its own exponent. The compute exponent α is small (around 0.05 to 0.1), meaning you need large increases in compute to get modest drops in loss.
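A quick illustration of how punishing a small exponent is: under L(C) ≈ a · C^(−α), halving the reducible loss requires multiplying compute by 2^(1/α). The α value below is illustrative, picked from the middle of the range above:

```python
# Sketch: if L(C) = a * C**(-alpha), halving L requires compute
# to grow by 2**(1/alpha). (alpha here is an assumed, illustrative value.)
alpha = 0.07  # within the 0.05-0.1 range cited above
factor = 2 ** (1 / alpha)
print(f"Compute multiplier needed to halve loss: {factor:.1e}")
```

With α = 0.07 that multiplier is on the order of 10^4, which is why each visible quality jump has historically cost orders of magnitude more compute.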
```python
# Rough Chinchilla-optimal split of a compute budget C (in FLOPs).
# Uses the standard approximation C ≈ 6 * N * D plus the paper's
# ~20-tokens-per-parameter rule of thumb (Hoffmann et al., 2022).
def compute_optimal(flops, tokens_per_param=20):
    # From C = 6 * N * D with D = r * N:  N = sqrt(C / (6 * r))
    N = (flops / (6 * tokens_per_param)) ** 0.5
    D = flops / (6 * N)
    return N, D

N, D = compute_optimal(1e24)  # 1e24 FLOPs budget
print(f"Params: {N:.2e}, Tokens: {D:.2e}")
# Around 20 tokens per parameter
```

The rule of thumb that corrected the over-parameterized era:

| Era | Model | Parameters | Tokens per param |
|---|---|---|---|
| Pre-Chinchilla | GPT-3 | 175B | ~1.7 |
| Chinchilla paper | Chinchilla 70B | 70B | ~20 |
| Llama 2 era | Llama 2 70B | 70B | ~30 |
| Llama 3 / frontier | Llama 3 70B | 70B | ~210 |
| Over-training era | Small dense models | 1-8B | hundreds+ |
Many modern models deliberately over-train on data well past the Chinchilla-optimal ratio. The reason: inference cost dominates total cost once a model is deployed to millions of users. A smaller, over-trained model is cheaper to serve and still captures most of the quality.
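The over-training trade-off can be sketched with the standard back-of-envelope approximations (~6·N·D FLOPs to train, ~2·N FLOPs per generated token at inference). The serving volume below is an assumption for illustration, not a real deployment figure:

```python
# Lifetime compute of a deployed model: training plus serving.
# Approximations: train ≈ 6*N*D FLOPs, inference ≈ 2*N FLOPs/token.
# Serving volume (1e13 tokens) is an illustrative assumption.
def lifetime_flops(n_params, train_tokens, served_tokens):
    train = 6 * n_params * train_tokens
    serve = 2 * n_params * served_tokens
    return train + serve

big = lifetime_flops(70e9, train_tokens=1.4e12, served_tokens=1e13)   # Chinchilla-style 70B
small = lifetime_flops(8e9, train_tokens=15e12, served_tokens=1e13)   # over-trained 8B
print(f"70B lifetime: {big:.2e} FLOPs, 8B lifetime: {small:.2e} FLOPs")
```

At high serving volumes the over-trained 8B comes out cheaper in total FLOPs despite its far larger training corpus, which is exactly the economics the paragraph above describes.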
By 2024-2025, pure pretraining scaling showed diminishing returns. Spending the same compute at inference — having the model think longer, sample many answers, verify, backtrack — unlocked further gains. OpenAI's o1 and o3 series, Anthropic's extended thinking, and DeepSeek's R1 all embody this inference-compute scaling law.
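One of the simplest inference-compute recipes is self-consistency: sample many answers and take a majority vote. A minimal sketch, where `model` and `noisy_model` are hypothetical stand-ins for a real sampler:

```python
import random
from collections import Counter

# Minimal sketch of one inference-scaling recipe (self-consistency):
# spend more compute by sampling n answers, then majority-vote.
# `model` is any callable returning a candidate answer string.
def majority_vote(model, prompt, n_samples=16):
    answers = [model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in: returns the right answer 60% of the time per sample.
def noisy_model(prompt):
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))

random.seed(0)
print(majority_vote(noisy_model, "What is 6 * 7?"))
```

Each extra sample costs a full forward pass, so accuracy is bought with inference FLOPs rather than parameters; methods like o1-style extended reasoning spend the extra compute inside a single long chain of thought instead.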
> The exponent is small. The money spent chasing it is not.
>
> — An infra lead at a frontier lab
The big idea: scaling laws made AI progress predictable. The current frontier is learning where the curves bend, and whether new algorithms can steepen them again.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-scaling-and-compute-optimal-training