Lesson 5 of 21 · 16 min
Scaling Laws and Compute-Optimal Training
Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. From Empirical Curve to Strategic Doctrine
2. Scaling laws
3. Chinchilla
4. FLOPs
Section 1
From Empirical Curve to Strategic Doctrine
Kaplan et al. (2020) showed that LLM loss follows smooth power laws in parameters, data, and compute. Hoffmann et al. (2022, the Chinchilla paper) showed that under a compute budget, you want to balance N and D rather than scaling parameters alone. These two papers shaped over a hundred billion dollars of capex.
The Kaplan power law
Loss L as a function of compute C behaves approximately as L(C) ≈ a · C^(-α) for fitted constants a and α; analogous power laws hold separately for parameter count N and dataset size D, each with its own exponent. The exponent α is small (roughly 0.05 to 0.1), so large increases in compute buy only modest drops in loss.
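To get a feel for how flat this curve is, here is a quick sketch. The value α = 0.07 is illustrative, picked from the middle of the range above, not a fitted constant:

```python
def loss(compute, a=1.0, alpha=0.07):
    # Illustrative Kaplan-style power law: L(C) = a * C^(-alpha)
    return a * compute ** -alpha

# How much does a 10x compute increase help?
ratio = loss(10 * 1e21) / loss(1e21)
print(f"10x compute -> loss multiplied by {ratio:.3f}")
# With alpha = 0.07, ten times the budget cuts loss by only ~15%
```

Because the ratio depends only on the multiplier (10^(-α)), the same ~15% applies at every scale, which is exactly why each successive gain costs an order of magnitude more.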
The rule of thumb that corrected the over-parameterized era.
# Rough Chinchilla-optimal scaling
def compute_optimal(flops):
    # Hoffmann et al. (2022): roughly 20 optimal tokens per parameter,
    # with training compute C ≈ 6 * N * D. Substituting D = 20 * N
    # gives C ≈ 120 * N^2, so N ≈ sqrt(C / 120).
    N = (flops / 120) ** 0.5
    D = flops / (6 * N)
    return N, D

N, D = compute_optimal(1e24)  # 1e24 FLOPs budget
print(f"Params: {N:.2e}, Tokens: {D:.2e}")
# Around 20 tokens per parameter

Why Chinchilla mattered
Compare the options
| Era | Model | Parameters | Tokens per param |
|---|---|---|---|
| Pre-Chinchilla | GPT-3 | 175B | ~1.7 |
| Chinchilla paper | Chinchilla 70B | 70B | ~20 |
| Llama 2 era | Llama 2 70B | 70B | ~30 |
| Llama 3 / frontier | Llama 3 70B | 70B | ~210 |
| Over-training era | Small dense models | 1-8B | hundreds+ |
Many modern models deliberately over-train on data well past the Chinchilla-optimal ratio. The reason: inference cost dominates total cost once a model is deployed to millions of users. A smaller, over-trained model is cheaper to serve and still captures most of the quality.
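The tradeoff can be sketched with back-of-envelope FLOP counts. The ~6ND training and ~2N-per-token inference approximations are standard; the serving volume below is a made-up number for illustration:

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    # ~6 FLOPs per parameter per training token,
    # ~2 FLOPs per parameter per token generated at inference
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 1e13  # hypothetical: 10T tokens served over the deployment's life

big   = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small = lifetime_flops(8e9, 15e12, served)    # heavily over-trained 8B

print(f"70B lifetime: {big:.2e} FLOPs, 8B lifetime: {small:.2e} FLOPs")
```

At this serving volume the over-trained 8B model costs less in total FLOPs than the 70B model, even though it trained on ten times as many tokens; the larger the serving volume, the more the balance tilts toward small.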
The pivot to test-time compute
By 2024-2025, pure pretraining scaling showed diminishing returns. Spending the same compute at inference — having the model think longer, sample many answers, verify, backtrack — unlocked further gains. OpenAI's o1 and o3 series, Anthropic's extended thinking, and DeepSeek's R1 all embody this inference-compute scaling law.
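One simple form of inference-compute scaling is best-of-N sampling: draw several candidate answers and keep the one a verifier scores highest. A minimal sketch, where `sample_answer` and `score` are stand-ins for a stochastic model call and a learned verifier, not real APIs:

```python
import random

def sample_answer(prompt):
    # Stand-in for a stochastic model call (temperature > 0)
    return f"answer-{random.randint(0, 9)}"

def score(prompt, answer):
    # Stand-in for a learned verifier or reward model
    return random.random()

def best_of_n(prompt, n=8):
    # Spending more inference compute (larger n) raises the expected
    # score of the returned answer
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("What is 7 * 8?", n=8))
```

Production systems layer more on top (chains of thought, backtracking, majority voting), but the core economics are the same: quality becomes a knob you turn with inference FLOPs rather than parameters.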
Bottlenecks beyond FLOPs
- Data: the stock of public high-quality text has largely been scraped already; multimodal, synthetic, and licensed data are filling the gap.
- Energy: frontier training runs are approaching gigawatt-scale power demands.
- Talent: the real throttle on how many frontier labs can exist.
- Alignment: as capability scales, so does the cost of shaping model behavior.
What this means for your work
1. If you fine-tune, start with a compute-optimal base, not the largest
2. Benchmarks at one size often do not predict another size
3. Budget test-time compute explicitly in your product design
4. Watch for regimes where smaller, smarter beats bigger, dumber
“The exponent is small. The money spent chasing it is not.”
The big idea: scaling laws made AI progress predictable. The current frontier is learning where the curves bend, and whether new algorithms can steepen them again.
Related lessons
Keep going
Creators · 35 min
How Chatbot Arena Works
The world's most influential 'leaderboard' for AI is not a test — it is humans voting blindly. Here is how that works.
Creators · 38 min
Benchmark Contamination
When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Creators · 35 min
Multimodal Benchmarks
Evaluating models that see, hear, and read at once requires new kinds of tests. Here are the ones that matter.
