Training a frontier model draws the electricity of a small city for months. Serving inference at scale approaches the load of a mid-sized country. Here is what the numbers actually look like.
Every token your favorite AI generates came out of a GPU that was drawing electricity from a grid that was, somewhere upstream, burning something or spinning something. The abstraction is clean. The physics is not.
Training a model takes weeks or months and then stops. Running it serves billions of users every day for years. By 2025, most analyses estimated that inference consumed more energy than training across the industry. The IEA projected that global data center electricity use could reach 945 TWh by 2030, with AI as the fastest-growing slice.
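The training-versus-inference comparison above is easy to sanity-check with arithmetic. The sketch below uses the ~1.3 GWh GPT-3 training figure from this lesson; the per-query energy and daily query volume are illustrative assumptions, not measurements:

```python
# Break-even sketch: one-time training energy vs. ongoing inference energy.
TRAIN_ENERGY_WH = 1.3e9    # ~1.3 GWh for GPT-3-scale training (from the text)
WH_PER_QUERY = 5.0         # assumed mid-range chat turn (text gives ~3-10 Wh)
QUERIES_PER_DAY = 100e6    # assumed daily query volume (hypothetical)

daily_inference_wh = WH_PER_QUERY * QUERIES_PER_DAY
breakeven_days = TRAIN_ENERGY_WH / daily_inference_wh

print(f"Cumulative inference passes training energy after ~{breakeven_days:.1f} days")
```

Under these assumptions the one-time training bill is overtaken in a few days, which is why industry-wide analyses now put inference ahead of training.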
Many data centers rely on evaporative cooling, which consumes enormous amounts of fresh water. A 2023 study estimated that training GPT-3 evaporated about 700,000 liters. Microsoft's water use jumped 34 percent from 2021 to 2022, a rise partly attributed to AI. In drought-prone regions like Arizona, this has become a political fight.
| Activity | Rough energy cost |
|---|---|
| Google search | ~0.3 Wh |
| LLM chat turn | ~3-10 Wh |
| Image generation | ~30-100 Wh |
| Video generation (per second) | ~300-1000 Wh |
| Train GPT-3 once | ~1.3 GWh |
| Train frontier 2025 model | 100+ GWh |
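The per-action figures in the table can be turned into a back-of-envelope personal footprint. The usage pattern and grid carbon intensity below are assumptions chosen for illustration; the per-action energies are midpoints of the table's ranges:

```python
# Rough annual footprint for one heavy user, from the table's per-action costs.
WH = {"chat_turn": 6.5, "image": 65.0, "video_sec": 650.0}  # midpoints of table ranges
DAILY = {"chat_turn": 50, "image": 5, "video_sec": 10}       # assumed daily usage

annual_kwh = sum(WH[k] * DAILY[k] for k in WH) * 365 / 1000
GRID_G_CO2_PER_KWH = 400  # assumed grid intensity; varies widely by region
annual_kg_co2 = annual_kwh * GRID_G_CO2_PER_KWH / 1000

print(f"~{annual_kwh:.0f} kWh/year, ~{annual_kg_co2:.0f} kg CO2e/year")
```

Note how the assumed ten seconds of generated video per day dominates the total, dwarfing fifty chat turns.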
> You cannot compute your way out of thermodynamics. Somebody always pays the electric bill.
>
> — A data center engineer
The big idea: AI has real physical costs that are often hidden behind a clean chat interface. Whether they are worth it is a judgment call — but you cannot make the call without knowing the numbers.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ethics-environmental-cost-builders