In 2024, a new class of models traded fast answers for slow, deliberate thinking, and benchmarks jumped.
In September 2024, OpenAI previewed o1, a model that spent extra compute before answering, generating long internal chains of reasoning. On hard math, coding, and science benchmarks, o1 leapt past GPT-4o, sometimes by double-digit points on tests where progress had been crawling.
The core idea was not prompt-level chain of thought. It was training the model, often through reinforcement learning, to use its own thinking tokens effectively, and then letting it spend as many of those tokens as needed at inference time.
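To make that concrete, here is a minimal sketch of what "spending thinking tokens at inference time" looks like. It is illustrative only: the names (generate_next_token, END_THINKING), the delimiter, and the budget knob are assumptions for this sketch, not any lab's actual API, and the model is stubbed so the example runs.

```python
# A minimal sketch of inference-time compute: the model first emits
# "thinking" tokens into a hidden scratchpad, up to a budget, then
# emits the visible answer. Everything here is hypothetical.

END_THINKING = "</think>"  # assumed delimiter the model is trained to emit


def generate_next_token(context: str) -> str:
    """Stand-in for one decoding step of a trained reasoning model.
    This toy stub 'thinks' once, then answers."""
    return "408" if context.endswith(END_THINKING) else END_THINKING


def answer_with_thinking(prompt: str, thinking_budget: int = 4096) -> str:
    context = prompt
    # Spend up to `thinking_budget` tokens reasoning before answering.
    for _ in range(thinking_budget):
        token = generate_next_token(context)
        if token == END_THINKING:
            break  # the model decided it has thought enough
        context += token
    # The visible answer is conditioned on the hidden chain of thought.
    return generate_next_token(context + END_THINKING)


print(answer_with_thinking("What is 17 * 24?"))  # -> 408
```

The design point worth noticing is that the budget is a runtime knob: the same weights can answer quickly or think for thousands of tokens, which is what moving compute from training time to inference time means in practice.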
Competitors followed quickly. Google's Gemini 2.0 Flash Thinking, DeepSeek's R1 in early 2025, and Anthropic's extended thinking mode all adopted variants of the paradigm. Some published training recipes openly; others kept them secret.
We've developed a new series of AI models designed to spend more time thinking before they respond.
— OpenAI, o1 announcement, 2024
The big idea: reasoning models reopened the scaling frontier by moving compute from training time into inference time. A model that can think longer is a different kind of model.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-history-reasoning-models-builders
What makes OpenAI o1 different from earlier GPT models in how it produces answers?
What is 'inference-time compute'?
Why can a smaller model that thinks longer sometimes outperform a larger model that answers quickly?
What type of training method helps reasoning models learn to use their thinking tokens effectively?
On which type of benchmark did o1 show the most dramatic improvement compared to previous models?
What does the abbreviation RLVR stand for?
Why are multi-step math problems particularly challenging for AI models that don't use deliberate reasoning?
Which of these companies released reasoning model products in response to OpenAI o1?
What does it mean that reasoning models 'reopened the scaling frontier'?
How does o1's chain of thought differ from the 'chain of thought prompting' that users sometimes use with other models?
What is a key advantage of reasoning models for agentic tasks that involve multiple steps?
What makes competitive programming particularly demanding for AI models?
Why did the improvements on hard math benchmarks matter particularly for demonstrating reasoning capabilities?
What did the lesson mean when it said reasoning models traded 'fast answers for slow, deliberate thinking'?
What fundamental shift did reasoning models introduce to AI scaling?