Almost a year ago, OpenAI introduced o1, the world’s first “reasoning model”. Compared to its likely predecessor GPT-4o, o1 is optimized much more heavily for multi-step reasoning when solving problems. So it’s perhaps no surprise that it performs much better on common math and science benchmarks.
o1 performs far better than GPT-4o on GPQA Diamond (PhD-level multiple-choice science questions) and MATH Level 5 (high-school math competition problems).1 Data is taken from Epoch AI’s benchmarking hub.
By itself, this performance improvement was already a big deal. But even more important was how it was achieved: not by using a lot more training compute, but as the byproduct of a major algorithmic innovation. o1 went through a period of “reasoning training”, in which its chain-of-thought was fine-tuned on reasoning traces and optimized with reinforcement learning. This allows the model to spend more time “reasoning” before responding to user queries.
But how can we quantify the importance of this algorithmic innovation? One way is to express it as a hypothetical increase in training compute. Specifically, we exploit the empirically robust relationship that more pre-training compute leads to better benchmark performance. This relationship allows us to ask a key question: how much more pre-training compute would GPT-4o have needed to match o1’s performance?2 We call this the “compute-equivalent gain” (CEG), and it is a standard way of quantifying the contribution from algorithmic innovations.
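To make the idea concrete, here is a minimal sketch of how a CEG estimate could be computed, assuming benchmark accuracy is modeled as a logistic function of log pre-training compute. The compute figures, benchmark scores, and the specific curve form below are illustrative assumptions for this sketch, not the actual data or fitting procedure used in the analysis.

```python
# Minimal sketch of a compute-equivalent gain (CEG) estimate.
# All numbers are illustrative placeholders, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log10_compute, midpoint, slope):
    """Benchmark accuracy modeled as a logistic function of log10(pre-training FLOP)."""
    return 1.0 / (1.0 + np.exp(-slope * (log10_compute - midpoint)))

# Hypothetical (log10 FLOP, accuracy) points for a series of pre-trained models.
log10_flop = np.array([23.0, 24.0, 25.0, 25.8])
accuracy   = np.array([0.15, 0.30, 0.50, 0.56])

# Fit the scaling curve to the hypothetical data.
(midpoint, slope), _ = curve_fit(sigmoid, log10_flop, accuracy, p0=[25.0, 1.0])

def required_log10_flop(target_accuracy):
    """Invert the fitted curve: pre-training compute needed to reach a given accuracy."""
    return midpoint - np.log(1.0 / target_accuracy - 1.0) / slope

gpt4o_log10_flop = 25.8   # placeholder for GPT-4o's pre-training compute
o1_accuracy      = 0.78   # placeholder for o1's benchmark score

# CEG = (compute GPT-4o would need to match o1) / (compute GPT-4o actually used)
ceg = 10 ** (required_log10_flop(o1_accuracy) - gpt4o_log10_flop)
print(f"Compute-equivalent gain: {ceg:.1f}x")
```

The key step is inverting the fitted compute-to-performance curve at o1’s score and dividing by GPT-4o’s actual pre-training compute; the ratio is the CEG.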