o1: A Technical Primer — LessWrong

TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which complete a missing piece of the Bitter Lesson and open up a new axis for scaling compute. Following Rush and Ritter (2024) and Brown (2024a, 2024b), I explore four hypotheses for how o1 works and discuss some implications for future scaling and recursive self-improvement.

The Bitter Lesson is that "general methods that leverage computation are ultimately the most effective, and by a large margin." After a decade of scaling pretraining, it's easy to forget this lesson is not just about learning; it's also about search. 

OpenAI didn't forget. Their new "reasoning model" o1 scales search at inference time, and it does so without explicit search algorithms. Instead, o1 is trained via RL to get better at implicit search via chain of thought (CoT). This was the simplest possible way to incorporate search into LLMs, and it worked.

The key consequence is that OpenAI has opened up a new frontier of the Bitter Lesson: test-time scaling (see figure above). The original scaling laws taught us how to exchange training-time compute for better predictions. These new test-time scaling laws teach us how to exchange inference-time compute for better decisions.
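
To make "exchanging inference-time compute for better decisions" concrete, here is a toy sketch. It is not a description of how o1 works: o1 is said to rely on implicit search via CoT, whereas this example swaps in plain majority voting over independent samples as a stand-in. The function name `majority_vote_accuracy`, the binary-answer setup, and the per-sample accuracy of 0.6 are all assumptions chosen for illustration.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Toy assumptions (not from the original post): each sample answers a
    binary question, is correct with probability p_correct, and samples
    are independent. Use odd n_samples to avoid ties.
    """
    needed = n_samples // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # More samples means more inference-time compute spent per question.
    for n in (1, 3, 9, 27, 81):
        print(f"samples={n:3d}  accuracy={majority_vote_accuracy(0.6, n):.3f}")
```

With a per-sample accuracy of 0.6, one sample is right 60% of the time and three samples with a majority vote are right 64.8% of the time; accuracy keeps climbing as more samples are drawn. The shape of that curve, not the mechanism, is the point: spending more compute at inference buys a better final decision.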
