A lot has been speculated about how the o1 (and now o3) family of models came to be. From the information OpenAI made available, it seems like they are running some type of self-improvement loop, since inference cost seems to scale closely with performance. That's vital for our quest toward AGI because it follows the most important lesson from previous successful models, like AlphaGo: in order to reach superhuman performance, the model should be capable of generating its own training signal.
The following is a description of an algorithm that may achieve something similar to what OpenAI built. A very bold claim, I know, but the idea seems sound and not that complicated. So if it is bullshit, at least it will be easy to verify.
Integral to this idea is this work from Google. The TLDR is that if you sample a reasonably well-trained LLM enough times, the chances are high that you will eventually come up with the correct answer. This is the first insight: the LLM can be used to perform something akin to Monte Carlo simulations.
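To make that insight concrete, here is a minimal sketch of such an estimate. The `sample_completion` and `is_correct` callables are hypothetical placeholders for an actual LLM sampling call and an answer checker, not any particular API:

```python
def estimate_success_rate(sample_completion, is_correct, prompt, n_samples=64):
    """Monte Carlo estimate of how often the model reaches the correct answer.

    `sample_completion(prompt)` and `is_correct(completion)` are hypothetical
    stand-ins for an LLM sampling call and an answer checker.
    """
    hits = sum(1 for _ in range(n_samples) if is_correct(sample_completion(prompt)))
    return hits / n_samples
```

With enough samples, this fraction approximates the probability that a single rollout from the prompt ends in the correct answer.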
Now, it is one thing to say that the correct answer is reachable; it is another to reach it under real-world constraints. So what we need is to bias the model toward the reasoning steps most likely to lead to the correct conclusion. How do we do that? Well, since an LLM can act like an unbiased estimator of the value of the current state (by counting how many times the LLM reaches the correct conclusion when instantiated at that point), we just need to generate multiple variations of each step and keep the one that gives us the highest likelihood of getting the correct answer.
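A minimal sketch of that selection step might look like the following. Again, `sample_step`, `rollout`, and `is_correct` are hypothetical placeholders rather than a real API:

```python
def pick_best_step(sample_step, rollout, is_correct, state,
                   n_candidates=8, n_rollouts=32):
    """Greedy step selection guided by Monte Carlo value estimates.

    Hypothetical callables:
      sample_step(state) -> a candidate next reasoning step (string)
      rollout(state)     -> a full completion continued from `state`
      is_correct(answer) -> True if the completion ends in the correct answer
    """
    best_step, best_value = None, -1.0
    for _ in range(n_candidates):
        step = sample_step(state)
        candidate_state = state + step
        # Value of the candidate = fraction of rollouts from it that succeed.
        value = sum(1 for _ in range(n_rollouts)
                    if is_correct(rollout(candidate_state))) / n_rollouts
        if value > best_value:
            best_step, best_value = step, value
    return best_step, best_value
```

Repeating this selection step by step yields a reasoning trace that is biased, at every point, toward states with the highest estimated probability of reaching the correct answer.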