Hot takes on o3
Tom Hipwell

2024-12-23

Everywhere seems to be full of hype around o3 since Friday's announcement from OpenAI, so I thought I'd summarise a few points I've seen shared in various places but not yet gathered in one place. We're going to zoom in mostly on the ARC-AGI results, as I think that is the most interesting part. Before we do that, let's introduce the ARC challenge.

ARC (the Abstraction and Reasoning Corpus) was created by François Chollet, author of both Deep Learning with Python and the Keras framework (and ex-Google). The intent behind the benchmark was to set a "North Star" milestone for AGI. The test is not intended to answer "AGI achieved y/n?" but instead plots a waypoint on the course towards AGI.

To do this, the big idea behind the test was not to measure skill (completing a bar exam, for example) but to measure intelligence. Skill is not a good proxy for intelligence because it is heavily influenced by prior knowledge and experience. If our benchmarks measure skill, it is quite easy to create an illusion of intelligence by collecting loads of training data and then training the model as a compression of that data. If we focus too much on skill we end up with a form of AI that generalises poorly: the AI will not be able to acquire new skills outside of its training data. A lot of benchmarks used to test humans are easily breached by AI because they really target knowledge retention. As an example, a human domain expert gets 89% on the MMLU benchmark, whereas Claude 3.5 Sonnet gets 88%.

To measure intelligence more effectively, the ARC test was designed to be hard for AI to pass but easy for humans. To do this, it relies on very little prior knowledge (mostly a bit of geometry and the ability to count) and instead focuses on flexibility and adaptability; the ability to acquire a new skill rather than use an existing one, something that humans are very good at (worth noting, ARC is also a human-oriented benchmark). The ARC tasks have high generalisation difficulty: there's a lot of uncertainty about how to solve each question because all of the knowledge required to solve a task is contained only in the task itself.
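To make that concrete, here is a minimal sketch of what an ARC task looks like. The overall shape (a JSON object with "train" demonstration pairs and "test" pairs, where each grid is a 2D list of integers 0-9, one integer per coloured cell) matches the public ARC dataset; the specific task below is a toy example invented for illustration, not one from the real corpus.

import json

# A toy ARC-style task. Real tasks live as JSON files in the public ARC
# repository; this one is made up to show the structure. The hidden rule
# here is simply "swap the two columns".
task = json.loads("""
{
  "train": [
    {"input":  [[0, 1], [1, 0]],
     "output": [[1, 0], [0, 1]]},
    {"input":  [[2, 0], [0, 2]],
     "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]]}
  ]
}
""")

# A solver sees only the train pairs and must infer the transformation,
# then apply it to the test input to predict the test output.
for pair in task["train"]:
    print(pair["input"], "->", pair["output"])

Everything a solver needs is in those few demonstration pairs; there is no larger training set to memorise, which is exactly what gives the tasks their generalisation difficulty.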
