On Friday, OpenAI announced their new o3 reasoning model, which solved an extremely impressive 366/400 tasks in the ARC-AGI-Pub dataset. I published a visualization of the tasks that o3 couldn’t solve; you can see those examples, and get a feel for the hardest ARC tasks, here:
There were a number of different failure modes visible in that data, but as some of you observed, in many cases o3 appeared to be struggling with the data format rather than with understanding the task itself.
I’d like to draw attention to a pattern I’ve noticed since we started looking at the first generation of LLMs last year, and which might explain some of what we’ve seen on ARC this year.
One of the unique aspects of ARC is that the tasks come in many shapes and sizes - each has a variable number of examples and a rectangular output size between 1x1 and 30x30 pixels. This is in service of its goal to present 400 unique reasoning challenges, each problem requiring a different intuition drawn from roughly the same set of core knowledge priors.
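To make that concrete, here is a minimal sketch of what an ARC task looks like on disk, assuming the standard public JSON layout; the file path is illustrative, not a specific task referenced in this post:

```python
import json

# Each ARC task is a JSON object with "train" and "test" lists of
# {"input": grid, "output": grid} pairs, where a grid is a list of rows
# of integers 0-9 (colors). The file name below is just a placeholder.
with open("data/evaluation/0a1d4ef5.json") as f:
    task = json.load(f)

for pair in task["train"]:
    rows, cols = len(pair["output"]), len(pair["output"][0])
    print(f"training output grid: {rows}x{cols} pixels")
```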
o3 does extremely well at ARC overall, but it gets interesting when we stratify ARC solve rates by problem size (total number of pixels in the training grids¹):
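For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how one might bucket tasks by total training-grid pixels; the `solved_ids` set and the 500-pixel bin width are illustrative assumptions, not the exact methodology behind the numbers shown in this post:

```python
import json
import glob
from collections import defaultdict

def training_pixels(task):
    """Total number of pixels across all training input and output grids."""
    return sum(
        len(grid) * len(grid[0])
        for pair in task["train"]
        for grid in (pair["input"], pair["output"])
    )

# Hypothetical set of task IDs that o3 answered correctly,
# e.g. loaded from a results file.
solved_ids = set()

buckets = defaultdict(lambda: [0, 0])  # pixel bin -> [solved, total]
for path in glob.glob("data/evaluation/*.json"):
    task_id = path.split("/")[-1].removesuffix(".json")
    with open(path) as f:
        task = json.load(f)
    bucket = training_pixels(task) // 500 * 500  # group into 500-pixel bins
    buckets[bucket][1] += 1
    buckets[bucket][0] += task_id in solved_ids

for size, (solved, total) in sorted(buckets.items()):
    print(f"{size:>5}-{size + 499} px: {solved}/{total} solved")
```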