
o3: Stratospheric reasoning

submitted by

Style Pass

2024-12-22 20:30:26

Image above generated by AI for this analysis (Imagen 3-002). Image generated in a few seconds, on 22 December 2024, via Imagen 3-002, text prompt by Alan D. Thompson, ‘a zoomed out background header for ozone in the stratosphere, with lowercase title “o3”, otherworldly colors.’

GPQA Diamond = 87.7% (o1 = 78.3%)
AIME 2024 = 96.7% (only one question wrong)
Codeforces = 99.8th percentile (score = 2727; o1 = 94th percentile, score = 1891)
SWE-bench Verified = 71.7% (o1 = 48.9%)
FrontierMath = 25.2% (o1 = 2%)
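As a quick sanity check on the figures above, the gap between o3 and o1 can be worked out in percentage points. This is only an illustrative sketch: the scores are copied from the text, the helper name `gains` is hypothetical, and the AIME check assumes the 2024 exam set totals 30 questions (AIME I plus AIME II), which the text does not state.

```python
# Scores as quoted in the text above, in percent: (o3, o1).
scores = {
    "GPQA Diamond": (87.7, 78.3),
    "SWE-bench Verified": (71.7, 48.9),
    "FrontierMath": (25.2, 2.0),
}


def gains(results):
    """Return the o3-minus-o1 gap in percentage points for each benchmark."""
    return {name: round(o3 - o1, 1) for name, (o3, o1) in results.items()}


# Assumption (not stated in the text): AIME 2024 has 30 questions in total.
AIME_QUESTIONS = 30
# One question wrong means 29 of 30 correct.
aime_score = round(100 * (AIME_QUESTIONS - 1) / AIME_QUESTIONS, 1)

if __name__ == "__main__":
    for name, gap in gains(scores).items():
        print(f"{name}: +{gap} points over o1")
    print(f"AIME 2024 with one miss: {aime_score}%")
```

Under that 30-question assumption, a single wrong answer reproduces the quoted 96.7%.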

Fields Medalist Timothy Gowers on the hundreds of questions in the FrontierMath benchmark (Nov/2024): ‘…all looked like things I had no idea how to solve… Getting even one question right would be well beyond what we can do now, let alone saturating them.’ [To score 25.2%, o3 must have got at least 63 of 250 questions correct]
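The bracketed arithmetic can be reproduced in a couple of lines. This is a minimal sketch that takes the 250-question total from the note above as given; the function name `correct_answers` is hypothetical.

```python
def correct_answers(score_percent, total_questions):
    """Convert a percentage score into a whole number of correct answers."""
    return round(score_percent / 100 * total_questions)


# Question count taken from the bracketed note above, not independently verified.
TOTAL_QUESTIONS = 250
solved = correct_answers(25.2, TOTAL_QUESTIONS)

# Convert back to a percentage to confirm the round trip.
check_percent = round(100 * solved / TOTAL_QUESTIONS, 1)

if __name__ == "__main__":
    print(f"{solved} of {TOTAL_QUESTIONS} correct = {check_percent}%")
```

With those inputs, 63 of 250 corresponds exactly to 25.2%.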

“OpenAI is currently prepping the next generation of its o1 reasoning model, which takes more time to “think” about questions users give it before responding, according to two people with knowledge of the effort.