OpenAI’s new o1 model with “high” reasoning effort gets the top score on the new aider polyglot leaderboard, significantly ahead of other top LLMs. The new polyglot benchmark uses many popular coding languages and was designed to be much more challenging than aider’s old code editing benchmark. This more clearly distinguishes the performance of today’s strongest coding models and leaves headroom for future LLMs.
Aider’s original code editing benchmark was saturating as the top scores approached and then surpassed 80%. Sonnet’s 84.2% score came from solving 112 of the 133 exercises, leaving only 21 unsolved. Each new champion advanced the top score by solving just 1-2 more problems than the previous record holder, which made it hard to clearly measure the differences in code editing skill between these top models.
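A quick back-of-the-envelope calculation makes the saturation concrete. This is just a sketch using the numbers quoted above (112 of 133 exercises solved), not part of the benchmark code itself:

```python
# Score granularity of the original 133-exercise benchmark.
total_exercises = 133
solved = 112

score = solved / total_exercises
print(f"Sonnet's score: {score:.1%}")  # 84.2%

# Each additional solved exercise is worth less than one point,
# so a new record that solves 1-2 more problems barely moves the score.
per_problem = 1 / total_exercises
print(f"Value of one exercise: {per_problem:.1%}")  # 0.8%
print(f"Score after solving 2 more: {(solved + 2) / total_exercises:.1%}")  # 85.7%
```

With each exercise worth less than a single percentage point, the strongest models ended up clustered within a couple of points of each other.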
Part of the problem is that many of the original 133 Python problems are very easy and pose little challenge to today’s frontier LLMs. Models as old as GPT 3.5 Turbo were able to solve half of them. Such easy problems simply inflate the benchmark scores of modern LLMs without providing any signal about which models are stronger or weaker.