
Aider LLM leaderboards

Submitted by
Style Pass
2024-05-07 13:30:05

Aider works best with LLMs which are good at editing code, not just good at writing code. To evaluate an LLM’s editing skill, aider uses a pair of benchmarks that assess a model’s ability to consistently follow the system prompt to successfully edit code.

The leaderboards below report the results from a number of popular LLMs. While aider can connect to almost any LLM, it works best with models that score well on the benchmarks.

Aider’s code editing benchmark asks the LLM to edit Python source files to complete 133 small coding exercises. This benchmark measures the LLM’s coding ability, but also whether it can consistently emit code edits in the format specified in the system prompt.
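
To make the second part of that concrete: an edit only counts if it can actually be located and applied in the target file. The sketch below is a hypothetical, simplified search/replace helper; the rules and names are assumptions for illustration, not aider’s actual edit format or implementation. A model that misquotes the existing code or drifts from the requested format produces edits that fail at exactly this step.

```python
# Minimal sketch of applying a search/replace style edit to a source file.
# The helper name and rules are illustrative assumptions, not aider's
# actual edit format or implementation.
def apply_search_replace(source: str, search: str, replace: str) -> str:
    if search not in source:
        # An edit whose "search" text does not match the file cannot be
        # applied -- this is what happens when a model drifts from the
        # requested format or misquotes the existing code.
        raise ValueError("search block not found in source")
    return source.replace(search, replace, 1)


original = "def add(a, b):\n    return a - b\n"
fixed = apply_search_replace(
    original,
    search="    return a - b\n",
    replace="    return a + b\n",
)
print(fixed)
```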

Aider’s refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This is a more challenging benchmark, which tests the model’s ability to output long chunks of code without skipping sections or making mistakes. It was developed to provoke and measure GPT-4 Turbo’s “lazy coding” habit.
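
“Lazy coding” here means the model skips part of the code it was asked to return, replacing it with a placeholder comment such as “# ... rest of the method unchanged ...”, which leaves the file broken. As a rough illustration only, the snippet below flags such placeholders in a model’s output; the patterns and helper are hypothetical and are not how aider’s benchmark actually detects or scores laziness.

```python
import re

# Hypothetical check for "lazy" placeholder elisions in model output.
# The patterns and this helper are illustrative only -- they are not
# how aider's refactoring benchmark scores lazy coding.
LAZY_PATTERNS = [
    r"#\s*\.\.\.",                      # "# ..."
    r"#\s*rest of .* unchanged",        # "# rest of the method unchanged"
    r"#\s*existing code (goes )?here",  # "# existing code here"
]

def looks_lazy(model_output: str) -> bool:
    return any(re.search(p, model_output, re.IGNORECASE) for p in LAZY_PATTERNS)

print(looks_lazy("def f():\n    # ... rest of the method unchanged ...\n"))  # True
print(looks_lazy("def f():\n    return 42\n"))                               # False
```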

The refactoring benchmark requires a large context window to work with large source files. Therefore, results are available for fewer models.
