We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs.
Each of the 7 environments in the benchmark is centered on a research engineering task, such as fitting a scaling law or optimizing a GPU kernel. The environments were selected in consultation with ML researchers in academia and top industry labs for realism and coverage. In each environment, the agent, which can be a model or a human, is given access to a computer (often with several GPUs), a scoring function (e.g., accuracy on a dataset or the runtime of a training loop), and any other necessary resources. The agent is then instructed to achieve as high a score as possible within a fixed time limit.
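To make the setup concrete, here is a minimal sketch of what an environment of this shape could look like in code. The class and function names (`Environment`, `run_attempt`, `agent_step`) are illustrative assumptions for this post, not the actual benchmark implementation: the key pieces are simply a scoring function, a fixed time budget, and an agent that iterates until the budget runs out.

```python
# Hypothetical sketch of an RE-Bench-style environment interface.
# Names and structure are assumptions for illustration, not the real benchmark code.
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    """One research engineering task: a scoring function plus a time budget."""
    name: str
    score_fn: Callable[[str], float]  # scores the agent's current solution
    time_limit_s: float               # fixed wall-clock budget for the attempt


def run_attempt(env: Environment, agent_step: Callable[[float], str]) -> float:
    """Let an agent (human or model) iterate until the budget is spent; return its best score."""
    start = time.monotonic()
    best = float("-inf")
    while (elapsed := time.monotonic() - start) < env.time_limit_s:
        solution = agent_step(env.time_limit_s - elapsed)  # agent sees remaining time
        best = max(best, env.score_fn(solution))
    return best


if __name__ == "__main__":
    # Toy usage: the "score" is just the length of the submitted string.
    env = Environment("toy-task", score_fn=len, time_limit_s=0.01)
    print(run_attempt(env, agent_step=lambda remaining: "x" * 10))
```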
In general, we find that AI agents perform better than humans at AI research engineering tasks when both are given 2 hours, but they perform worse at higher time budgets (see the Results section for more detail on what we mean by “time budget”). Compared to our tasks, real-world ML research often involves much larger projects that use more compute over longer periods of time, so we think the high-budget results are the most important.