The best AI programmer, from Weights & Biases

2025-01-16 22:00:02

I’m the co-founder and CTO of Weights & Biases. I’ve spent the last couple months dogfooding our tools and building autonomous programming agents.

I made an OpenAI o1-based AI programming agent that is now state of the art on SWE-Bench Verified! It resolves 64.6% of issues. To do it, I made heavy use of our Weave toolkit for AI applications, learned a ton about o1, and built lots of new stuff along the way.

If you’re not familiar, SWE-Bench Verified is the best existing benchmark for software engineering agents. It’s a set of 500 GitHub issues, Docker images, and held-out unit tests. Typical agents operate autonomously within Docker containers as a human programmer would, iteratively reading and writing code and tests until they believe the issue is solved.

Our solution is the first o1-based agent that we know of, and it tops the SWE-Bench Verified leaderboard. It’s also a significant improvement over OpenAI’s published o1 result, which used a basic agent framework.