Tech Things: AI Benchmarks, O3, and the Future of SWE


I haven't been writing the "Tech Things" series long enough to claim that there are any long-running themes on this blog, but one theme you could maybe point to is the belief that, contrary to the naysayers, AI is going to keep growing and keep becoming more powerful. One reason for this belief is that AI systems continue to outperform expectations: we set challenges and benchmarks that top scientists believe are untouchable, only for AI to blow past them within months.

If you were thinking about this from first principles, there are a few kinds of challenges you might want to use to put AI through its paces. You want to make sure your challenges aren't easily google-able — otherwise the answers may already be in the training set for these models. You want to make sure these challenges are good proxies for usefulness — no one cares if your model is really good at underwater basket weaving; that's just not an industry anyone cares about. And you want to make sure these challenges are easy for humans but hard for AI — if your AI system can count the number of R's in "strawberry" properly, maybe those naysayers will finally shut up about how dumb your systems are and start appreciating your AI's (and by proxy, your) genius.
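As an aside on that last point: counting letters is trivial for ordinary code, and the difficulty for language models is usually attributed to subword tokenization rather than the task itself. A minimal Python sketch (the helper name is my own, purely illustrative):

    def count_letter(word: str, letter: str) -> int:
        # Case-insensitive count of a single letter in a word.
        return word.lower().count(letter.lower())

    print(count_letter("strawberry", "r"))  # prints 3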

In general, if your goal is AGI, you want to construct benchmarks that are reasonable approximations of intelligence. And then, if you get really good at those benchmarks, you can say you've achieved AGI.
