Mark Zuckerberg recently came out saying that “AI” will soon be able to do what mid-level engineers do. Meanwhile, some claim AGI (artificial general intelligence) is already here, while others are arguing about how many decades away it is.
Back during my time at Twitter, I worked on an ML evaluation tool that went beyond a single metric. Its purpose was to give you a scorecard with everything you need to know about a model's real-world performance before launch. It was meant to work for every model at Twitter, and beyond, since we were talking about open-sourcing it.
To those who don't deal with ML every day, this may sound like a hard engineering problem: implementing a whole bunch of metrics across various architectures. Yet the real difficulty was figuring out what to test and how.
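To make that distinction concrete, here is a minimal sketch of what "a whole bunch of metrics" looks like in code. The `scorecard` function and the metrics inside it are illustrative, not the actual tool we built: writing this part is easy; deciding which rows belong on the scorecard for a given model is the hard part.

```python
# Illustrative only: a scorecard is just a bundle of complementary metrics
# computed on the same predictions. These four are examples, not the real tool.
from sklearn.metrics import roc_auc_score, log_loss, precision_score, recall_score


def scorecard(y_true, y_score, threshold=0.5):
    """Return several views of binary-classifier quality at once."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "auc": roc_auc_score(y_true, y_score),         # ranking quality
        "log_loss": log_loss(y_true, y_score),         # probability quality
        "precision": precision_score(y_true, y_pred),  # cost of false positives
        "recall": recall_score(y_true, y_pred),        # cost of false negatives
    }


# Toy example
print(scorecard([0, 1, 1, 0, 1], [0.2, 0.9, 0.6, 0.4, 0.3]))
```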
Fast forward to December 2024, and the conversations I’ve had with some of the brightest minds in the industry at NeurIPS confirm that we have no idea how to truly evaluate LLMs (which have become synonymous with AI), to say nothing of agents. We have bits and pieces of the answer (I helped with some of them), but no real consensus on what to measure or how to ensure the measurement is correct.