Generative AI is in its Cambrian Explosion phase. So many people are taking the models and messing around with them in so many ways to try and see if they can make something work. There's just one problem: how do you tell if a model is good or not without actually setting it up and using it?
One of the biggest points of comparison is the Elo score that each model gets when it's added to the LMSYS Chatbot Arena leaderboard. This is generally seen as a consistent way to evaluate models, but it's kinda costly to do these evaluations and takes a fairly long time to converge on a ranking that "makes sense". My favorite model Hermes 3 70b isn't currently on the leaderboard (probably because it was released like two weeks ago), but the model it's based on (Facebook's Llama 3.1 70b Instruct model) currently ranks 16th.
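To get a feel for why a ranking like this takes a while to settle, here's a minimal sketch of the textbook Elo update after one head-to-head comparison. This is an illustration, not the leaderboard's actual code: the function names, the K-factor of 32, and the 1000-point starting rating are all assumptions I picked for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k (assumed 32 here) caps how far a single result can move a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical models that both start at 1000; model A wins one vote.
    a, b = update(1000.0, 1000.0, score_a=1.0)
    print(a, b)
```

Each vote only nudges a rating by at most `k` points, so a new model needs a lot of human comparisons before its number stops bouncing around.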
The other big point of comparison is benchmarks. They give researchers a way to perform epic rock ballads in the form of expensive research that results in papers on how their models and training methodologies compare. Benchmarks give you scores that are easy to publish, but also easy to validate. They have names like AGIEval, HellaSwag, WinoGrande, and MMLU-Pro. They allegedly test how well models do human-useful tasks such as completing sentences, general knowledge (the kind that the Gaokao, law school bar exam, and SAT test), and common sense reasoning. These benchmarks align with what researchers think large language models are useful for.
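Part of why these scores are so easy to publish and validate is that many benchmarks boil down to multiple-choice questions with a known answer key. Here's a rough sketch of that kind of scoring, assuming the model's output has already been reduced to a single letter per question. The `accuracy` helper and the sample data are made up for illustration and aren't taken from any real benchmark harness.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must be the same length")
    if not answers:
        raise ValueError("need at least one question to score")
    # Compare case-insensitively and ignore stray whitespace around the letter.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Hypothetical model picks vs. the answer key for four questions.
    model_picks = ["A", "b", "C", "D"]
    answer_key = ["A", "B", "D", "D"]
    print(f"accuracy: {accuracy(model_picks, answer_key):.2%}")
```

Anyone with the same questions and the same answer key can rerun this and check the number, which is a lot cheaper than collecting thousands of human votes.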