Every week, it seems like another AI provider releases a state-of-the-art model. These announcements come with impressive benchmarks, but those benchmarks rarely reflect real-world use cases. So, how do you know if the new model is worth deploying in your app?
To gauge whether a particular model will improve your application, it’s first worth understanding how well your app is currently performing. In AI applications, performance is measured with evaluations (evals) that score the accuracy or quality of your LLM outputs. Setting up a baseline is simple: the easiest way is to run an eval with a set of data, the AI function you want to test, and a scoring function.
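For instance, a baseline eval might look something like this minimal sketch in TypeScript, which uses Braintrust's `Eval` entry point and the `Factuality` scorer from `autoevals`. The project name, example data, and model name are placeholders for your own:

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import OpenAI from "openai";

const client = new OpenAI();

Eval("My App", {
  // A few representative examples; in practice, use data from your own app.
  data: () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  // The AI function under test -- here, a single call to your current model.
  task: async (input: string) => {
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini", // placeholder for the model your app uses today
      messages: [{ role: "user", content: input }],
    });
    return completion.choices[0].message.content ?? "";
  },
  // A scoring function that grades each output against the expected answer.
  scores: [Factuality],
});
```

Running this creates an experiment in Braintrust with per-example scores, giving you the baseline to compare new models against.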
The best way to evaluate a new AI model is to test it against the actual data your app handles in production. Generic benchmarks might give a sense of performance, but only your data can reveal how well a model works in your product. To do this in Braintrust, start by pulling real logs from your app and organizing them into a dataset. Consider selecting logs that are currently underperforming so you can see whether the new model moves the scores.
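You can curate this dataset from the Braintrust UI, or build it programmatically. Here is a rough sketch of the latter, assuming the `initDataset` API from the Braintrust SDK; the project name, dataset name, and the shape of the logged examples are illustrative:

```typescript
import { initDataset } from "braintrust";

// Create (or connect to) a dataset in your Braintrust project.
const dataset = initDataset("My App", { dataset: "Underperforming logs" });

// Stand-in for logs exported from your app -- for example, ones whose
// scores fell below a threshold you care about.
const loggedExamples = [
  {
    question: "What is your refund policy?",
    idealAnswer: "Refunds are available within 30 days of purchase.",
  },
];

for (const example of loggedExamples) {
  dataset.insert({
    input: example.question,
    expected: example.idealAnswer,
    metadata: { source: "production-logs" },
  });
}

// Make sure all records are written before the script exits.
await dataset.flush();
```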
Then, run an evaluation on that dataset with the new model and directly compare its performance (along with other factors like cost, token usage, and more) against the model you’re already using.
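One way to set this up is to run the same eval once per model, so the resulting experiments sit side by side in Braintrust. The sketch below assumes the dataset from the previous step and uses two OpenAI model names purely as stand-ins for your current model and the candidate:

```typescript
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";
import OpenAI from "openai";

const client = new OpenAI();

// Run one experiment per model so they can be compared side by side
// on scores, cost, tokens, and latency.
for (const model of ["gpt-4o-mini", "gpt-4o"]) {
  await Eval("My App", {
    experimentName: `qa-${model}`,
    // Reuse the dataset built from production logs.
    data: initDataset("My App", { dataset: "Underperforming logs" }),
    task: async (input: any) => {
      const completion = await client.chat.completions.create({
        model,
        messages: [{ role: "user", content: String(input) }],
      });
      return completion.choices[0].message.content ?? "";
    },
    scores: [Factuality],
  });
}
```

Because both experiments use the same dataset and scorers, any difference in the results reflects the model swap rather than a change in the test set.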