
A Builder's Guide to Evals for LLM-based Applications


Creating evals for LLM-powered applications is hard. We could spend weeks and still not have evals that correlate well with application-specific performance. Furthermore, general benchmarks such as MMLU and MT-Bench are noisy indicators—improved performance on them doesn’t always translate to improved performance on task-specific evals.

In this write-up, we’ll discuss some evals I’ve found useful. The goal is to spend less time creating evals so we have more time to build applications. We’ll focus on evals for simpler, common tasks such as classification/extraction, summarization, and translation. We’ll also see how to assess the risk of copyright regurgitation and toxicity.

At the end, we’ll discuss the role of human evaluation and how to calibrate the evaluation bar to balance potential benefits against risks and mitigate the Innovator’s Dilemma.

Note: I’ve tried to make this accessible for folks who don’t have a data science or machine learning background. Thus, it starts with the basics of classification eval metrics. Feel free to skip any sections you’re already familiar with.
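As a quick preview of those basics, here’s a minimal sketch (using scikit-learn and made-up labels, not data from any real application) of computing precision, recall, and F1 for a binary classification eval:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for a binary classification eval:
# 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

# Precision, recall, and F1 summarize how well predictions match the labels
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```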
