Running LLM evals right next to your code

Submitted by
Style Pass
2025-01-15 11:00:08

One of the first things you learn when building applications with LLMs is that you need evals. Evals are like traditional software tests, but for interactions with LLMs. Because LLM output is not deterministic, a plain pass/fail assertion is often too strict, so evals typically assign each result a score between 0 and 1 based on some scoring mechanism. Various platforms for storing and displaying eval results already exist, but wouldn’t it be nice if evals could live in your existing test system, and be run and displayed in CI, just like your tests? I’ve explored just that, and in this post, I’ll show you what I found. It involves a SQLite database for tracking results over time, PR comments with Markdown tables, and a lot of GitHub Actions workflow YAML.

Note: I’ve been looking for an eval pipeline for a while. My applications are mainly written in Go, but nothing in the approach described here is Go-specific. If you can run evals in your language and tooling of choice, you should be able to build something similar, or reuse some of what I’ve built.
