
DAGWorks’s Substack

Submitted by Style Pass, 2024-12-30 17:30:02

To build reliable AI software, i.e. wrangle the non-determinism inherent in LLMs, one must graduate from shipping on “vibes” and instead invest in building systematic evaluations. Much like the unit test suites that keep standard software reliable, LLM-powered software also needs a suite of tests to directionally assess, or evaluate, its reliability. The hard part is that these tests aren’t as simple as those of regular software.

In this post we show one possible approach to building an LLM application test suite and using it in a test-driven manner — with pytest! If you aren’t using Python, don’t use pytest, or don’t use Burr, this post will still be helpful at a high level — you can replicate the learnings with other testing libraries/systems. Overall, this post should leave you with:
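To make the idea concrete, here is a minimal sketch of what an LLM eval expressed as a pytest suite can look like. The `classify_sentiment` function and the golden dataset are hypothetical stand-ins (a real version would call an LLM); the point is the shape: a fixed set of input/expected pairs run through `pytest.mark.parametrize`.

```python
import pytest

# Hypothetical application function, stubbed with a keyword heuristic so this
# example runs standalone. In a real app this would call an LLM.
def classify_sentiment(text: str) -> str:
    positive = {"love", "great", "excellent"}
    return "positive" if any(w in text.lower() for w in positive) else "negative"

# A small "golden" dataset: inputs paired with expected labels.
# In practice this grows over time as you collect failure cases.
GOLDEN_CASES = [
    ("I love this product", "positive"),
    ("This is the worst thing I have bought", "negative"),
]

@pytest.mark.parametrize("text,expected", GOLDEN_CASES)
def test_sentiment_classification(text: str, expected: str) -> None:
    # Exact-match assertion for simplicity; LLM evals often assert on
    # properties instead (label is in an allowed set, score above a
    # threshold) to tolerate benign output variation.
    assert classify_sentiment(text) == expected
```

Running `pytest` over a golden dataset like this gives you a directional reliability signal on every change, rather than a one-off vibe check.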

Side note: if this topic interests you, Hugo Bowne-Anderson and Stefan Krawczyk have a Maven course, “Building LLM Applications for Data Scientists and Software Engineers,” starting January 6th that covers content like this to help you ship reliable AI applications.
