
Generating Conversation


We’ve had RunLLM in users’ hands for 4 months now, and we’ve learned a lot. Most of what we’ve learned has, unsurprisingly, turned into product improvements, but we’ve also started to notice some clear trends in how customers evaluate products. These observations are mostly meant to help teams selling early-stage AI products, but we also hope they’ll serve as gentle nudges for those on the other side of the table.

As we’ve talked about recently, evaluating LLMs and LLM-based products is hard, so we can’t blame anyone evaluating a new tool for not having a clear framework. Unlike evaluations of a CRM or a database, there aren’t set criteria or neat feature matrices. In other words, we’re all making it up as we go. 🙂

You learn by touching the stove. Perhaps the most interesting trend we’ve found is that some of our best customers have tended to be those who have already tried to build a custom assistant in-house or with a third-party vendor. They’ve seen what an AI product that’s not fully matured looks like and how highly variable the responses can be, so they know what kinds of failure modes to look for. They can also recognize higher quality responses more quickly.

These have also been the teams who show up to our conversations with pre-defined evaluation sets: they baseline their expectations of our assistant against that set and continue from there. Good performance on the evaluation set isn’t proof of good quality, but poor performance is a reliable sign of bad quality. The flip side is that teams who haven’t yet experienced an AI assistant tend to rely on vibes only (more below).
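For teams putting together that kind of baseline, the harness doesn’t need to be sophisticated. Here’s a minimal sketch in Python — everything in it is hypothetical: `ask_assistant` is a placeholder for whichever assistant you’re evaluating, and the keyword-overlap score is a deliberately crude stand-in for actually reading the answers:

```python
# Hypothetical baseline harness: run a pre-defined evaluation set through an
# assistant and flag responses that miss the expected key points.


def ask_assistant(question: str) -> str:
    """Placeholder: replace with a call to the assistant under evaluation."""
    return "..."


EVAL_SET = [
    # (question, key points a good answer should mention)
    ("How do I rotate my API key?", ["settings", "regenerate"]),
    ("Does the connector support Postgres 16?", ["postgres", "16"]),
]


def keyword_recall(answer: str, key_points: list[str]) -> float:
    """Crude proxy for quality: fraction of key points mentioned in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for point in key_points if point.lower() in answer_lower)
    return hits / len(key_points)


def run_baseline(threshold: float = 0.5) -> None:
    for question, key_points in EVAL_SET:
        answer = ask_assistant(question)
        score = keyword_recall(answer, key_points)
        flag = "REVIEW" if score < threshold else "ok"
        print(f"[{flag}] score={score:.2f} q={question!r}")


if __name__ == "__main__":
    run_baseline()
```

The scoring function isn’t the point — real evaluation means reading the answers. What matters is that the question set is fixed up front, so poor performance is easy to spot and hard to explain away.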
