ChatGPT is Slipping :: Adriano Caloiaro's personal blog

Three months ago I put code into production that uses the gpt-4o and/or gpt-4o-mini models to analyze feedback about businesses and categorize it. The prompts instruct the models to identify categories of feedback and, in a second phase, to extract some examples of what people said. This is a simplification, but it took very little effort to craft prompts that enabled even the modest gpt-4o-mini model to do exactly that. Given how little effort a working solution took, it didn’t feel like a stretch to conclude that this use case was well within ChatGPT’s limits. The results were genuinely useful, and the effort was low. It seemed like an obvious win.
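To make the two phases concrete, here is a minimal sketch in Go of how such a pipeline could be wired together. It is not the production code: the Completer interface, the Analyze function, the prompt wording, and the fakeModel stub are all hypothetical names invented for illustration, and a real implementation would call the model API where the stub returns canned text.

```go
package main

import (
	"fmt"
	"strings"
)

// Completer is a hypothetical stand-in for a chat-completion client.
// A real implementation would send the prompt to the model API.
type Completer interface {
	Complete(prompt string) (string, error)
}

// Analyze runs two phases: first it asks for feedback categories,
// then it asks for example quotes for each category.
func Analyze(c Completer, feedback []string) (map[string][]string, error) {
	joined := strings.Join(feedback, "\n")

	raw, err := c.Complete("List the categories of feedback, one per line:\n" + joined)
	if err != nil {
		return nil, fmt.Errorf("phase 1 (categories): %w", err)
	}

	result := make(map[string][]string)
	for _, category := range splitLines(raw) {
		prompt := fmt.Sprintf("Quote examples of %q feedback, one per line:\n%s", category, joined)
		quotes, err := c.Complete(prompt)
		if err != nil {
			return nil, fmt.Errorf("phase 2 (%s): %w", category, err)
		}
		result[category] = splitLines(quotes)
	}
	return result, nil
}

// splitLines returns the non-empty, trimmed lines of s.
func splitLines(s string) []string {
	var out []string
	for _, line := range strings.Split(s, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			out = append(out, line)
		}
	}
	return out
}

// fakeModel is a canned stub so the sketch runs without network access.
type fakeModel struct{}

func (fakeModel) Complete(prompt string) (string, error) {
	switch {
	case strings.HasPrefix(prompt, "List the categories"):
		return "service\nprice\n", nil
	case strings.Contains(prompt, `"service"`):
		return "The staff were friendly", nil
	default:
		return "Too expensive", nil
	}
}

func main() {
	feedback := []string{"The staff were friendly", "Too expensive"}
	result, err := Analyze(fakeModel{}, feedback)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(result)
}
```

Hiding the model behind a small interface is only one possible design, chosen here so the two-phase flow can be exercised with a stub.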

The code ran every two weeks, and the models did an admirable job every time. Before putting results in front of users, I let the code run silently, with its output visible only to a select few within our company. Subjectively, we were satisfied with ChatGPT’s results and figured it was time to put them in front of users. So I threw some basic integration tests around the feature, gave it the old go test -count 1000 ..., and released it to the world.

Like most software engineers, I’m all too familiar with flaky tests, and I doubted that tests around LLM output would be sufficiently predictable. But running a test suite with go test -count 1000 without a single failure is pretty confidence-inspiring, and it turned out that my hesitation was not well-founded: the tests simply never failed for three months. I was so skeptical that a few times over the last month, I cracked the tests open to make sure we weren’t simply getting false negatives. We weren’t. The tests were running multiple times a day on local development machines and during both staging and production deployments, about 50 times per week in total, and they remained rock solid.
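The post does not say what the integration tests actually asserted, so the following is only a guess at the kind of check that can stay stable across differently worded model responses: it validates structural properties of the output rather than exact text. The CheckResult function and the properties it enforces (at least one category, no empty names, every extracted example appearing in the source feedback) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckResult is a hypothetical validator. It checks properties that should
// hold for any acceptable response, regardless of the model's exact wording.
func CheckResult(feedback []string, result map[string][]string) error {
	if len(result) == 0 {
		return fmt.Errorf("no categories returned")
	}
	// Compare case-insensitively against the full body of feedback.
	corpus := strings.ToLower(strings.Join(feedback, "\n"))
	for category, quotes := range result {
		if strings.TrimSpace(category) == "" {
			return fmt.Errorf("empty category name")
		}
		if len(quotes) == 0 {
			return fmt.Errorf("category %q has no examples", category)
		}
		for _, q := range quotes {
			needle := strings.ToLower(strings.TrimSpace(q))
			if needle == "" {
				return fmt.Errorf("category %q has an empty example", category)
			}
			// An example that is not in the feedback was likely invented.
			if !strings.Contains(corpus, needle) {
				return fmt.Errorf("category %q: example %q not found in feedback", category, q)
			}
		}
	}
	return nil
}

func main() {
	feedback := []string{"The staff were friendly", "Too expensive for what you get"}

	good := map[string][]string{"service": {"the staff were friendly"}}
	bad := map[string][]string{"price": {"Cheapest in town"}}

	fmt.Println("good:", CheckResult(feedback, good))
	fmt.Println("bad:", CheckResult(feedback, bad))
}
```

An integration test could call a check like this on a live model response and fail when it returns an error. Running the suite with go test -count 1000 then repeats each test 1000 times, which is what makes intermittent failures visible.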
