ChatGPT is Slipping :: Adriano Caloiaro's personal blog

Submitted by Style Pass

2024-11-18 05:30:04

Three months ago I put code in production that uses the gpt-4o and/or gpt-4o-mini models to analyze feedback about businesses and categorize it. The prompts instruct the models to identify categories of feedback and, in a second phase, to extract some examples of what people said. This is a simplification, but it took very little effort to craft prompts that enabled even the modest gpt-4o-mini model to do exactly that. Given how little effort a working solution took, it didn’t feel like a stretch to assume that this use case was well within ChatGPT’s limits. The results were genuinely useful, and the effort was low. It seemed like an obvious win.

The code ran every two weeks, and the models had done an admirable job every time. Before putting results in front of users, I let the code run silently, only visible to a select few within our company. Very subjectively, we were satisfied with ChatGPT’s results and figured it was time to put them in front of users. So I threw some basic integration tests around the feature, gave it the old go test -count 1000 ..., and released it to the world.

Like most software engineers, I’m all too familiar with flaky tests, and I doubted that tests around LLM output would be sufficiently predictable. But running a test suite with go test -count 1000 without a single failure is pretty confidence-inspiring, and it turned out that my hesitation was not terribly well-founded. The tests simply never failed for three months. I was so skeptical that a few times over the last month, I cracked the tests open and made sure we weren’t simply getting false negatives. We weren’t, and tests were running multiple times a day on local development machines and during both staging and production deployments. The tests were running about 50 times per week and remained rock solid.
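One way tests around LLM output can stay predictable is to assert on structure rather than exact wording. The sketch below illustrates that idea only; the Category type and validateCategories function are hypothetical and are not taken from the test suite described above. It checks that category names are non-empty and unique, and that every extracted example actually appears in the input feedback.

```go
package main

import (
	"fmt"
	"strings"
)

// Category is a hypothetical shape for one category returned by the model.
type Category struct {
	Name     string
	Examples []string
}

// appearsIn reports whether the example occurs (case-insensitively) in any
// of the feedback items. Empty examples never match.
func appearsIn(feedback []string, example string) bool {
	example = strings.ToLower(strings.TrimSpace(example))
	if example == "" {
		return false
	}
	for _, f := range feedback {
		if strings.Contains(strings.ToLower(f), example) {
			return true
		}
	}
	return false
}

// validateCategories checks properties that should hold regardless of how
// the model phrases its answer.
func validateCategories(feedback []string, categories []Category) error {
	if len(categories) == 0 {
		return fmt.Errorf("no categories returned")
	}
	seen := map[string]bool{}
	for _, c := range categories {
		name := strings.ToLower(strings.TrimSpace(c.Name))
		if name == "" {
			return fmt.Errorf("empty category name")
		}
		if seen[name] {
			return fmt.Errorf("duplicate category %q", c.Name)
		}
		seen[name] = true
		for _, ex := range c.Examples {
			if !appearsIn(feedback, ex) {
				return fmt.Errorf("example %q not found in feedback", ex)
			}
		}
	}
	return nil
}

func main() {
	feedback := []string{"The staff was friendly but the wait was long."}
	categories := []Category{
		{Name: "Service", Examples: []string{"the staff was friendly"}},
	}
	if err := validateCategories(feedback, categories); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("valid")
}
```

In a real suite, a check like this would sit inside a Test function in a _test.go file, so that go test -count 1000 repeats it against fresh model output on every run.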
