I’ve been working on extractGPT, a tool powered by large language models (LLMs) that extracts structured data from web pages.
I recently wanted to switch the underlying model from OpenAI’s older GPT-3 to the more affordable ChatGPT model. But first, I needed to make sure that the new model would perform just as well for my use case.
There are dozens of startups building tools for instrumenting API calls to large language models. They let you do useful things like track cost and latency, collect user feedback, evaluate different prompts, and gather examples for fine-tuning. But because they only instrument the API call, they are not good at answering the ultimate question I care about: if I change part of the pipeline, are users going to get better or worse results?
LLMs are great, but to build a useful production app, you often need to do a bunch of pre-processing before you call them, and post-processing on the results to get an acceptable output.
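To make that concrete, here is a minimal sketch of what such a pipeline can look like. The function names (`preprocess`, `postprocess`, `extract`) and the title/price schema are my own illustrative assumptions, not extractGPT's actual code, and the model call is stubbed out with a fake function:

```python
import json
import re

def preprocess(html: str, max_chars: int = 4000) -> str:
    """Strip tags and collapse whitespace so the prompt stays within a token budget."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]

def postprocess(raw: str, required_keys: set) -> dict:
    """Parse the model's reply as JSON and verify the expected fields are present."""
    data = json.loads(raw)
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data

def extract(html: str, llm) -> dict:
    """Full pipeline: clean the input, call the model, validate the output."""
    prompt = f"Extract title and price as JSON from: {preprocess(html)}"
    return postprocess(llm(prompt), {"title", "price"})

# Stand-in for a real LLM API call, for demonstration only.
fake_llm = lambda prompt: '{"title": "Blue Mug", "price": "9.99"}'
result = extract("<h1>Blue Mug</h1><span>$9.99</span>", fake_llm)
```

The point is that the model call is only the middle step: a change to the model, the prompt, or either processing stage can shift the quality of `result`, which is why instrumenting the API call alone isn't enough.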