In our last post, we showed that, with some careful prompting, it is possible to use ChatGPT as a replacement for OpenAI Codex for generating SQL.
While the Codex models are still in a free limited beta, they are eventually likely to be more expensive than ChatGPT. Switching to the (very inexpensive) ChatGPT API could therefore be highly beneficial, but only if it works at least as well as the Codex model.
Our goal is to evaluate the quality of SQL-generating LLM prompts and to compare different language models. But how do we determine what constitutes a good prompt, and how do we measure the performance of different language models?
We have developed a minimal test suite for SQL-generating LLMs and their prompts. The suite establishes a performance baseline and helps us evaluate the quality of different prompts and language models.
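The core difficulty in testing text-to-SQL output is that two different SQL strings can be semantically equivalent, so comparing generated SQL to a reference string character-by-character is too strict. One common workaround, sketched below under assumptions of our own (the post does not describe the suite's internals), is to execute both the reference query and the generated query against a small fixture database and compare their result sets. The schema, data, and function names here are illustrative, not taken from the actual test suite.

```python
import sqlite3

def results_match(conn, reference_sql, generated_sql):
    """Judge a generated query by whether it returns the same rows
    as a hand-written reference query, rather than the same text."""
    ref = conn.execute(reference_sql).fetchall()
    gen = conn.execute(generated_sql).fetchall()
    # Compare as sorted lists so row order does not matter.
    return sorted(ref) == sorted(gen)

# Tiny in-memory fixture database (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Grace", 45), (3, "Alan", 41)],
)

reference = "SELECT name FROM users WHERE age > 40 ORDER BY name"
candidate = "SELECT name FROM users WHERE age >= 41"  # e.g. a model's output

print(results_match(conn, reference, candidate))  # True: same rows
```

A check like this passes queries that differ in formatting, aliasing, or clause order, which is exactly the leniency an LLM-generated query needs, though it can still produce false positives when two different queries happen to agree on one small fixture.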
In this post, the first in a series on testing prompts, we will discuss the challenges involved in testing text-to-SQL translation, introduce our test suite, and compare the performance of Codex and ChatGPT using the suite.