We’d like to share an LLM architectural pattern that we’ve found success with for dividing tasks between large and small language models. For many

Yule LOGIC – 12 Days of Fun LLM-Related Posts by LOGIC, Inc.

submited by
Style Pass
2024-12-30 23:00:06

We’d like to share an LLM architectural pattern that we’ve found success with for dividing tasks between large and small language models. For many tasks, it allows us to use smaller foundation models, like gpt-4o-mini while maintaining gpt-4o levels of capability. And there’s no fine-tuning involved! We show a 63% cost reduction without any loss in performance by taking advantage of high-precision few-shot examples.

The pattern exploits the few-shot learning capabilities of smaller models—quickly adapting to slightly new but well-defined tasks with just a handful of examples—and the zero-shot reasoning powers of larger models, which can handle completely novel instructions with no explicit training. We share a case-study below measuring the impact of this on real-world data and tasks.

A smaller model handles requests by default and only when a task exceeds some uncertainty or complexity does it escalate the request to a larger model. This approach can significantly reduce the computational overhead for average queries, since the larger model is only invoked when necessary. The models may even be running on different devices (e.g. a small model running on your phone and a large model running in a datacenter).

Leave a Comment