
Optimizing AI Inference at Character.AI


At Character.AI, we're building toward AGI. In that future state, large language models (LLMs) will enhance daily life, powering business productivity and entertainment, and helping people with everything from education to coaching, support, brainstorming, creative writing, and more.

To make that a reality globally, it's critical to achieve highly efficient "inference" – the process by which LLMs generate replies. As a full-stack AI company, Character.AI designs its model architecture, inference stack and product from the ground up. And we’re excited to share that we have made a number of breakthroughs in inference technology – breakthroughs that will make LLMs easier and more cost-effective to scale to a global audience. 

Our inference innovations are described in a technical blog post released today and available here. In short: Character.AI serves around 20,000 queries per second, roughly 20% of the request volume served by Google Search, according to public sources. We serve that volume at a cost of less than one cent per hour of conversation. We are able to do so because of our innovations around the Transformer architecture and the attention KV cache (the attention keys and values that must be stored and retrieved during LLM text generation), and around improved techniques for inter-turn caching.
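To illustrate the general idea behind inter-turn caching (not Character.AI's actual implementation, and the names `InterTurnKVCache`, `compute_kv`, and the toy KV tuples below are hypothetical placeholders), the sketch keeps per-token attention keys/values in a cache keyed by a hash of the conversation prefix, so a follow-up turn that shares its prefix with an earlier turn only recomputes attention state for its new tokens:

```python
# Minimal sketch of inter-turn KV caching, under the assumptions stated above.
import hashlib
from typing import Dict, List, Tuple

# Stand-in for the per-token attention keys/values a real model would produce.
KV = Tuple[int, int]


def prefix_key(tokens: List[int]) -> str:
    """Stable cache key for a token prefix."""
    return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()


def compute_kv(token: int) -> KV:
    """Placeholder for the expensive per-token attention computation."""
    return (token, token * token)


class InterTurnKVCache:
    def __init__(self) -> None:
        self._cache: Dict[str, List[KV]] = {}

    def get_or_compute(self, tokens: List[int]) -> Tuple[List[KV], int]:
        """Return KV entries for `tokens`, reusing the longest cached prefix.

        Also returns how many tokens actually had to be recomputed.
        """
        reused: List[KV] = []
        start = 0
        # Walk from the longest possible prefix down to the shortest.
        for end in range(len(tokens), 0, -1):
            hit = self._cache.get(prefix_key(tokens[:end]))
            if hit is not None:
                reused, start = list(hit), end
                break

        # Compute KV only for the new suffix, then cache the full prefix.
        new_kv = [compute_kv(t) for t in tokens[start:]]
        full = reused + new_kv
        self._cache[prefix_key(tokens)] = full
        return full, len(new_kv)


if __name__ == "__main__":
    cache = InterTurnKVCache()
    turn_1 = [101, 7, 42, 9]            # first user turn
    _, computed = cache.get_or_compute(turn_1)
    print(f"turn 1: computed KV for {computed} tokens")  # 4

    turn_2 = turn_1 + [13, 55]          # follow-up turn shares the prefix
    _, computed = cache.get_or_compute(turn_2)
    print(f"turn 2: computed KV for {computed} tokens")  # only 2
```

In a production serving stack the cached entries would be large tensors and eviction, sharding, and cache-hit routing all matter, but the payoff is the same: repeated conversation prefixes across turns do not have to be re-prefetched through the model.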

Taken together, these innovations make serving consumer LLMs far more efficient than legacy approaches. Since we launched Character.AI in 2022, we have reduced our serving costs by at least 33X. Serving our traffic now costs us 13.5 times less than it would cost a competitor building on top of the most efficient leading commercial APIs.
