Gemma-10M Technical Overview

Transformers, although powerful, are very compute-intensive, scaling O(n²) in time and memory with the number of tokens. This makes scaling context windows for modern LLMs very challenging. In Gemma-10M, we merge insights from recurrent neural networks with local attention blocks to capture long-term knowledge retention with O(1) memory and O(n) time. Thus, our solution allows models to expand to arbitrary context sizes.
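
To make the shape of this idea concrete, here is a minimal NumPy sketch of recurrent local attention. It is not the Gemma-10M implementation: the chunk size, the mean-pooled summary used as the recurrent memory, and all weight shapes are illustrative assumptions. Each chunk attends to itself plus a fixed-size memory carried forward from earlier chunks, so the state passed between chunks is O(1) in sequence length and the total work is O(n).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_recurrent_attention(tokens, W_q, W_k, W_v, chunk_size=256):
    """Single-head local attention over fixed-size chunks, with a small
    recurrent summary carried between chunks (hypothetical update rule).
    `tokens` is (n, d_model); the carried state is always (1, d_head)."""
    d = W_q.shape[1]
    mem_k = np.zeros((0, d))          # compressed keys from earlier chunks
    mem_v = np.zeros((0, d))
    outputs = []

    for start in range(0, len(tokens), chunk_size):
        x = tokens[start:start + chunk_size]            # (c, d_model)
        q, k, v = x @ W_q, x @ W_k, x @ W_v             # (c, d_head)
        c, m = len(x), len(mem_k)

        # Every position may attend to the memory; local attention is causal.
        keys = np.concatenate([mem_k, k])
        vals = np.concatenate([mem_v, v])
        mask = np.concatenate(
            [np.ones((c, m), dtype=bool), np.tril(np.ones((c, c), dtype=bool))],
            axis=1)
        scores = np.where(mask, q @ keys.T / np.sqrt(d), -np.inf)
        outputs.append(softmax(scores) @ vals)

        # Fold this chunk into a fixed-size summary: O(1) state, O(n) total time.
        s_k, s_v = k.mean(0, keepdims=True), v.mean(0, keepdims=True)
        mem_k = s_k if m == 0 else 0.5 * (mem_k + s_k)
        mem_v = s_v if m == 0 else 0.5 * (mem_v + s_v)

    return np.concatenate(outputs)
```

Calling this with, say, tokens = np.random.randn(4096, 64) and 64x64 projection matrices processes the sequence in 16 chunks while only ever carrying a single summary vector between them.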

The biggest bottleneck for expanding the context window is the growing size of the KV-cache, which stores the Key-Value pairs from prior tokens so that attention on the latest token does not have to recompute them. Without the cache, the total cost of generating a sequence grows cubically, making it all but necessary for longer sequences. The GIF below illustrates the idea of a KV-cache.
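
For intuition, here is a toy single-head decode step with a KV-cache in NumPy; the class and weight shapes are hypothetical, not the actual Gemma-10M code. The point is that each new token only projects itself, appends one Key/Value row, and attends over the cached rows, rather than re-projecting the entire prefix at every step.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Toy single-head attention decoder with a KV-cache (illustrative only)."""
    def __init__(self, d_head):
        self.keys = np.zeros((0, d_head))
        self.values = np.zeros((0, d_head))

    def step(self, x, W_q, W_k, W_v):
        # Project only the newest token and append its K/V rows to the cache.
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        # Attend from the new token to every cached position (O(t) per step).
        scores = softmax(q @ self.keys.T / np.sqrt(len(q)))
        return scores @ self.values
```

Without the cache, step t would re-project and re-attend over all t prior tokens, so generating n tokens costs O(n³) in total; with it, each step is linear in the prefix length and the whole generation is O(n²), at the price of storing the cache itself.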

However, the attention computation over this cache is expensive, and specifically it grows quadratically with context length. In particular, when computing attention over a context length of 1M, the score matrix alone has 1,000,000 x 1,000,000 = 1 trillion entries, which we can't fit on conventional hardware.
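
As a quick sanity check on those numbers (the fp16 assumption and the per-head, per-layer framing are ours):

```python
# Back-of-the-envelope size of the full attention-score table at 1M context.
seq_len = 1_000_000
entries = seq_len * seq_len        # 1,000,000 x 1,000,000 = 1e12 (one trillion)
bytes_fp16 = 2
print(f"{entries:.1e} entries ~= {entries * bytes_fp16 / 1e12:.0f} TB in fp16")
# -> 1.0e+12 entries ~= 2 TB in fp16, per attention head and per layer
```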
