Using window attention with attention sink tokens allows pretrained chat-style LLMs, such as all Llama, Mistral, MPT, Falcon, and GPT-NeoX (Pythia) models, to stay fluent across hundreds of subsequent prompts, unlike when these models are loaded with plain transformers. Furthermore, this approach keeps memory usage constant, whereas most LLMs loaded with transformers have linear space complexity and eventually run into memory issues.
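To give a sense of what this looks like in practice, here is a minimal sketch of loading a model with attention sinks. It assumes the attention_sinks package, which mirrors the transformers API; the keyword arguments and the example checkpoint are assumptions for illustration, not a definitive recipe.

```python
# Minimal sketch: loading a model with window attention + attention sink tokens.
# Assumes the `attention_sinks` package, a drop-in for `transformers`; the keyword
# arguments below are assumptions based on that package, shown for illustration.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    attention_sink_size=4,            # initial "sink" tokens that are always kept
    attention_sink_window_size=1020,  # sliding window over the most recent tokens
)
# Loading the same checkpoint with `transformers.AutoModelForCausalLM` instead gives
# the default behaviour: a cache that keeps growing with the length of the input.
```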
Large Language Models (LLMs) have taken the industry by storm and rapidly advanced the field of chatbots and virtual assistants. LLMs seem particularly adept at acting as (specialised) personal assistants, but they suffer from various limitations. In this blog post, we will focus on the following two major restrictions:
VRAM usage: Many LLMs (e.g. Llama 2) suffer from linear space complexity during inference: the key-value cache stores the keys and values of every previous token, so memory usage grows with the length of the conversation (see the sketch after this list). In a chat-assistant setting, this means that the VRAM limit of your device caps how long the user can keep prompting sequentially.
Loss of Fluency: All LLMs trained so far suffer from a loss of fluency once the input grows too long, e.g. beyond the context window they were pretrained with. When this occurs, the model loses the ability to produce coherent language and starts generating e.g. endless newlines, arbitrary characters (0OgOATO0OATO), broken Unicode (���), or repeated words (assistant: assistant: assistant: assistant:).
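To make the first limitation concrete, the sketch below shows how the key-value cache grows during plain transformers inference. gpt2 is used as a small stand-in checkpoint; the same pattern applies to Llama 2, Mistral, and the other models mentioned above.

```python
# Minimal sketch: the KV cache in plain `transformers` grows with every token.
# `gpt2` is a small stand-in; the same holds for Llama 2, Mistral, etc.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(5):
        outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        # Each layer caches keys/values of shape (batch, heads, seq_len, head_dim);
        # seq_len grows by one per generated token, i.e. memory usage is linear.
        print("cached sequence length:", past_key_values[0][0].shape[2])
        input_ids = outputs.logits[:, -1:].argmax(dim=-1)  # greedy next token
```

Window attention with attention sinks avoids this by keeping only the few sink tokens plus a fixed-size window of the most recent tokens in the cache, so memory usage stays constant regardless of how long the conversation runs.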