Layer-wise inferencing + batching: Small VRAM doesn't limit LLM throughput anymore


But now, for throughput-oriented use cases, you can run an LLM much larger than your VRAM and have it answer a yes/no question every 5 seconds on average. 1

This is possible with batching and layer-wise inferencing from disk, where we stream an LLM's layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer.
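In case it helps to see the shape of it, here's a minimal sketch of that loop in PyTorch. It assumes the model's layers have already been exported to one file apiece (layer_000.pt and so on, each deserializing to a module); the layer count, batch size, and file names are made up for illustration, and neither fltr nor FlexGen necessarily structures things this way.

```python
# A rough sketch of layer-wise inferencing + batching, assuming the model's
# per-layer weights were saved to disk as separate files (layer_000.pt,
# layer_001.pt, ...) that each deserialize to a torch.nn.Module.
import torch

NUM_LAYERS = 80      # assumed layer count, e.g. a 70B-class model
BATCH_SIZE = 256     # many prompts in flight at once
DEVICE = "cuda"      # a GPU with only a few GB of VRAM

def run_batch(hidden_states: torch.Tensor) -> torch.Tensor:
    """Push a whole batch of activations through the model one layer at a time.

    hidden_states: (BATCH_SIZE, seq_len, d_model) activations for every
    in-progress prompt. These stay resident in VRAM; they are tiny compared
    to the weights, which is what makes the scheme work.
    """
    for i in range(NUM_LAYERS):
        # Stream this layer's weights from disk straight into VRAM.
        # weights_only=False is needed on recent PyTorch to deserialize
        # a full nn.Module rather than a bare state dict.
        layer = torch.load(f"layer_{i:03d}.pt", map_location=DEVICE,
                           weights_only=False)

        # Run the layer against *every* prompt in the batch before moving on,
        # so the slow disk read is amortized across the whole batch.
        with torch.no_grad():
            hidden_states = layer(hidden_states)

        # Drop the layer so the next one fits in VRAM.
        del layer
        torch.cuda.empty_cache()

    return hidden_states
```

The activations for all the in-progress prompts stay in VRAM the whole time, since they're small; only the weights get streamed. Each slow layer load is paid once per batch instead of once per prompt, which is where the throughput comes from.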

And it turns out, this isn't a new technique! Moritz Thüning used it months ago in their fltr tool, which is like grep but for natural language questions: ask it a human question and it will run that question against many documents at once. And even before that, Sheng et al. wrote a paper on the technique and published their code as FlexGen, though unfortunately for me, it doesn't support my Mac.

