Layer-wise inferencing + batching: Small VRAM doesn't limit LLM throughput anymore

But now you can run an LLM much larger than your VRAM and, for throughput use cases, have it answer a yes/no question every 5 seconds on average. [1]

This is possible with batching and layer-wise inferencing from disk, where we stream an LLM's layers from the hard drive into VRAM, and run each layer against multiple in-progress prompts before moving on to the next layer.
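To make the idea concrete, here is a minimal toy sketch in plain Python. It is not the code of any real tool: the "layers" are tiny JSON files holding a hypothetical scale and bias, and names such as `save_toy_layers`, `layerwise_batched_forward`, and `per_prompt_forward` are made up for illustration. What it shows is the loop order: load one layer from disk, run it against every in-progress prompt, then move on, so the number of disk loads depends on the number of layers rather than layers times prompts.

```python
import json
import os
import tempfile


def save_toy_layers(directory, num_layers):
    """Write one small JSON file per layer; a stand-in for weight shards on disk."""
    paths = []
    for i in range(num_layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump({"scale": 1.0 + i, "bias": 0.5 * i}, f)
        paths.append(path)
    return paths


def load_layer(path):
    """Read a single layer's (toy) weights from disk."""
    with open(path) as f:
        return json.load(f)


def apply_layer(layer, hidden):
    """Toy layer: an elementwise affine transform of one prompt's hidden state."""
    return [layer["scale"] * x + layer["bias"] for x in hidden]


def layerwise_batched_forward(layer_paths, batch):
    """Load each layer once and run it against every prompt before moving on."""
    loads = 0
    states = [list(h) for h in batch]
    for path in layer_paths:
        layer = load_layer(path)  # only this one layer is held in memory
        loads += 1
        states = [apply_layer(layer, h) for h in states]
        del layer  # drop the reference before loading the next layer
    return states, loads


def per_prompt_forward(layer_paths, batch):
    """Baseline: stream the whole model from disk again for every prompt."""
    loads = 0
    outputs = []
    for h in batch:
        state = list(h)
        for path in layer_paths:
            layer = load_layer(path)
            loads += 1
            state = apply_layer(layer, state)
        outputs.append(state)
    return outputs, loads


def demo():
    """Run both strategies on 4 toy layers and 3 toy prompts."""
    with tempfile.TemporaryDirectory() as d:
        paths = save_toy_layers(d, num_layers=4)
        batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
        batched, batched_loads = layerwise_batched_forward(paths, batch)
        naive, naive_loads = per_prompt_forward(paths, batch)
    return batched, batched_loads, naive, naive_loads
```

Calling `demo()` returns the same outputs from both strategies, but the batched version reads each layer from disk once (4 loads) while the per-prompt baseline reads every layer once per prompt (12 loads). In a real setup the disk read is the slow part, so amortizing it over a larger batch is where the throughput would come from.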

And it turns out this isn't a new technique! Moritz Thüning used it months ago in their fltr tool, which is like grep but for natural language questions: ask it a question in plain language and it will run that question against many documents at once. And even before that, Sheng et al. wrote a paper about the technique and published their code as FlexGen, though unfortunately for me, it doesn't support my Mac.

