In our ongoing quest to help developers find the right libraries and LLMs for their use cases, we've turned our attention this month to benchmarking the latest five models:
Each model brings unique strengths, from Qwen2's rapid token generation to Llama's impressive efficiency under various token loads.
We tested them across six inference engines (vLLM, TGI, TensorRT-LLM, Triton with the vLLM backend, DeepSpeed-MII, and CTranslate2) on A100 GPUs hosted on Azure, keeping the setup separate from our Inferless platform to ensure a neutral playing field.
The goal? To help developers, researchers, and AI enthusiasts pinpoint the best LLMs for their needs, whether for development or production. If you missed our previous deep dive into 10B to 34B parameter models and which library to choose and why, you can catch up here.
This combination consistently ranks highest in tokens/sec, particularly with medium to high input and output token counts. It's ideal for developers focused on maximizing throughput, making it the top choice for high-performance applications.
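For readers who want to reproduce a tokens/sec figure on their own setup, here is a minimal sketch of how such a number can be computed. It is not the harness used for these benchmarks: `measure_tokens_per_second`, the `generate` callable, and its `(prompt, max_new_tokens)` signature are hypothetical stand-ins for whatever client your inference engine exposes.

```python
import time
from typing import Callable, List


def tokens_per_second(num_output_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_output_tokens / elapsed_seconds


def measure_tokens_per_second(
    generate: Callable[[str, int], List[int]],  # hypothetical engine wrapper
    prompt: str,
    max_new_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one generation call and report its output tokens/sec.

    Assumes `generate` returns only the newly generated token IDs.
    """
    start = clock()
    output_tokens = generate(prompt, max_new_tokens)
    elapsed = clock() - start
    return tokens_per_second(len(output_tokens), elapsed)
```

In practice you would likely run a few warm-up requests first and average over several runs at each input/output length, since a single timed call can be noisy.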