At Brium, we’re dedicated to enabling ML applications on a diverse set of architectures and unlocking hardware capabilities through engineering choices made at every level of the stack: from model inference systems, through runtime systems and ML frameworks, to compilers.
In recent years, the hardware industry has made strides toward providing viable alternatives to NVIDIA hardware for server-side inference, driven by exponentially growing demand for computing power. Today, accelerators such as AMD’s Instinct GPUs offer strong performance characteristics, but harnessing that performance in practice remains a challenge. At Brium, we intend to enable efficient LLM inference on any hardware.
On the software side, long-context Large Language Model (LLM) inference has become crucial for applications ranging from video understanding to retrieval-augmented generation, code assistance, and even novel chain-of-thought approaches that enhance model accuracy.
In this post, we’ll compare Brium’s inference platform with popular inference serving solutions such as vLLM and SGLang on AMD’s MI210 and MI300-series GPUs, and show how Brium’s stack translates into improved throughput as well as improved latency. As usual with inference, lower latency increases application responsiveness, higher throughput tends to decrease the total cost of ownership (TCO) of an inference system, and a combination of both may unlock new AI applications.
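To make those two metrics concrete, here is a minimal sketch of how per-request latency and aggregate throughput can be measured against an OpenAI-compatible HTTP endpoint, which both vLLM and SGLang can expose. The endpoint URL, model name, and prompts below are placeholders, and the sketch assumes the server reports token counts in the OpenAI-style `usage` field; it is illustrative only, not the benchmarking harness used for the results in this post.

```python
# Illustrative sketch: measure mean request latency and generated-token
# throughput against an OpenAI-compatible completions endpoint.
# ENDPOINT, MODEL, and the prompts are placeholders (assumptions).
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"   # placeholder server address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"           # placeholder model name


def run_request(prompt: str, max_tokens: int = 256) -> tuple[float, int]:
    """Send one completion request; return (latency in seconds, tokens generated)."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    latency = time.perf_counter() - start
    # Assumes the server returns OpenAI-style usage accounting.
    tokens = resp.json()["usage"]["completion_tokens"]
    return latency, tokens


if __name__ == "__main__":
    prompts = ["Summarize the plot of Hamlet."] * 8   # toy workload
    t0 = time.perf_counter()
    results = [run_request(p) for p in prompts]
    wall = time.perf_counter() - t0

    total_tokens = sum(n for _, n in results)
    print(f"mean latency: {sum(l for l, _ in results) / len(results):.2f} s")
    print(f"throughput:   {total_tokens / wall:.1f} generated tokens/s")
```

A production benchmark would differ in the obvious ways: requests would be issued concurrently to exercise batching, responses would be streamed to separate time-to-first-token from end-to-end latency, and prompt lengths would be swept to cover the long-context regimes discussed above.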