Day-zero model performance optimization is a mix of experimentation, bug fixing, and benchmarking guided by intuition and experience. This writeup outlines the process we followed to achieve SOTA latency and throughput for GPT OSS 120B on NVIDIA GPUs at launch with the Baseten Inference Stack.
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make it as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new models.
By the end of launch day, we were the clear leader among providers running on NVIDIA GPUs in both latency and throughput, per public data from real-world usage on OpenRouter.
Optimizing performance on a new model is a substantial engineering challenge. Thanks to our flexible inference stack and the collective expertise of our model performance engineering team, we are able to roll out performance improvements by the hour on new models.
The first step is getting baseline inference running however possible. Running inference on a new model requires support at three levels: the inference framework, the hardware architecture, and the model server.
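As a rough illustration (not our production setup), a day-zero baseline run might look something like the sketch below using an off-the-shelf framework such as vLLM, assuming a build that already recognizes the gpt-oss architecture; the tensor parallelism and sampling settings are placeholders.

```python
# Minimal baseline inference sketch -- illustrative only, assuming a vLLM build
# with day-zero support for the gpt-oss architecture and enough GPU memory.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # Hugging Face model ID
    tensor_parallel_size=8,       # shard weights across 8 GPUs (placeholder)
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of speculative decoding."], params)
print(outputs[0].outputs[0].text)
```

A run like this isn't about speed yet; it confirms the weights load and the outputs are coherent, giving us a baseline to measure each subsequent optimization against.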