Day-zero model performance optimization is a mix of experimentation, bug fixing, and benchmarking guided by intuition and experience. This writeup outlines the process we followed to achieve SOTA latency and throughput for GPT OSS 120B on NVIDIA GPUs at launch with the Baseten Inference Stack.
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make it as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new models.
By the end of launch day, we were the clear leader among providers running on NVIDIA GPUs in both latency and throughput, per public data from real-world usage on OpenRouter.
Optimizing performance on a new model is a substantial engineering challenge. Thanks to our flexible inference stack and the collective expertise of our model performance engineering team, we are able to roll out performance improvements by the hour on new models.
The first step is getting baseline inference running however possible. Running inference on a new model requires support at three levels: the inference framework, the hardware architecture, and the model server.
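As a rough illustration (not our production setup), a day-zero baseline run might look something like the sketch below using an off-the-shelf framework such as vLLM, assuming a build that already recognizes the gpt-oss architecture; the tensor parallelism and sampling settings are placeholders.

```python
# Minimal baseline inference sketch -- illustrative only, assuming a vLLM build
# with day-zero support for the gpt-oss architecture and enough GPU memory.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # Hugging Face model ID
    tensor_parallel_size=8,       # shard weights across 8 GPUs (placeholder)
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of speculative decoding."], params)
print(outputs[0].outputs[0].text)
```

A run like this isn't about speed yet; it confirms the weights load and the outputs are coherent, giving us a baseline to measure each subsequent optimization against.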