As part of the Brokk Power Ranking of coding models coming next week, we’re pleased to present the first independent numbers for GPT-OSS performance!
To put it in context, we’ve included the performance of the other recent open model releases, as well as o4-mini and Gemini Flash 2.0 as known-quantity comparisons.
“Roughly as good as Flash 2.0” is a disappointing result, but let’s put it in context: as a 120 billion parameter model quantized to FP4, it’s roughly 1/16 the size of Qwen 3 Coder, DeepSeek-V3, or Kimi K2. Alas, it seems that size still matters: GPT-OSS has a ton of trouble generating valid edit blocks, which makes it hard to get anything across the finish line.
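To make the failure mode concrete, here is a minimal sketch of what validating an edit block can look like. This assumes an aider-style SEARCH/REPLACE format as a hypothetical stand-in (Brokk’s actual format is not shown here); the point is that all the markers must appear exactly once, in order, or the edit can’t be applied.

```python
# Hypothetical validator for aider-style SEARCH/REPLACE edit blocks.
# Weaker models often emit the markers mismatched, truncated, or out of
# order, which makes the proposed edit unusable.
MARKERS = ("<<<<<<< SEARCH", "=======", ">>>>>>> REPLACE")

def is_valid_edit_block(text: str) -> bool:
    """Return True if each marker appears exactly once, in order."""
    positions = []
    for marker in MARKERS:
        idx = text.find(marker)
        if idx == -1 or text.count(marker) != 1:
            return False
        positions.append(idx)
    return positions == sorted(positions)

good = "<<<<<<< SEARCH\nold line\n=======\nnew line\n>>>>>>> REPLACE"
bad = "<<<<<<< SEARCH\nold line\n>>>>>>> REPLACE"  # missing divider

print(is_valid_edit_block(good))  # True
print(is_valid_edit_block(bad))   # False
```

A model that frequently drops the divider or garbles the markers fails this kind of check over and over, so even plausible-looking patches never land.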
Kimi K2 doesn’t have that excuse and it’s still bad. You may remember Kimi showcasing K2 handily beating DeepSeek-V3 at other coding benchmarks. When the discrepancy between performance on an older benchmark and a new one is this large, it’s hard to avoid the conclusion that Kimi trained K2 against the test.
By contrast, Qwen 3 Coder (480B, unquantized) is the real deal and finally dethrones DeepSeek-V3 as the best non-thinking model for coding.