You might be leaving up to 60% in performance gains on the table by using the wrong response model. Response models have a massive impact on performance with both Claude and GPT-4o, regardless of whether you're using JSON mode or tool calling.
Choosing the right response model can help ensure your model responds in the correct language or avoids hallucinating video timestamps during extraction.
We used OpenAI's GSM8k dataset to benchmark model performance. This dataset challenges LLMs to solve simple math problems that require multiple steps of reasoning. Here's an example:
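(The record below is an illustrative, paraphrased GSM8k-style item rather than a verbatim quote from the dataset; the key detail is that every raw answer ends with a `#### <final answer>` line.)

```python
example = {
    "question": (
        "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and uses 4 "
        "to bake muffins, then sells the rest at the farmers' market for $2 per egg. "
        "How much does she make every day?"
    ),
    "answer": (
        "She has 16 - 3 - 4 = 9 eggs left to sell.\n"
        "She makes 9 * 2 = $18 per day.\n"
        "#### 18"
    ),
}
```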
The original dataset bundles the reasoning steps together with the final answer. We stripped each record down to the bare essentials: the question, the final answer, and the reasoning separated out into its own field. We used the following code to process the data:
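(A minimal sketch of that step, assuming the Hugging Face `gsm8k` dataset with the `main` config, whose raw `answer` field contains the reasoning followed by a `#### <number>` line.)

```python
from datasets import load_dataset


def split_record(example: dict) -> dict:
    """Split a raw GSM8k record into question, reasoning, and final answer."""
    # GSM8k answers hold the reasoning steps, then a '#### <number>' line.
    reasoning, _, final_answer = example["answer"].partition("####")
    return {
        "question": example["question"],
        "reasoning": reasoning.strip(),
        "answer": final_answer.strip(),
    }


# Load the test split and keep only the three fields we care about.
gsm8k = load_dataset("gsm8k", "main", split="test")
processed = gsm8k.map(split_record, remove_columns=gsm8k.column_names)
```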
This lets us test how changes to the response format, the response model, and even the chosen model itself affect the model's reasoning ability.
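For example, two candidate response models might differ only in whether they ask for an explicit reasoning field before the final answer (a hypothetical sketch using Pydantic; the class and field names are illustrative):

```python
from pydantic import BaseModel


class AnswerOnly(BaseModel):
    answer: int


class AnswerWithReasoning(BaseModel):
    # Requesting the reasoning steps before the final answer changes
    # how the model spends tokens working through the problem.
    reasoning: str
    answer: int
```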