Dan’s Weekly AI Speech and Language Scoop #32


OpenAI released an LLM benchmark called SimpleQA [csv of questions]. It is far from simple. This is pub trivia on steroids. I suspect I could reason my way to directionally correct answers for about 10% of the questions, and for at least 25% of them I don't even fully understand what the question is referring to.

The best-performing model, o1-preview, answers only 42.7% of the questions correctly, while the worst, Claude-3-haiku, scores just 5.1%. Bad, right? Not so fast. Claude-3-haiku may answer only 5.1% correctly, but it also answers only 19.6% incorrectly, passing on the remaining 75.3%. o1-preview misses 48.1%, so when it does venture an answer it is wrong more often than right, worse than a coin flip.
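If you want to sanity-check that coin-flip claim, here is a quick sketch of the arithmetic using only the correct/incorrect percentages quoted above; the pass rates and the accuracy-when-attempting figures are derived here, not quoted from OpenAI's write-up.

```python
# Back-of-the-envelope check of the "worse than a coin flip" point,
# based on the correct/incorrect percentages quoted above.
scores = {
    "o1-preview": {"correct": 42.7, "incorrect": 48.1},
    "claude-3-haiku": {"correct": 5.1, "incorrect": 19.6},
}

for model, s in scores.items():
    attempted = s["correct"] + s["incorrect"]          # questions the model answered
    passed = 100 - attempted                            # questions it declined
    acc_when_attempting = s["correct"] / attempted      # accuracy on attempted questions
    print(f"{model}: attempted {attempted:.1f}%, passed {passed:.1f}%, "
          f"accuracy when attempting {acc_when_attempting:.1%}")

# o1-preview:      attempted 90.8%, passed  9.2%, accuracy when attempting 47.0%
# claude-3-haiku:  attempted 24.7%, passed 75.3%, accuracy when attempting 20.6%
```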

Which would you prefer as a user? I personally would rather have Haiku. When I am interacting with an LLM, especially over audio, I want the answer to be correct. I don't want the additional mental overhead of having to double-check roughly half of the answers. I want the model to know that the answers to these questions are probably not lurking in its parametric memory and to seek out the answer for me.

And existing LLM-based products for the most part do just this. From sampling a half-dozen questions, Meta AI, ChatGPT, and Gemini all score nearly perfectly on this benchmark with their search grounding (tragically, Claude does not offer this yet; I'm surprised that they are releasing agents but still haven't built search into their consumer product).
