The results from LLM benchmarks present an apparent paradox: how can models achieve PhD-level performance yet often fail at seemingly straightforward tasks? And why do so many people fail to find them useful?
Even as a prolific user, I often find LLMs frustrating. It feels like I'm not using them quite right, and that if I could somehow improve my prompts I could solve whole problems in one shot rather than through the iterative approach I currently rely on.
The underlying issue is that LLMs have a very different skill profile from humans. Only by understanding their relative strengths and weaknesses can we use them effectively.
But there's no simple rule or shortcut to developing this understanding. Benchmarks map out only a small and biased fraction of their capabilities, so it is up to the user to uncover the rest.
In this post I set out my mental model of LLMs and the heuristics I use to figure out when and how to use them effectively. I finish with some conclusions about what this implies for how they may evolve in 2025.