
If you’re interested in large language models, then you should care about their ability to do arithmetic. On the road to AGI, arithmetic problems provide a neat microcosm of more general multi-step reasoning problems. Here’s why:

Arithmetic problems can be solved by simple algorithms: rules that must be applied consistently. And you can arbitrarily increase the difficulty of a problem by increasing the number of times a rule must be applied.
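To make this concrete, here is a rough sketch of what "difficulty as number of rule applications" might look like in code. The article itself includes no code; the choice of Python, the function name, and the use of chained addition as the rule are all mine.

```python
import random

def make_problem(num_steps: int, digits: int = 2) -> tuple[str, int]:
    """Generate an arithmetic problem whose difficulty scales with
    num_steps: the number of times the addition rule must be applied."""
    terms = [random.randint(10 ** (digits - 1), 10 ** digits - 1)
             for _ in range(num_steps + 1)]
    expression = " + ".join(str(t) for t in terms)
    return expression, sum(terms)

# A 1-step problem vs. a 10-step problem: same rule, more applications.
easy, easy_answer = make_problem(num_steps=1)
hard, hard_answer = make_problem(num_steps=10)
print(easy, "=", easy_answer)
print(hard, "=", hard_answer)
```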

This means that arithmetic provides good windows into both single-step and multi-step (i.e. chain-of-thought) reasoning tasks. We can evaluate both individual calls to LLMs, as well as compositions of such calls.
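As a sketch of the distinction, evaluating a single call versus a composition of calls might look like the following. Here `ask` is a hypothetical stand-in for whatever LLM API you use, not a real library function.

```python
def ask(prompt: str) -> str:
    # Hypothetical placeholder: plug in your actual LLM client here.
    raise NotImplementedError

def solve_single_step(expression: str) -> str:
    # One call: the model must produce the final answer directly.
    return ask(f"Compute: {expression}. Answer with just the number.")

def solve_multi_step(terms: list[int]) -> str:
    # Composition of calls: apply the addition rule once per call,
    # mirroring a chain-of-thought style decomposition.
    running = str(terms[0])
    for t in terms[1:]:
        running = ask(f"Compute: {running} + {t}. Answer with just the number.")
    return running
```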

State-of-the-art LLMs still struggle with simple arithmetic problems, even as they have scaled up dramatically in size and in performance on standard evaluation benchmarks.

In this article, I try to summarize everything that we know about LLMs doing arithmetic. I’ll point you to all the interesting papers that I’m aware of, and draw some observations of my own.
