
The recent release of the functional MATH() dataset came with a paper and a Twitter thread that headlined with an impressive-sounding claim:

The thread goes on to explain that although many popular Large Language Models (LLMs) seem able to solve complex math problems, those abilities are significantly reduced when you make superficial changes to the problems. These small changes are meant to affect the numerical answer without changing the underlying reasoning needed to solve the problem. The poor results on the modified problems suggest that LLMs aren’t learning general techniques, but have instead memorised specific solutions. The authors call this difference in ability a “reasoning gap”.

Since LLMs are trained on massive quantities of data scraped from the Internet, this claim is entirely plausible. Math problems from test sets could appear anywhere online and leak into the training data. The reasoning capabilities of LLMs seem like a key forward step in AI development, so MATH() is addressing an important question. However, having looked at the data itself, I think there are some issues with the approach, and, for now, we should take these results with a grain of salt.

MATH() is a system for synthetically generating data, based on a static dataset called MATH. (As in the joke about the panda who eats, shoots and leaves, the extra punctuation makes the difference.) The original MATH dataset is a collection of maths problems with human-written solutions, including explanations of reasoning, intended for training and evaluating language models. To create MATH(), the authors wrote data generating code that used the static problems as templates. By replacing values in a MATH problem with random numbers, and then updating the resulting calculations, you can create endless new variations.
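The templating idea can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual generator: the problem text and arithmetic here are hypothetical, but the mechanism matches the description above, with numeric values sampled at random and the answer recomputed so the solution logic stays fixed.

```python
import random

def generate_variant(seed=None):
    """Generate one variation of a hypothetical templated math problem.

    In the spirit of MATH(): the problem statement is a template with
    randomised numeric values, and the answer is recomputed from those
    values so only the numbers change, not the reasoning required.
    """
    rng = random.Random(seed)
    apples_per_box = rng.randint(2, 9)
    boxes = rng.randint(10, 99)
    problem = (
        f"If each box holds {apples_per_box} apples, "
        f"how many apples are in {boxes} boxes?"
    )
    answer = apples_per_box * boxes  # solution logic is fixed in the template
    return problem, answer
```

Seeding the generator makes each variant reproducible, while different seeds yield endless new problems that share one underlying solution method.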
