Say What You Mean: A Response to 'Let Me Speak Freely'

Submitted by Style Pass
2024-11-24 08:00:02

A recent paper from the research team at Appier, Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, made some very serious accusations about the quality of LLM evaluation results when performing structured generation. The authors' (Tam et al.) ultimate conclusion was:

The source for this claim was three sets of evaluations they ran that purportedly showed worse performance for structured generation ("JSON-Mode" in the chart) compared with unstructured generation ("Natural Language"/NL in the chart). This chart (derived and rescaled from the original charts) shows the concerning performance:

We here at .txt have always seen structured generation outperform unstructured generation in our past experiments. Our past experiments were on problems with clear, LLM-compatible structure, but so were the tasks that Tam et al. focused on (in fact, we had already done a similar experiment with GSM8K using different models). So these results from Tam et al. were as surprising as they were concerning.
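To make the comparison concrete, here is a minimal sketch of what the two output formats typically look like on a task like GSM8K, and how a final answer might be scored under each. The example responses and extraction helpers below are hypothetical illustrations, not the paper's actual evaluation code.

```python
import json
import re

# "Natural language" output: the model reasons freely, so the final
# answer must be pulled out with a heuristic such as a regex.
nl_response = "First we add 3 and 4 to get 7, then double it. The answer is 14."

def extract_nl_answer(text):
    """Take the last number in the text as the final answer (a common heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

# "JSON-Mode" output: generation is constrained to a parseable object,
# so the final answer is read directly from a known field.
json_response = '{"reasoning": "3 + 4 = 7, then 7 * 2 = 14.", "answer": 14}'

def extract_json_answer(text):
    return float(json.loads(text)["answer"])

print(extract_nl_answer(nl_response))    # 14.0
print(extract_json_answer(json_response))  # 14.0
```

The point of the sketch is that the *extraction* step differs between the two formats; as we discuss below, differences in how answers are parsed and prompted can easily masquerade as differences in model capability.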

After revisiting the above tasks with the same model (Llama-3-8B-instruct) we found that our results did not match those found in the paper, and reflected what we have previously seen. Diving deeper into the data and source code for the paper, we have determined there are several critical issues that led the authors to a fundamentally flawed conclusion.
