AI models that act as programming assistants still hallucinate a lot. Commercial models partly fabricate the contents of code packages in 5.2 percent of cases; for open-source models the figure is as high as 21.7 percent, according to research conducted by three American universities.
The scientists, from the University of Texas, the University of Oklahoma, and Virginia Tech, examined 16 LLMs widely used for code generation. Using the npm and PyPI package repositories, they generated 576,000 pieces of code in JavaScript and Python respectively.
They ran thirty tests, which yielded 2.23 million packages; almost twenty percent of these, 440,445 packages, involved hallucinations. Among those hallucinations, the LLMs made up 205,474 unique package names that did not exist at all in the repositories used.
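To make concrete what a package hallucination looks like: the generated code imports or installs a library that simply is not in the registry. The sketch below is only an illustration of that idea, not the researchers' own test pipeline; it queries PyPI's public JSON API to see whether a suggested package name actually exists, and the name hallucinated_example_pkg_1234 is a made-up, hallucination-style placeholder.

    import urllib.request
    import urllib.error

    def package_exists_on_pypi(name: str) -> bool:
        """Return True if the package name is registered on PyPI."""
        url = f"https://pypi.org/pypi/{name}/json"
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            # PyPI answers 404 for names that are not in the index.
            return False

    # "requests" is a real package; the second name is a hypothetical hallucinated one.
    for suggested in ["requests", "hallucinated_example_pkg_1234"]:
        status = "exists" if package_exists_on_pypi(suggested) else "does NOT exist"
        print(f"{suggested}: {status} on PyPI")

A developer, or an automated step in CI, can run this kind of lookup before installing whatever an assistant suggests.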
A bright spot is that, according to this research, the models hallucinated less than measured in an earlier study by Lasso Security. For GPT-4, the rate is 5.76 percent versus 24.2 percent; for GPT-3.5, it is 4.05 percent versus 22.22 percent. (In the paper, the Lasso figures are listed the other way around; here they are given correctly.)