Evaluating LLMs on COBOL

Submitted by Style Pass, 2024-03-30

LLMs are rapidly changing the way we write software. Over a million developers now pay for GitHub Copilot, and recent breakthroughs in LLM reasoning have brought the dream of a fully AI Software Engineer closer to reality. But while it’s not hard to find a demo of an LLM coding a website or a clone of Flappy Bird, little is known about their ability to write code in older ‘legacy’ languages like COBOL.

The opportunity for LLM COBOL generation is huge. Although the language was first released in 1959, it continues to power critical systems: 95% of US ATM transactions are processed in COBOL. But it's not taught in computer science courses or bootcamps, and the engineers who write it professionally are steadily retiring. If LLMs could understand and write COBOL, they could help maintain the 800 billion lines still in production today.

Today we’re releasing COBOLEval, the first evaluation benchmark for LLM code completions in COBOL. It consists of 146 challenging coding problems that have been converted into COBOL from the widely-used HumanEval Python generation benchmark. Each problem is paired with an average of 6 test cases. An LLM-generated solution has to pass all of them to be correct. We’re also releasing a test harness that you can use to evaluate your own models, as well as mAInframer-1 - a series of open-source models based on CodeLlama that we’ve fine-tuned specifically to write COBOL - which outperform GPT-4.
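COBOLEval inherits HumanEval's all-tests-must-pass criterion, and benchmarks in the HumanEval family conventionally report results as pass@k. As an illustration (the post doesn't specify exactly how the harness aggregates results), here is a sketch of the standard unbiased pass@k estimator from the original HumanEval work:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate, given n generated samples
    per problem of which c passed all test cases.

    pass@k = 1 - C(n - c, k) / C(n, k)
    i.e. one minus the probability that a random draw of k
    samples contains no correct solution.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k
        # samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 correct.
# pass@1 is then simply the fraction of correct samples.
print(pass_at_k(10, 3, 1))  # 0.3
```

A harness would generate n completions per problem, compile and run each against the problem's test cases (on average 6 per problem here), count the passes c, and average the resulting pass@k across all 146 problems.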
