
Evaluating code-to-readme generation using LLMs


We introduce Generate README Eval, a new benchmark to evaluate how well LLMs can generate README files from entire code repositories. Our results show that Gemini-1.5 is the state-of-the-art model on this benchmark.

In this blog post, we explore how to evaluate LLMs on generating a README.md file from an entire source code repository. A README file is a popular way for developers to introduce users to their project and is widely used in open source projects on GitHub. We have created a new benchmark and dataset for README file generation called Generate README Eval.

We curate our dataset from open source projects on GitHub. We scan the top 1000 Python projects on GitHub that have over 1000 stars and 100 forks, then keep those that have a README.md file and whose repository content fits within 100k tokens. This ensures that we can prompt the LLM in a single call with the content of the repository and ask it to generate the README.md file. Most frontier LLMs (from Google, Anthropic, OpenAI, and Meta's Llama-3.1) support a context window of at least 128k tokens, so keeping repositories under 100k tokens should allow us to benchmark all of them.
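To make the filtering step concrete, here is a minimal sketch of how the repository selection and token budgeting could be done. The GitHub search query, the helper names, and the use of tiktoken as a stand-in tokenizer are assumptions for illustration; the actual dataset script may differ.

```python
# Hypothetical sketch of the repository filtering step, not the exact dataset script.
import requests
import tiktoken

GITHUB_API = "https://api.github.com/search/repositories"
MAX_TOKENS = 100_000

# cl100k_base is used here only to approximate repository size in tokens.
encoder = tiktoken.get_encoding("cl100k_base")

def find_candidate_repos(pages=10, per_page=100):
    """Search for popular Python repositories (>1000 stars, >100 forks)."""
    repos = []
    for page in range(1, pages + 1):
        resp = requests.get(
            GITHUB_API,
            params={
                "q": "language:python stars:>1000 forks:>100",
                "sort": "stars",
                "order": "desc",
                "per_page": per_page,
                "page": page,
            },
        )
        resp.raise_for_status()
        repos.extend(resp.json()["items"])
    return repos

def repo_fits_budget(file_contents: dict[str, str]) -> bool:
    """Check that the concatenated repository content stays under 100k tokens."""
    total = sum(len(encoder.encode(text)) for text in file_contents.values())
    return total <= MAX_TOKENS
```

Repositories that pass both checks (a README.md exists and the content fits the 100k-token budget) can then be serialized into a single prompt for the model.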

The script that generates the dataset is available here. We curated a total of 198 repositories and READMEs. They were then randomly divided into the train (138) and test (40) splits of the dataset. The train split is meant for fine-tuning and further analysis, while the test split is meant to be used for evaluation.
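For reference, the split can be reproduced with a few lines of Python; this is a minimal sketch assuming the curated (repository, README) pairs are held in a list, and the seed and exact mechanics are illustrative rather than the script's actual values.

```python
# Illustrative sketch of a random train/test split over the curated examples.
import random

def split_dataset(examples, test_size=40, seed=42):
    """Shuffle the curated examples and carve out a fixed-size test split."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the original ordering is preserved
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)

# Usage (curated_examples is the list of curated repository/README pairs):
# train, test = split_dataset(curated_examples)
```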
