Exploring LLM performance on the ARC training dataset

2024-11-15 17:00:09

TLDR: I tagged and described 200 of the 400 ARC training dataset tasks, merged them with evaluation data from several LLMs, ran some analysis, and put it all on a site so you can explore it.

ARC is a hard problem, and exploring the dataset first felt like the obvious starting point. The site originally started as a way for me to just explore and take notes on the problems, but that turned into building a taxonomy of tags and descriptions, plus a few other things I got carried away with.

You can explore the tagged problems here yourself, download the raw JSON file, or read on if you want more details on the dataset, how I went about this, and some basic analysis of it.
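If you haven't worked with ARC before, each task in the raw JSON is a set of "train" and "test" input/output grid pairs, where a grid is a list of rows of integers 0-9 (colors). Here is a minimal sketch of parsing one task; the record below is a made-up example, not an actual task from the dataset:

```python
import json

# A hypothetical, minimal ARC-style task record. Real tasks follow the
# same shape: "train" and "test" lists of {"input": grid, "output": grid},
# where a grid is a list of rows of integers 0-9.
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}
  ],
  "test": [
    {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]}
  ]
}
"""

task = json.loads(task_json)
n_train = len(task["train"])          # number of demonstration pairs
grid = task["train"][0]["input"]      # first demonstration input grid
print(n_train, len(grid), len(grid[0]))  # -> 1 2 2 (pairs, rows, cols)
```

The tagged dataset on the site layers tags and a free-text description on top of these task records, keyed by task ID.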

If you are interested in helping to complete the tagging and descriptions for the remaining 200 problems, please reach out to me on Twitter, or you can contribute directly to it here.

My main objective with this project was to understand where LLMs perform well [0] and where they perform poorly. Thankfully, there are public notebooks for GPT-4 and Claude Sonnet, as well as some of the other public leaderboard submissions such as Icecuber 2020.
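The core of the analysis is just grouping pass/fail results by tag. As a sketch, here is one way to compute per-tag solve rates once tags and model results are merged per task; the field names and the boolean results below are illustrative assumptions, not the site's actual schema or real scores:

```python
from collections import defaultdict

# Hypothetical merged records: each task carries its tags plus a
# solved/not-solved flag per model (values here are made up).
records = [
    {"task": "a", "tags": ["scaling", "tiling"], "gpt4": True,  "sonnet": False},
    {"task": "b", "tags": ["flood-fill"],        "gpt4": False, "sonnet": False},
    {"task": "c", "tags": ["tiling"],            "gpt4": True,  "sonnet": True},
]

def solve_rate_by_tag(records, model):
    """Fraction of tasks carrying each tag that the given model solved."""
    solved, total = defaultdict(int), defaultdict(int)
    for rec in records:
        for tag in rec["tags"]:
            total[tag] += 1
            solved[tag] += rec[model]
    return {tag: solved[tag] / total[tag] for tag in total}

print(solve_rate_by_tag(records, "gpt4"))
# tiling appears in two tasks, both solved -> rate 1.0; flood-fill -> 0.0
```

Since a task can carry several tags, it counts toward each of them; that is a deliberate choice so a tag's rate reflects every task it touches.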
