When an LLM is trained to be good at one task, will it also get better at other tasks? For example, if an LLM is trained to be good at formal logic, will it also be good at making informal logical arguments?
The most obvious way to investigate these questions would be to train several models on different training sets and test whether models trained for tasks in one domain can transfer skills to another. However, this would involve training many models, and would realistically mean training relatively small ones. Instead, I investigated these questions observationally by studying the population of large models that have been released to the public on platforms like HuggingFace.
The core question here is how performance on one task relates to performance on other tasks. For its LLM leaderboard, HuggingFace runs about 50 evaluations ("evals") on its library of over 1000 models, which makes it possible to study how performance on one eval relates to performance on another across that whole population of models.
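As a rough illustration of what relating one eval to another can look like, here is a minimal sketch that computes pairwise Pearson correlations between evals across models. It is not necessarily the analysis used in this essay: the function name `eval_correlations` is hypothetical, the score table is made up for illustration, and correlation is just one assumed way to measure the relationship.

```python
import numpy as np

def eval_correlations(scores):
    """Return the Pearson correlation matrix between evals.

    scores: array of shape (n_models, n_evals), one row per model,
            one column per eval.
    """
    scores = np.asarray(scores, dtype=float)
    # rowvar=False treats each column (eval) as a variable and
    # each row (model) as an observation.
    return np.corrcoef(scores, rowvar=False)

# Made-up scores for 5 hypothetical models on 3 hypothetical evals.
# These numbers are illustrative only, not leaderboard data.
scores = np.array([
    [0.30, 0.35, 0.80],
    [0.45, 0.50, 0.60],
    [0.55, 0.58, 0.65],
    [0.70, 0.72, 0.40],
    [0.85, 0.90, 0.30],
])

corr = eval_correlations(scores)
print(np.round(corr, 2))
```

Entry `corr[i, j]` is the correlation between eval `i` and eval `j` across models. A value near 1 means models that do well on one eval tend to do well on the other; a value near 0 means the two evals are largely unrelated in this population.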