Are you smarter than Phi-2?

2024-12-25 18:00:05

As state-of-the-art large language models (LLMs) from companies like OpenAI, Anthropic, and Google continue to grow in size and capability, it's easy to overlook the progress smaller models have made in recent months.

Microsoft's Phi-2, for instance, is a 2.7B-parameter LLM that seems to punch far above its weight class. Its performance on the Massive Multitask Language Understanding (MMLU) benchmark, one of the most prominent LLM evaluations, is impressive for a model of its size.

Data from Stanford's Holistic Evaluation of Language Models (HELM), which evaluated Phi-2 (along with many other open- and closed-weight models) on the MMLU benchmark, shows that Phi-2 correctly answered 58.4% of all multiple-choice questions in the test set. These questions span 57 different categories, ranging from abstract algebra to high-school-level European history to professional medicine.

It's hard to contextualize what this number means, though. Without actually seeing questions from the MMLU dataset yourself, it's difficult to gauge just how good (or bad, for that matter) Phi-2's performance is.
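
One rough way to anchor the figure before looking at any questions is to compare it with random guessing. The sketch below assumes every MMLU question offers four answer choices, which the text above does not state, so it is an assumption here. The names `PHI2_ACCURACY`, `NUM_CHOICES`, and `chance_corrected` are illustrative and do not come from HELM.

```python
# Accuracy reported by HELM for Phi-2 on MMLU (figure taken from the text above).
PHI2_ACCURACY = 0.584

# Assumption: each question has four answer choices, so uniform
# random guessing is correct one time in four.
NUM_CHOICES = 4
CHANCE_ACCURACY = 1 / NUM_CHOICES


def chance_corrected(accuracy: float, chance: float) -> float:
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0."""
    return (accuracy - chance) / (1 - chance)


margin_over_chance = PHI2_ACCURACY - CHANCE_ACCURACY
corrected = chance_corrected(PHI2_ACCURACY, CHANCE_ACCURACY)

print(f"Chance baseline:    {CHANCE_ACCURACY:.1%}")
print(f"Margin over chance: {margin_over_chance:.1%}")
print(f"Chance-corrected:   {corrected:.1%}")
```

Under that assumption, Phi-2 sits about 33 percentage points above a coin-flip-style baseline. That still says nothing about how hard the individual questions are, which is why seeing them yourself matters.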