I’m talking about AI progress in the past year again. Contrary to the expectations that GPT-4 set in 2023, research labs emphasized the “capabilities” of the models they released this year over their scale. One example is “long context,” the ability to effectively process longer inputs, which looks a lot like “memory.” Models that handle over a million tokens of context, like Gemini (February) and Claude (March), can find the timestamp of a movie scene from just a line drawing. Another capability is native “multimodality,” generally available since Gemini 1.5 (February), Claude 3 (March), and GPT-4o (May). Multimodal models input and output text, images, and audio interchangeably, a capability that we already take completely for granted. Sora (February) and Veo 2 (December) developed video as a nascent modality. We’re just discovering the right interfaces, in the absolutely magical Advanced Voice Mode (May) and Multimodal Live (December). A third capability is “reasoning,” the ability to use more resources at inference time to produce a better answer. The best examples are OpenAI’s o1 (September) and o3 (December), which performed as well as the best humans in the world in math and coding tasks. Designed to reason in a formal language, DeepMind’s AlphaProof (July) came one point short of gold in the International Mathematical Olympiad. Finally, “agency,” the ability to act in an environment, wrapped up the year of capabilities, with Anthropic’s computer use (October) and DeepMind’s Mariner (December), which I worked on.
By any account, this has been another stellar year of AI progress. The field even won two Nobel Prizes. So while the models didn’t seem to get much bigger, pronouncements about them did. Sam Altman claimed that we may be a few thousand days from superintelligence (September), and Dario Amodei posited infectious diseases, cancers, and mental illnesses as problems AI may soon eradicate, making a century of scientific progress in the next decade (October). Whether it’s Altman’s “Intelligence Age” or Amodei’s “compressed 21st century,” what gives them the confidence to make these predictions is capabilities. If scaling pre-training is flagging, a distinct new capability will pick up the slack. For example, reasoning promises to be orders of magnitude more compute-efficient, so reaching the next level of performance won’t be prohibitively expensive. Or, agents may impress not because they are fundamentally smarter, but because we “unhobbled” them to their full potential.