One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
The eponymously bitter take-away of Richard Sutton's essay has often been misconstrued: because scale is all you need, the argument goes, smaller models are doomed to irrelevance. The rapid growth of model sizes past one trillion parameters, together with the hard limits of GPU memory, seemed to confine economical frontier intelligence to an oligopoly of intelligence-as-a-service providers. Open models and self-serve inference were in retreat.
But as the quote above indicates, there are in fact two arrows in the scaling quiver: learning and search. Learning, as we do it now with neural networks, scales with memory at inference time: larger models perform better, ceteris paribus, because they can distill more of their training data into more circuits and more templates. Search scales smoothly with compute at inference time: compute that can be spent either on producing higher-quality candidates or on producing more of them. In the ideal case, the scaling behavior can be predicted by so-called scaling laws.
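To make the "more candidates" axis concrete, here is a minimal best-of-N sampling sketch. It assumes hypothetical stand-ins, `generate_candidate` for a model's stochastic sampler and `score` for a verifier or reward model; neither names any particular library's API.

```python
import random


def generate_candidate(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for one stochastic sample from a model.
    return f"{prompt} -> draft #{rng.randint(0, 999_999)}"


def score(candidate: str) -> float:
    # Hypothetical stand-in for a verifier or reward model rating a candidate.
    return (hash(candidate) % 1000) / 1000.0


def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    # Spending more inference compute (larger n) searches over more candidates;
    # the best-scoring one is kept.
    rng = random.Random(seed)
    candidates = (generate_candidate(prompt, rng) for _ in range(n))
    return max(candidates, key=score)


if __name__ == "__main__":
    # Doubling n roughly doubles inference compute; the quality of the returned
    # candidate tends to improve smoothly as n grows.
    print(best_of_n("Prove the lemma.", n=16))
```

The same compute-for-quality trade shows up in majority voting, beam search, and tree search; only the candidate generator and the scoring rule change.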