TL;DR: Traditional foundation model training approaches require substantial manual interaction, provide little signal for improvement, and suffer from slow iteration times. To resolve these issues, we built the Model Factory, poolside’s internal systems framework for quickly training, scaling, and experimenting with novel foundation models. In this post, we share our methodology and reasoning for building the Model Factory.
The world of AI is in the midst of a Cambrian explosion: new models, ideas, and techniques surface daily, and keeping up has become a full-time job. New ideas are developed based on intuition, deployed on supercomputer-scale clusters, and pushed into production long before the associated theory has caught up. This pace of advancement presents a serious challenge to companies pursuing AGI: how do we scale our ability to evaluate, adopt, and deploy these innovations fast enough to stay ahead?
Organizations typically scale along two axes: people and engineering. Hiring top-tier engineers and researchers can push the boundaries of what is possible, but additional engineers scale productivity linearly at best. The pace of AI, by comparison, is growing at least exponentially; linear growth just will not cut it. In other words, hiring ever more engineers and researchers would not allow us to stay ahead without substantial productivity improvements. This requirement clashes with the traditional, linear approach to training models, whereby substantial amounts of engineering time were typically dedicated to manually handling training runs.