The AI world is buzzing with the power of large generative neural networks such as ChatGPT, Stable Diffusion, and more. These models are capable of impressive performance on a wide range of tasks, but due to their size and complexity, only a handful of organizations have the ability to train them. As a consequence, access to these models can be restricted by the organization that owns them, and users have no control over the data the model has seen during training.
That’s how we can help: at MosaicML, we make it easier to train large models efficiently, enabling more organizations to train their own models on their own data. As shown in a previous blog post, our StreamingDataset library, our training framework Composer, and our MosaicML Cloud platform significantly simplify the process of training large language models (LLMs). For this blog post, we used that same process to measure the time and cost to train a Stable Diffusion model from scratch. We estimated an upper-bound of 79,000 A100-hours to train Stable Diffusion v2 base in 13 days on our MosaicML Cloud platform, corresponding to a total training cost of less than $160,000. This is a 2.5x reduction in the time and cost reported in the model card from Stability AI. In addition to saving time and money, our Streaming, Composer, and MosaicML Cloud tools make it dead-simple to set up and scale Stable Diffusion training across hundreds of GPUs without any additional effort. The code we used for this experiment is open-source and ready to run; check it out here! And if you’re interested in training diffusion models yourself on the MosaicML Cloud, contact us for a demo.
Table 1 and Figure 1 below illustrate how the Stable Diffusion V2 base training time and cost estimates vary by the number of GPUs used. Our final estimate for 256 A100s is 12.83 days to train with a cost of $160,000, a 2.5x reduction in the time and cost reported in the Stable Diffusion model card. These estimates were calculated using measured throughput and assumed training on 2.9 billion samples. Throughput was measured by training on 512x512 resolution images and captions with a max tokenized length of 77. We scaled GPUs from 8 to 128 NVIDIA 40GB A100s, then extrapolated throughput to 256 A100s based on these measurements.