It’s a tale as old as time -- from the early days of Caffe to the latest frameworks such as JAX, CUDA Out of Memory (OOM) errors have always been with us. With ever-growing model sizes, and growing heterogeneity in hardware with different memory limits, making sure your model does not OOM means tinkering until you find the magic combination of batch size, gradient accumulation steps, and number of devices.
Now, with Composer, you’ll rarely have to worry about CUDA out of memory exceptions again. Introducing automatic gradient accumulation:
a simple but powerful feature that automatically catches CUDA Out of Memory errors during training and dynamically adjusts the number of gradient accumulation steps to stay within your hardware’s available memory.
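To make the idea concrete, here is a minimal sketch of the catch-and-retry loop -- not Composer’s actual implementation, and with a simulated OOM error and a hypothetical `MEMORY_LIMIT` standing in for real device memory. On an OOM, the number of gradient accumulation steps doubles (shrinking each microbatch) and the step is retried until it fits:

```python
# Illustrative sketch (not Composer's actual implementation): on a simulated
# out-of-memory error, double the number of gradient accumulation steps --
# shrinking the per-step microbatch -- and retry until the step fits.

class SimulatedOOM(RuntimeError):
    """Stands in for a CUDA out-of-memory error in this sketch."""

MEMORY_LIMIT = 16  # hypothetical: largest microbatch the device can hold

def run_microbatches(batch_size, grad_accum):
    """Pretend to run one training step split into grad_accum microbatches."""
    microbatch = batch_size // grad_accum
    if microbatch > MEMORY_LIMIT:
        raise SimulatedOOM(f"microbatch of {microbatch} does not fit")
    return microbatch

def train_step(batch_size, grad_accum=1):
    # Retry with doubled accumulation until the microbatch fits; give up
    # once accumulation already splits the batch into single examples.
    while True:
        try:
            return run_microbatches(batch_size, grad_accum), grad_accum
        except SimulatedOOM:
            if grad_accum >= batch_size:
                raise  # even a microbatch of 1 does not fit: a real OOM
            grad_accum *= 2

microbatch, grad_accum = train_step(batch_size=64)
print(microbatch, grad_accum)  # -> 16 4: a batch of 64 settles at 4 steps
```

Because the adjustment happens at runtime, the same optimization batch size is preserved -- only how it is split into microbatches changes, which is why training results stay mathematically equivalent.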
With this simple flag, you can now train on any device and any number of devices, modify your batch size, or apply some of our algorithms -- all without fear of CUDA OOM. This obviates the tedious job of dialing in gradient accumulation settings through trial and error!