FLUX.1 from Black Forest Labs is an amazing image generation model, but at a 23GB file size (for [dev]), just getting the weights onto the GPU can be a big hassle. Loading those 23GB on our AWS EC2 machine can take close to a full minute.
This problem is sometimes referred to as the cold start problem. We at Outerport set out to solve cold starts, because waiting an entire minute to try out a new model, or to serve infrequent traffic to an internal image generation tool, is annoying.
Under the hood, a load call through Outerport communicates with a tensor memory management daemon, built in Rust for performance. We optimize every step of the process, so even mundane operations like reading weights from storage and CPU-to-GPU memory transfers are 2-4x faster than their naive PyTorch counterparts.
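For reference, the naive PyTorch path that this is compared against looks roughly like the sketch below (assuming the weights live in a local safetensors file): a blocking read of the full checkpoint from storage into CPU memory, followed by a tensor-by-tensor copy to the GPU.

```python
from safetensors.torch import load_file

# Naive baseline: read the entire checkpoint from disk into CPU RAM, then
# copy every tensor to the GPU. Both steps sit on the critical path every
# time a fresh process wants to run the model.
state_dict = load_file("flux1-dev.safetensors")                # storage -> CPU
state_dict = {k: v.to("cuda") for k, v in state_dict.items()}  # CPU -> GPU
```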
The best thing about this is that it's persistent: the daemon outlives any single process, so the model weights can already be sitting in CPU memory, ready to be loaded into the GPU immediately by the next process that needs them.
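As a rough sketch of the mechanism (simplified, and not Outerport's actual implementation; the shared memory name below is made up), a long-lived daemon can publish a weight tensor's bytes into named shared CPU memory, and any later process can attach to that region by name and upload it straight to the GPU without touching storage:

```python
import numpy as np
import torch
from multiprocessing import shared_memory

# Simplified illustration, not Outerport's code. The daemon publishes a
# weight tensor into named shared CPU memory once...
weights = torch.randn(4096, 4096)  # stand-in for one real weight tensor
shm = shared_memory.SharedMemory(
    create=True, name="flux_w0", size=weights.numel() * weights.element_size()
)
np.ndarray(weights.shape, dtype=np.float32, buffer=shm.buf)[:] = weights.numpy()

# ...and any later process can attach by name and go straight to the GPU:
# no disk read, just a single host-to-device copy. (A real implementation
# would also pin this memory for faster DMA transfers.)
attached = shared_memory.SharedMemory(name="flux_w0")
cpu_view = torch.from_numpy(
    np.ndarray((4096, 4096), dtype=np.float32, buffer=attached.buf)
)
gpu_tensor = cpu_view.to("cuda")
# Cleanup (attached.close(), shm.close(), shm.unlink()) omitted for brevity.
```

Because the daemon owns that shared region, the cache survives any individual client process exiting, which is what makes the warm load path possible even for a brand-new process.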