We enabled serving 100x larger recommender models at Pinterest by transitioning our machine learning serving from CPU to GPU — increasing Homefeed Pinner engagement by 16% through a step function improvement in model quality. In this blog post, we’ll share our optimizations to achieve this at neutral cost and latency, including optimizing individual ops, consolidating memory transfers, executing static graphs on-device through CUDA Graphs, and rethinking our distributed system setup.
Pinterest’s mission is to bring everyone the inspiration to create a life they love. To make our mission a reality, one of the key components in all of our product surfaces are various recommender models whose jobs are to predict the right content to show to the right person at the right time. Our recommender models are machine learning models that we trained using advanced algorithms to understand Pinners’ behavior as they spend time on our app. We serve our recommender models using our in-house machine learning model server (Scorpion Model Server, or SMS).
The technical challenges that we deal with for SMS are very difficult as it has to provide 400+ million Pinners relevant recommendations from a corpus of 300+ billion Pins in milliseconds. SMS performs machine learning inference on CPU and is heavily optimized over the years to fit our stringent latency and infrastructure cost requirements. We were pretty much at the limit of what SMS could do even with the latest generation of CPUs and had to be mindful that our modeling changes justify every latency and infrastructure cost increase.