Detecting Image Similarity in (Near) Real-time Using Apache Flink

Submitted by
Style Pass
2021-06-29 21:00:12

Pinterest is a visual platform at its core, so the need to understand and act on images is paramount. A couple of years ago, the Content Quality team designed and implemented our own batch pipeline to detect similar images. The resulting similarity signal is widely used at Pinterest, for use cases ranging from improving recommendations based on similar images to taking down spam and abusive content. However, computing the signal for newly created images took several hours, which left a long window for spammers and abusers to harm the platform. So recently, the team implemented a streaming pipeline that detects similar images in near real time.

Given the platform’s scale, identifying duplicate images is difficult, and doing so in real time is even more challenging. This blog post focuses on the Content Quality team's recent work using Apache Flink to detect duplicate images in (near) real time.

The project’s goal was to cut latency from the hours the batch pipeline takes to under a second, without compromising accuracy or coverage.
