Chances are, you rely on them every day, but you may not realize that they are some of the most complex operations to accelerate with SIMD instruction

Over-engineering 5x Faster Set Intersections in SVE2, AVX-512, & NEON

submited by
Style Pass
2024-09-16 16:00:02

Chances are, you rely on them every day, but you may not realize that they are some of the most complex operations to accelerate with SIMD instructions. SIMD instructions make up the majority of modern Assembly instruction sets on x86 and Arm. They can yield 10x speedups, but due to their complexity, they are almost never used in production codebases.

Meet SimSIMD, which achieves up to 5x faster set intersections for sorted arrays of unique u16 and u32 values. With SimSIMD v5.3, we’ve broken new ground: it was one of the first libraries to support Arm’s Scalable Vector Extensions (SVE) and now leads the charge on SVE2.

This latest release uses instructions like HISTCNT and MATCH and compares their performance to Intel’s VP2INTERSECT in AVX-512, along with simpler methods like Galloping. These instructions are already live on AWS Graviton 4 CPUs and will soon power Nvidia’s Grace Hopper, Microsoft Cobalt, and Google Axios. So, let’s dive into the details and see how more software can get those 5x speedups.

Let’s say you have two sets of unique integers, like identifiers of different documents or tokens. You’d often store them in the form of a sorted array and intersected with:

Leave a Comment