I was doing various performance optimizations in Pcompress for the last few days and the xxHash piece caught my eye. This non-cryptographic hash implementation is one of the fastest 32 bit hashes available today. I use it in Pcompress to hash data chunks for deduplication. The hash value is used to insert the chunk-id into a hashtable for lookups. xxHash is already extremely fast working at 5.4 GB/s on a fast processor however it’s inner loop looked very interesting.
There are 4 independent compute streams consuming data into 4 separate accumulators which are then combined into a single hash value later. These 4 independent but identical compute streams look like a perfect fit for vectorizing using the SSE instructions on modern x86 processors. The SSE registers can hold multiple packed integers or floats and a single operation, for example addition, applied to two SSE registers affect all the package values in parallel. The image below shows this behavior diagrammatically when the SSE registers are populated with 4 32-bit quantities.
An SSE register is 128-bit and can perfectly hold four 32-bit packed integers. I had to use at least SSE 4.1 to get the parallel multiply of packed 32-bit values. The rotate (XXH_rotl32) operation is easily doable with 2 bitwise shifts and a bitwise or (as is done in xxHash original code as well). Without having to deal with inline assembly language we can fairly easily use the compiler SSE intrinsics. These appear to the programmer as functions but in fact they result in specific CPU instructions being emitted. SSE intrinsics also provide the benefit of letting the compiler handle optimizations like instruction scheduling. Using inline assembly leaves all that optimization burden to the programmer which is often tricky to get right except for short segments. In addition some of the intrinsic functions are composite intrinsics. They result in emitting a collection of CPU instructions encapsulating some commonly used operation (like _mm_set_epi32(…)).