Benchmarking division and libdivide on Apple M1 and Intel AVX512


libdivide is fish’s library for speeding up integer division by runtime constants. It works by replacing divide instructions with cheaper multiplications and shifts.
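For scalar code the replacement looks roughly like this (a minimal sketch using libdivide's C++ divider type; the divider is built once at runtime, and each division then compiles to a multiply and a shift):

    #include <cstdint>
    #include "libdivide.h"

    uint64_t divide(uint64_t numer, uint64_t denom) {
        // denom is only known at runtime; precompute its magic number and shift once...
        libdivide::divider<uint64_t> fast_d(denom);
        // ...then the division expands to a multiply and a shift, with no divide instruction.
        return numer / fast_d;
    }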

libdivide supports SIMD, and recently gained support for AVX-512 and ARM NEON. Let’s benchmark on an Intel Xeon and Apple’s M1.
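The benchmark loop is not reproduced in this excerpt; a plausible sum-of-quotients sketch (names hypothetical) of the loop the timings below refer to:

    #include <cstddef>
    #include <cstdint>

    // Baseline: one hardware divide per element. The libdivide variants replace
    // the division with a precomputed divider (scalar or SIMD) and are timed
    // against this loop.
    uint64_t sum_quotients(const uint64_t *vals, size_t count, uint64_t d) {
        uint64_t sum = 0;
        for (size_t i = 0; i < count; i++)
            sum += vals[i] / d;
        return sum;
    }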

The times below measure the loop above with d=7 (a “slow case”) and count=524k. The loop is repeated 30 times, and the best (fastest) result is kept.

AVX512 is twice the width of AVX2 but only marginally faster. Speculatively, AVX512 processes multiplies serially, one 256-bit lane at a time, losing half its parallelism.

libdivide must use 32x32->64 multiplies (mul_epu32) for vectorized 64-bit division, incurring a 4x penalty. We would expect to break even at AVX2, which has 4x parallelism; instead AVX2 is faster than scalar, perhaps because a 32-bit multiply is faster than a 64-bit one.
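To see where the 4x comes from: without a vector 64x64 multiply, the high half of a 64-bit product must be assembled from four 32x32->64 partial products. A scalar sketch of that decomposition (a vectorized path does the same work with mul_epu32 plus shuffles and adds):

    #include <cstdint>

    // High 64 bits of a 64x64-bit product, built from four 32x32->64 multiplies.
    uint64_t mulhi_u64(uint64_t a, uint64_t b) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t lo_lo = a_lo * b_lo;   // partial product 1
        uint64_t hi_lo = a_hi * b_lo;   // partial product 2
        uint64_t lo_hi = a_lo * b_hi;   // partial product 3
        uint64_t hi_hi = a_hi * b_hi;   // partial product 4

        // Fold the middle terms together, carrying into the high half.
        uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
        return hi_hi + (hi_lo >> 32) + (cross >> 32);
    }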

The hardware division time of 0.624 nanoseconds is almost exactly two cycles at 3.2 GHz (one cycle is 0.3125 ns, so two cycles are 0.625 ns), suggesting a throughput of one divide every 2 clock cycles. That is remarkable: hardware dividers are typically variable-latency and hard to pipeline, so a throughput of 30+ clock cycles per divide is common. Does the M1 have more than one integer division unit per core?
