
CPU Ports & Latency Hiding on x86

submitted by
Style Pass
2025-01-20 16:30:09

On AMD CPUs, the vaddps instruction (vectorized float addition) executes on ports 2 and 3. Meanwhile, the vfmadd132ps instruction (vectorized fused float multiply-add, or FMA) executes on ports 0 and 1. Since an FMA computes a·b+c and reduces to a simple addition when the multiplier is 1, we can route additions to the FMA ports as well and drastically increase the throughput of addition-heavy numerical kernels.

Every few years, I revisit my old projects to look for improvements and laugh at my younger self. Great experience! Totally recommend it!

While refactoring the mess that was my ParallelReductionsBenchmark, I found an interesting optimization opportunity. The throughput of an AVX-512 kernel running at 355 GB/s on a single AMD Zen 4 core on AWS can be further improved to reach 500 GB/s! Sadly, this doesn’t apply to Intel.

In the x86 world, CPU instructions are broken down into micro-ops, then dispatched to specialized ports for integer, floating-point, or load/store tasks. The cryptic notation found in instruction tables, such as 1*p15+1*p23, shows how many micro-ops go to which ports on a given micro-architecture: here, the instruction splits into two micro-ops, one dispatched to port 1 or 5 and another to port 2 or 3.
