On AMD Zen 4 CPUs, the vaddps instruction (vectorized float addition) executes on floating-point ports 2 and 3. Meanwhile, the vfmadd132ps instruction (vectorized fused multiply-add, or FMA) executes on ports 0 and 1. FMA computes a*b+c, so setting the multiplier b to 1 turns it into a plain addition a+c. Because the two instructions issue to disjoint port pairs, interleaving regular additions with these "multiply-by-one" FMAs keeps all four ports busy and can drastically increase the throughput of addition-heavy numerical kernels.
Every few years, I revisit my old projects to look for improvements and laugh at my younger self. Great experience! Totally recommend it!
While refactoring the mess that was my ParallelReductionsBenchmark, I found an interesting optimization opportunity. The throughput of an AVX-512 kernel running at 355 GB/s on a single AMD Zen 4 core on AWS can be further improved to reach 500 GB/s! Sadly, this doesn’t apply to Intel: on recent Intel cores, vector addition executes on the same ports as FMA, so there are no spare ports to exploit.
In the x86 world, CPU instructions are broken down into micro-ops, then dispatched to specialized ports for integer, floating-point, or load/store tasks. The cryptic notation, such as 1*p15+1*p23, shows how many micro-ops go to which ports on a given micro-architecture. Here it means the instruction decodes into two micro-ops: one issued to port 1 or 5, and another to port 2 or 3.
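As a concrete reading of that notation, a hypothetical entry for an addition with a memory source operand might be annotated like this (the exact port assignment here is just the 1*p15+1*p23 example above, not a claim about any specific CPU):

```
vaddps ymm0, ymm0, [rsi]   ; memory-source add: 2 micro-ops total
;   1*p23 -> one load micro-op, goes to port 2 OR port 3
;   1*p15 -> one add micro-op, goes to port 1 OR port 5
```

The digits after each `p` list the alternative ports for that micro-op, and the multiplier before the `*` counts how many micro-ops of that kind the instruction produces.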