Execution Units are Often Pipelined

Submitted by Style Pass on 2024-12-27 09:30:05

In the context of out-of-order microarchitectures, I was under the impression that execution units remain occupied until the µop they’re processing is complete. This is often not the case.

As an example, take the Firestorm microarchitecture in the A14 and M1. It has two integer execution units capable of executing multiplies, and each multiplication takes three cycles to complete.

As it turns out, many execution unit and µop combinations are heavily pipelined. This means that a µop can be issued to an execution unit while it’s still busy processing a different µop. So, on Firestorm, a run of independent multiplies overlaps inside the execution units: a new multiply can start before the earlier ones have finished.

In the limit, where the two execution units are constantly kept fed with multiplication µops, my original understanding would have predicted 1.5 cycles/instruction on average (three cycles per multiply, split across two units). In reality they can sustain 0.5 cycles/instruction: each execution unit can be fed a new multiplication µop every cycle, and there are two of them.
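The arithmetic above can be checked with a toy model. This is only a sketch, not a model of Firestorm itself: it assumes every multiply is independent (no data dependencies), that a µop is always ready to issue, and that a pipelined unit accepts exactly one new µop per cycle. The function name and its parameters are made up for illustration.

```python
from fractions import Fraction


def cycles_per_instruction(n_uops, n_units, latency, pipelined):
    """Issue n_uops independent µops and return the average cycles per µop.

    Toy model (assumption): a pipelined unit can accept a new µop every
    cycle, while a non-pipelined unit stays occupied for the full latency.
    """
    issue_interval = 1 if pipelined else latency
    next_free = [0] * n_units  # cycle at which each unit can accept a new µop
    finish = 0                 # cycle at which the last result is available

    for _ in range(n_uops):
        # Send the µop to whichever unit can accept it earliest.
        unit = min(range(n_units), key=lambda u: next_free[u])
        start = next_free[unit]
        next_free[unit] = start + issue_interval
        finish = max(finish, start + latency)

    return Fraction(finish, n_uops)


if __name__ == "__main__":
    # Two multiply-capable units, three-cycle multiplies, as in the text.
    print(float(cycles_per_instruction(6000, 2, 3, pipelined=False)))
    print(float(cycles_per_instruction(6000, 2, 3, pipelined=True)))
```

With these numbers the non-pipelined case comes out at exactly 1.5 cycles/instruction, and the pipelined case lands just above 0.5. The small excess is the three cycles the last multiply needs to drain out of the pipeline, which vanishes as the number of µops grows.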

Knowing this, I finally get why instruction latency and bandwidth tables specify reciprocal throughput: it’s equivalent to cycles/instruction!
