submited by

Style Pass

The x86 instruction set has a somewhat peculiar set of SIMD integer multiply operations, and Intel’s particular implementation of several of these operations in their headline core designs has certain idiosyncrasies that have been there for literally over 25 years at this point. I don’t actually have any inside information, but it’s fun to speculate, and it gives me an excuse to waffle about multiplier designs, so here we go!

x86 doesn’t have explicit SIMD integer operations before MMX, which first showed up in – no big surprise – the Pentium MMX. Said Pentium MMX offers exactly three SIMD integer multiply instructions, and all three of them originally took 3 cycles (fully pipelined).

The first and most basic one is PMULLW, “packed multiply low word”, which interprets its two 64-bit MMX register operands as containing four words (which in x86, if you’re not familiar, means 16-bit integers) each. The corresponding lanes in both operands are multiplied and the low 16 bits of the result written to the corresponding lane of the destination. We don’t need to say whether these integers are interpreted as signed or unsigned because for the low 16 bits, it doesn’t matter. In short, it’s a basic element-wise multiply working on 16-bit ints.

The second available integer multiply is PMULHW, “packed multiply high word”. Again, we multiply 16-bit lanes together, which (in general) yields a 32-bit product, and this time, we get the top 16 bits of the result. This time, we need to make up our mind about whether the integers in question are considered signed or unsigned; in this case, it’s signed. A fun fact about “high multiply” type operations (which exist in a lot of instruction sets) is that there’s no practical way (at least, not that I’m aware of) to compute just those high bits. Getting those high bits right generally means computing the full product (in this case, 32 bits per lane) and then throwing away the bottom half. Therefore, a datapath that can support both types of multiplies will usually end up having a full 16×16->32-bit multiplier, compute all product bits, and then throw half of them away in either case.

Read more fgiesen.word...