Incrementing Vectors

2021-08-22 00:00:06

It’s easy to answer this question (which is faster: incrementing every element of a vector of uint8_t, or of uint32_t?) with a benchmark, and we will get to that soon, but for now let’s take some guesses (also known as “reasoning from first principles” if you want to pretend there is a lot of science in it).

Knowing the hardware we’re testing on, Intel Skylake, we can check the characteristics of 8-bit and 32-bit add-immediate instructions. It turns out that their primary performance characteristics are identical: 1 per cycle throughput and 4 cycles latency through memory. In this case, we don’t expect latency to matter, since each add is independent, so we might hope this runs at 1 cycle per element, if the rest of the loop overhead can run in parallel.

We might also note that 20,000 elements means a working set of 20 KB for the uint8_t vector, but 80 KB for the uint32_t vector. The former fits cleanly in the L1 cache of modern x86 machines, while the latter doesn’t. So maybe the 8-bit version will have a leg up due to caching effects?

Finally, we might note that this problem seems like a textbook case for auto-vectorization: a loop with a known number of iterations doing an arithmetic operation on adjacent elements in memory. In that case, we might expect the 8-bit version to have a huge advantage over the 32-bit version, since every vector operation will process 4 times as many elements, and in general, on Intel, the byte element vector operations have the same performance as their 32-bit element counterparts.
