An introduction to auto-vectorization with LLVM

Most modern general-purpose CPUs have a vector processing unit (VPU). This unit contains vector registers that can hold multiple integers or floats, along with instructions that operate on those registers. Given two vector registers containing N floats each † (for example, a 128-bit register holds four 32-bit floats, so N = 4), the VPU can execute a single instruction that operates on both registers and stores the result in a third vector register. Without a VPU, the same work would require N instructions, each operating on two floating-point registers and relying on the floating-point unit (FPU), or on the arithmetic logic unit (ALU) in the case of integers.

Major instruction set architectures have vector extensions and corresponding VPU instructions. Examples include x86's SSE/AVX, AArch64's Neon/SVE, and the latest entrant, RISC-V's RVV. To generate vector instructions directly, the programmer could adapt their code to use standardized vector intrinsics. To illustrate, suppose the programmer had the following source code:
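The original listing does not survive in this copy; the sketch below is a plausible stand-in under the same premise, adding two float arrays element-wise. The scalar version is followed by a hand-vectorized version using x86 SSE intrinsics (the names add_scalar and add_sse are illustrative):

    #include <immintrin.h>  /* x86 SSE intrinsics */
    #include <stddef.h>

    /* Plain scalar loop: one floating-point add per iteration. */
    void add_scalar(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Hand-vectorized with SSE: each _mm_add_ps adds four floats at once. */
    void add_sse(float *dst, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   /* load 4 unaligned floats */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* 4 adds in 1 instruction */
        }
        for (; i < n; i++)                     /* scalar tail for the remainder */
            dst[i] = a[i] + b[i];
    }

Even this small example shows the cost: the intrinsic version is tied to one architecture and one vector width, and must handle the leftover elements itself.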

As is evident, hand-vectorizing code like this is difficult and error-prone for non-trivial programs. It is analogous to writing assembly by hand instead of writing in the source language and letting the compiler generate the assembly. Indeed, compilers today auto-vectorize code; for the purposes of illustration, let us use Clang/LLVM. First, Clang, the frontend, lowers the source language to target-independent LLVM IR. Next, the vectorizers in LLVM's middle-end take this scalar LLVM IR and produce LLVM IR that uses vector types. Finally, LLVM's backend lowers the vectorized LLVM IR to target-specific assembly.
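As a rough sketch of how to watch this pipeline in action (assuming a reasonably recent Clang; the file and function names are illustrative), one can ask Clang to report when the loop vectorizer fires and to emit the resulting LLVM IR:

    /* add.c -- a loop that is a typical candidate for LLVM's loop vectorizer. */
    #include <stddef.h>

    void add(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* With optimizations on, the loop vectorizer runs and can report what it did:
     *
     *   clang -O2 -Rpass=loop-vectorize -S -emit-llvm add.c -o add.ll
     *
     * The remark indicates the loop was vectorized, and the LLVM IR in add.ll
     * uses vector types such as <4 x float> in place of scalar float operations,
     * which the backend then lowers to the target's vector instructions.
     */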
