Careful with Pair-of-Registers instructions on Apple Silicon

Submitted by Style Pass, 2024-04-29 06:30:05

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He made an intriguing remark about a performance regression on Apple hardware that followed what appeared to be an optimization. The .NET 9.0 runtime introduced an optimization in which two loads (ldr) can be combined into a single load (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.

Under ARM, the ldr instruction loads a single value from memory into a register. It operates on one register at a time. Its assembly syntax is straightforward: ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers with a single instruction. Its assembly syntax is similar, but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The ldp instruction loads two 32-bit words or two 64-bit words from memory and writes them to two registers.
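To make the difference concrete, here is a minimal C sketch that models what each instruction does for the 64-bit case. The function names load_one and load_pair are hypothetical and chosen for illustration; the code describes only the values that end up in the destination registers, not how the hardware performs the access.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models "ldr Xd, [Xn, #offset]": read one 64-bit value at base + offset.
   (Hypothetical helper, for illustration only.) */
uint64_t load_one(const unsigned char *base, size_t offset) {
    uint64_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

/* Models "ldp Xd1, Xd2, [Xn, #offset]": read two consecutive 64-bit
   values, the first at base + offset and the second 8 bytes later.
   (Hypothetical helper, for illustration only.) */
void load_pair(const unsigned char *base, size_t offset,
               uint64_t *d1, uint64_t *d2) {
    memcpy(d1, base + offset, sizeof *d1);
    memcpy(d2, base + offset + sizeof *d1, sizeof *d2);
}
```

In this model, one call to load_pair yields the same values as two calls to load_one at adjacent offsets, which is why merging two ldr instructions into one ldp looks like a safe peephole rewrite.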

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from memory that was just written to, ldp might incur a significant penalty.
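The access pattern at issue can be sketched in C as follows. This only illustrates the shape of the code, two adjacent 64-bit fields written and then read back right away; it is not the sequence from the original report. The names pair_t and write_then_read are hypothetical. Whether a given compiler actually emits ldp here, or simply keeps the values in registers, depends on the compiler and its optimization settings.

```c
#include <stdint.h>

/* Hypothetical type: two adjacent 64-bit fields, the kind of layout
   where a pair of loads could be merged into a single ldp. */
typedef struct {
    uint64_t a;
    uint64_t b;
} pair_t;

/* Write both fields, then read them straight back.
   The reads target memory that was just written, which is the
   situation described above. (Illustrative sketch only.) */
uint64_t write_then_read(pair_t *p, uint64_t x, uint64_t y) {
    p->a = x;           /* store to the first field  */
    p->b = y;           /* store to the second field */
    return p->a + p->b; /* loads from the same two fields */
}
```

To see what a particular toolchain does with this function, one option is to inspect the generated assembly on an ARM64 target and check whether the two reads become a single ldp.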
