
Why is __int128_t faster than long long on x86-64 GCC?


Using GCC 10.1.0 on x86-64 GNU/Linux, whether compiled with -O2 or without optimization, __int128_t is consistently a little faster than long long.
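For reference, a benchmark along these lines (my own minimal sketch of the same idea, not the original code from the question) could look like this, timing a loop of divisions/moduli for each type:

```c
/* Minimal benchmark sketch: compile with `gcc -O2 bench.c` on x86-64 GNU/Linux.
 * This is an illustration of the kind of measurement being discussed, not the
 * original poster's benchmark. */
#include <stdio.h>
#include <time.h>

#define N 100000000LL

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    volatile long long ll_sum = 0;      /* volatile keeps the loops from being optimized away */
    volatile __int128_t i128_sum = 0;

    double t0 = seconds();
    for (long long i = 1; i <= N; ++i)
        ll_sum += (N % i) / i;          /* 64-bit division: native idiv */
    double t1 = seconds();

    for (__int128_t i = 1; i <= N; ++i)
        i128_sum += (N % i) / i;        /* 128-bit division: calls into __modti3 / __divti3 */
    double t2 = seconds();

    printf("long long:  %.3f s\n", t1 - t0);
    printf("__int128_t: %.3f s\n", t2 - t1);
    return 0;
}
```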

Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operations on the former are performed by native instructions while operations on the latter are not (since we focus on 64-bit platforms). Additions, multiplications and subtractions are slower with __int128_t. However, the built-in functions for division/modulus on 16-byte types (__divti3 and __modti3 on x86-64 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).
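To see this code-generation difference for yourself, a small illustration (my own sketch, not taken from the original post) that can be compiled with `gcc -O2` and inspected on godbolt:

```c
/* With gcc -O2 on x86-64, the 64-bit modulus typically compiles to a single
 * idiv instruction, while the 128-bit modulus typically compiles to a call to
 * the libgcc routine __modti3. Exact assembly depends on the compiler version. */
long long mod64(long long a, long long b) {
    return a % b;              /* native idiv */
}

__int128_t mod128(__int128_t a, __int128_t b) {
    return a % b;              /* call to __modti3 */
}
```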

If we look deeper into the implementation of the GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4), and Intel processors can execute this branchy code faster than the slow idiv instruction.
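As a rough illustration of why such a branchy software routine can win, here is a much-simplified sketch of the kind of fast path a helper like __udivmodti4 can take; it is not the actual libgcc source, just an assumed structure for explanation:

```c
/* Much-simplified sketch (NOT the actual libgcc code): when both operands fit
 * in 64 bits, a single hardware div suffices, and that branch is easy for the
 * CPU to predict when the inputs stay small. */
#include <stdint.h>

typedef unsigned __int128 u128;

static u128 udivmod128_sketch(u128 n, u128 d, u128 *rem) {
    if ((uint64_t)(n >> 64) == 0 && (uint64_t)(d >> 64) == 0) {
        /* Fast path: plain 64-bit division, one div instruction. */
        *rem = (uint64_t)n % (uint64_t)d;
        return (uint64_t)n / (uint64_t)d;
    }
    /* Slow path: a full multi-word software division would go here
     * (shift/subtract or schoolbook algorithm); left as a placeholder. */
    *rem = n % d;
    return n / d;
}
```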

Please note that the performance of the two implementations can differ greatly from one architecture to another (because of the number of CPU ports, the branch prediction capability, and the latency/throughput of the idiv instruction). For example, the latency of a 64-bit idiv instruction is 41-95 cycles on Skylake but only 8-41 cycles on AMD Ryzen processors. Likewise, the latency of a div is about 6-89 cycles on Skylake and roughly the same on Ryzen. This means that the benchmark results should be significantly different on Ryzen processors (the opposite effect may even appear, due to the additional instruction/branch costs in the 128-bit case).
