Decoding UTF8 with Parallel Extract

Submitted by
Style Pass
2024-05-05 09:30:02

As a side quest, I recently decided to write a branchless UTF-8 decoder that uses the pext ("parallel extract") instruction. It's compliant with RFC 3629, meaning that it doesn't just naively decode the code point but also checks for overlong encodings, surrogate code points (U+D800 to U+DFFF) and such. Compiled with gcc -O3 -march=x86-64-v3, the entire decoder comes out to about 29 instructions. That's pretty sweet, if you ask me.

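UTF-8 encodes a code point in one to four bytes: 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx, or 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The bits marked with x are the payload bits that we need to extract. We also need to validate the input: make sure the continuation markers exist, and check for overlong encodings and surrogate code points.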
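To make the extraction step concrete, here is a minimal sketch of how parallel extract can pull out the x bits. It is not the code from the post: pext32_soft is a portable software stand-in for the hardware instruction (on a BMI2 machine you would call _pext_u32 from <immintrin.h> instead), and the big-endian packing, the payload_mask table and the utf8_payload helper are assumptions made for illustration. It performs no validation.

```c
#include <stdint.h>

/* Portable stand-in for the BMI2 pext instruction: gather the bits of src
   selected by mask and pack them into the low bits of the result. */
static uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1u); /* lowest set bit of mask */
        if (src & lowest)
            out |= bit;
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
}

/* Payload masks for 1..4-byte sequences. Assumption: the bytes are packed
   big-endian into a 32-bit word, lead byte in the top 8 bits. */
static const uint32_t payload_mask[5] = {
    0, 0x7F000000u, 0x1F3F0000u, 0x0F3F3F00u, 0x073F3F3Fu
};

/* Extract the payload bits of an n-byte sequence (n in 1..4).
   Hypothetical helper; it does not validate anything. */
static uint32_t utf8_payload(const unsigned char *p, int n)
{
    uint32_t word = 0;
    for (int i = 0; i < n; i++)
        word |= (uint32_t)p[i] << (24 - 8 * i);
    return pext32_soft(word, payload_mask[n]);
}
```

For the three bytes E2 82 AC, the packed word is 0xE282AC00 and the mask 0x0F3F3F00 selects the sixteen payload bits, which collapse into 0x20AC (the euro sign). One mask per sequence length replaces the usual chain of shifts and ORs.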

The decoder interface is as follows. You give it a buffer, a length, and an int pointer to accumulate errors into. The function returns a struct containing the decoded code point and its UTF-8 encoded length. If an error occurs, err will be non-zero and the returned struct is not meaningful.
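The interface described above might look roughly like the sketch below. The names utf8_result and utf8_decode_ref are hypothetical, since the post's own identifiers are not shown here, and the body is a plain branching reference decoder rather than the branchless pext version. It is only meant to show the shape of the API and the checks RFC 3629 requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the original identifiers are not shown in the text. */
typedef struct {
    uint32_t codepoint; /* decoded code point, not meaningful after an error */
    int len;            /* bytes consumed by the encoded form */
} utf8_result;

/* Branching reference decoder, NOT the branchless one from the post.
   Errors are ORed into *err; the function never clears it. */
static utf8_result utf8_decode_ref(const unsigned char *buf, size_t len, int *err)
{
    /* Smallest code point that legitimately needs n bytes (overlong check). */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    utf8_result r = {0, 1};
    int n;

    if (len == 0) { *err |= 1; r.len = 0; return r; }

    unsigned char b0 = buf[0];
    if (b0 < 0x80)                { n = 1; r.codepoint = b0; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; r.codepoint = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; r.codepoint = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; r.codepoint = b0 & 0x07; }
    else { *err |= 1; return r; }                 /* invalid lead byte */

    if ((size_t)n > len) { *err |= 1; return r; } /* truncated input */

    for (int i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) { *err |= 1; return r; } /* no 10xxxxxx marker */
        r.codepoint = (r.codepoint << 6) | (buf[i] & 0x3F);
    }
    r.len = n;

    if (r.codepoint < min_cp[n]) *err |= 1;                         /* overlong */
    if (r.codepoint >= 0xD800 && r.codepoint <= 0xDFFF) *err |= 1;  /* surrogate */
    if (r.codepoint > 0x10FFFF) *err |= 1;                          /* out of range */
    return r;
}
```

Returning the code point and length together in a small struct lets the caller advance the buffer without a second lookup; in this sketch a failed decode reports a length of 1 so that a caller can still step past the offending byte.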

One interesting choice I've made is to make err sticky across function calls. The decoder doesn't clear it to zero when called; the caller is responsible for that. This lets you check for errors once, at the end of your loop, if that's appropriate.
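A caller-side loop using the sticky error might be sketched as follows. Everything here is illustrative: decode_stub is a deliberately tiny stand-in (ASCII only) so that the block compiles on its own, and count_codepoints is a hypothetical helper. The sketch also assumes the decoder reports a length of at least one even after an error, so the loop always terminates.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t codepoint; int len; } utf8_result; /* hypothetical */

/* Stand-in decoder: accepts ASCII only and flags every other byte.
   Like the decoder described above, it ORs into *err and never clears it. */
static utf8_result decode_stub(const unsigned char *buf, size_t len, int *err)
{
    utf8_result r = { buf[0], 1 };
    (void)len;
    *err |= (buf[0] >= 0x80);
    return r;
}

/* Count code points; returns -1 if any call flagged an error. */
static long count_codepoints(const unsigned char *buf, size_t len)
{
    int err = 0;             /* the caller clears the flag, once */
    long count = 0;
    size_t i = 0;

    while (i < len) {
        utf8_result r = decode_stub(buf + i, len - i, &err);
        i += (size_t)r.len;  /* no per-iteration error branch */
        count++;
    }
    return err ? -1 : count; /* single check after the loop */
}
```

Because the flag only accumulates, the hot loop carries no error branch. The trade-off is that the caller learns that something went wrong, but not where, so this pattern fits cases where any error invalidates the whole input.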
