Efficient In-Place UTF-16 Unicode Correction with ARM NEON

submitted by
Style Pass
2024-12-29 18:30:03

UTF-16 is an encoding system used by several platforms and applications to represent Unicode characters. Notably, Microsoft Windows employs UTF-16 for internal operations, file names, and registry keys, while Java and JavaScript use it for string representation.

The replace_invalid_utf16 function scans a buffer of char16_t code units and checks that every high surrogate (0xD800 to 0xDBFF) is immediately followed by a low surrogate (0xDC00 to 0xDFFF), so that the two form a valid pair. A high surrogate with no low surrogate after it, or a low surrogate with no high surrogate before it, is replaced with the Unicode replacement character (U+FFFD). The buffer is thus corrected in place. The function should be reasonably efficient.
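As a point of reference, the behavior described above can be sketched with a plain scalar loop. This is only an illustration of the rule and not necessarily the original implementation: the signature (a pointer and a length) is an assumption, since the text names the function but does not show its parameters.

```cpp
#include <cstddef>

// Scalar sketch: replace every unpaired surrogate with U+FFFD, in place.
// The (buf, len) signature is an assumption made for this illustration.
void replace_invalid_utf16(char16_t *buf, std::size_t len) {
  const char16_t replacement = 0xFFFD;
  std::size_t i = 0;
  while (i < len) {
    const char16_t c = buf[i];
    if (c >= 0xD800 && c <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows immediately.
      if (i + 1 < len && buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF) {
        i += 2;  // valid pair, skip both code units
      } else {
        buf[i] = replacement;
        i += 1;
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      // Low surrogate that was not consumed as part of a pair.
      buf[i] = replacement;
      i += 1;
    } else {
      i += 1;  // ordinary code unit, nothing to fix
    }
  }
}
```

For instance, the three code units 0x0041, 0xD800, 0x0042 become 0x0041, 0xFFFD, 0x0042, while a valid pair such as 0xD83D, 0xDE00 is left untouched.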

Most current processors have SIMD instructions that operate on 128-bit registers, each holding eight 16-bit words, so eight UTF-16 code units can be processed at once. Most mobile processors today are 64-bit ARM processors with powerful ARM NEON instructions.

We can write a function targeting ARM NEON using intrinsic functions. These are special functions that give us low-level access to the unique functionality of ARM NEON. There are comparable intrinsic functions for other processor families such as Intel/AMD, RISC-V, Loongson and so forth.
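To make the idea concrete, here is one possible NEON sketch, not necessarily the routine the original post arrives at. It relies on a per-position rule: a code unit is invalid if it is a low surrogate whose predecessor is not a high surrogate, or a high surrogate whose successor is not a low surrogate. The names replace_invalid_utf16_neon, fix_range, is_high and is_low are hypothetical, and the cast from char16_t to uint16_t assumes both are plain 16-bit unsigned types, as they are on mainstream compilers. On non-ARM targets the sketch falls back to the scalar helper so that it still compiles.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helpers: classify a code unit by its top six bits.
static inline bool is_high(char16_t c) { return (c & 0xFC00) == 0xD800; }
static inline bool is_low(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Scalar per-position rule applied to positions [begin, end).
static void fix_range(char16_t *buf, std::size_t len, std::size_t begin,
                      std::size_t end) {
  for (std::size_t j = begin; j < end; j++) {
    const bool prev_high = j > 0 && is_high(buf[j - 1]);
    const bool next_low = j + 1 < len && is_low(buf[j + 1]);
    if ((is_low(buf[j]) && !prev_high) || (is_high(buf[j]) && !next_low)) {
      buf[j] = 0xFFFD;
    }
  }
}

// Sketch of an in-place fix-up processing eight code units per iteration.
void replace_invalid_utf16_neon(char16_t *buf, std::size_t len) {
  if (len == 0) { return; }
  fix_range(buf, len, 0, 1);  // position 0 has no predecessor
  std::size_t i = 1;
#if defined(__aarch64__) || defined(__ARM_NEON)
  // Assumption: char16_t and uint16_t share the same representation.
  uint16_t *p = reinterpret_cast<uint16_t *>(buf);
  const uint16x8_t top6 = vdupq_n_u16(0xFC00);
  const uint16x8_t high_tag = vdupq_n_u16(0xD800);
  const uint16x8_t low_tag = vdupq_n_u16(0xDC00);
  const uint16x8_t replacement = vdupq_n_u16(0xFFFD);
  // Each iteration reads positions i-1 .. i+8, hence the i + 9 <= len bound.
  for (; i + 9 <= len; i += 8) {
    const uint16x8_t prev = vld1q_u16(p + i - 1);
    const uint16x8_t cur = vld1q_u16(p + i);
    const uint16x8_t next = vld1q_u16(p + i + 1);
    const uint16x8_t cur_high = vceqq_u16(vandq_u16(cur, top6), high_tag);
    const uint16x8_t cur_low = vceqq_u16(vandq_u16(cur, top6), low_tag);
    const uint16x8_t prev_high = vceqq_u16(vandq_u16(prev, top6), high_tag);
    const uint16x8_t next_low = vceqq_u16(vandq_u16(next, top6), low_tag);
    // bad = (cur_low AND NOT prev_high) OR (cur_high AND NOT next_low)
    const uint16x8_t bad = vorrq_u16(vbicq_u16(cur_low, prev_high),
                                     vbicq_u16(cur_high, next_low));
    // Take the replacement where bad is set, else keep the original unit.
    vst1q_u16(p + i, vbslq_u16(bad, replacement, cur));
  }
#endif
  fix_range(buf, len, i, len);  // remaining tail (everything on non-ARM)
}
```

Reading the predecessor from the already corrected buffer is safe here: a replaced predecessor was either a low surrogate, which is not a high surrogate anyway, or a high surrogate that was replaced precisely because the current unit is not a low surrogate.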
