UTF-8 validation performance

submitted by
Style Pass
2024-10-02 07:30:03

One of the things that GLib-based APIs got right from early on is that generally speaking, our API boundaries expect UTF-8. When transitioning between API boundaries, UTF-8 validation is a common procedure.

The implementation we’ve had is fairly straightforward to read and reason about, though it is not winning any performance awards.
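To make "straightforward to read and reason about" concrete, here is a minimal sketch of what such a validator can look like. It is not GLib's actual code: `simple_utf8_validate()` is a hypothetical name, and it does not try to mirror every behaviour of `g_utf8_validate()`. It decodes each sequence, then rejects overlong forms, surrogates, and anything past U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GLib's implementation: returns true if the
 * len bytes at str are well-formed UTF-8 as defined by RFC 3629. */
static bool
simple_utf8_validate (const char *str, size_t len)
{
  const unsigned char *s = (const unsigned char *) str;
  size_t i = 0;

  assert (str != NULL || len == 0);

  while (i < len)
    {
      unsigned char b = s[i];
      size_t n;       /* continuation bytes expected after the lead byte */
      uint32_t cp;    /* code point being decoded */
      uint32_t min;   /* smallest code point allowed for this length */

      if (b < 0x80)
        {
          i++;        /* plain ASCII */
          continue;
        }
      else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false; /* stray continuation byte or invalid lead byte */

      if (len - i <= n)
        return false; /* sequence is truncated by the end of the buffer */

      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      if (cp < min)
        return false; /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false; /* UTF-16 surrogate */
      if (cp > 0x10FFFF)
        return false; /* beyond the Unicode range */

      i += n + 1;
    }

  return true;
}
```

Every byte goes through several data-dependent branches here, which is a large part of why code in this style is easy to follow but hard to make fast.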

UTF-8 validation performance has been on my radar for some time. It was an issue back when I was working on VTE performance, and more recently I had to look for ways to mitigate its cost in GVariant.

Björn Höhrmann wrote a branchless UTF-8 validator years ago, and it is the basis for many UTF-8 validators. Once you really start integrating it with other API designs, though, you’ll often find you need to add branches outside of it.

VTE currently uses a modified version of it for its UTF-8 validation. That design has some nice properties when processing streaming input, such as a PTY character stream.
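The streaming property is easiest to see in code. The sketch below is neither VTE's validator nor Höhrmann's table-driven automaton. It is a simplified, branchy state machine with hypothetical names (`Utf8Stream`, `utf8_stream_feed()`, `utf8_stream_is_complete()`), meant only to show how keeping a small state between calls lets a multi-byte sequence be split across reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical streaming state; all names here are illustrative only. */
typedef struct
{
  unsigned char remaining; /* continuation bytes still expected */
  unsigned char lo, hi;    /* allowed range for the next continuation byte */
} Utf8Stream;

#define UTF8_STREAM_INIT { 0, 0x80, 0xBF }

/* Feed one chunk. Returns false as soon as an invalid byte is seen;
 * the state should not be reused after a failure. */
static bool
utf8_stream_feed (Utf8Stream *st, const char *data, size_t len)
{
  const unsigned char *p = (const unsigned char *) data;

  assert (st != NULL);

  for (size_t i = 0; i < len; i++)
    {
      unsigned char b = p[i];

      if (st->remaining > 0)
        {
          if (b < st->lo || b > st->hi)
            return false;
          st->remaining--;
          st->lo = 0x80; /* later continuation bytes use the full range */
          st->hi = 0xBF;
          continue;
        }

      if (b < 0x80)
        continue; /* ASCII */

      st->lo = 0x80;
      st->hi = 0xBF;

      if (b >= 0xC2 && b <= 0xDF)
        st->remaining = 1;
      else if (b >= 0xE0 && b <= 0xEF)
        {
          st->remaining = 2;
          if (b == 0xE0) st->lo = 0xA0;      /* rejects overlong forms */
          else if (b == 0xED) st->hi = 0x9F; /* rejects surrogates */
        }
      else if (b >= 0xF0 && b <= 0xF4)
        {
          st->remaining = 3;
          if (b == 0xF0) st->lo = 0x90;      /* rejects overlong forms */
          else if (b == 0xF4) st->hi = 0x8F; /* rejects > U+10FFFF */
        }
      else
        return false; /* 0x80..0xC1 and 0xF5..0xFF can never start a sequence */
    }

  return true;
}

/* True when the stream does not end in the middle of a sequence. */
static bool
utf8_stream_is_complete (const Utf8Stream *st)
{
  return st->remaining == 0;
}
```

With this shape, a reader can feed `"\xE2\x82"` from one read and `"\xAC"` from the next, and the euro sign still validates without buffering or re-scanning the earlier bytes. A table-driven automaton gets the same effect by carrying a single state value between calls.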
