Locale Cooking: Common Scenarios and Suggestions

submitted by
Style Pass
2024-12-02 20:00:16

We’ve gone through a lot of detail about locales and collations here, but what should you do when it is time to set up a database? Here is a cookbook of common scenarios, each with a recommendation.

The built-in C.UTF-8 locale collates text by Unicode code point. You won’t get broken or invalid characters (as you can with the C/POSIX locale), but the sort order will be incorrect for some languages. In exchange, you get a reliable (if sometimes wacky) sort order and very good performance. You also avoid any issues caused by the underlying locale provider libraries changing.
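To make "collates by code point" concrete, here is a small Python sketch. It is only an illustration, not how the database implements the locale: it relies on the fact that Python's built-in `sorted` compares strings code point by code point, and the word list is made up for the example.

```python
# Python compares str values code point by code point, which mirrors
# the ordering of a code-point-based collation.
# "\u00e9cole" is "école" written with the precomposed U+00E9 character.
words = ["apple", "Zebra", "\u00e9cole", "eagle"]

codepoint_order = sorted(words)
print(codepoint_order)
# ['Zebra', 'apple', 'eagle', 'école']
```

The result is stable and cheap to compute, but "Zebra" lands before "apple" (uppercase ASCII has lower code points than lowercase) and "école" lands after every unaccented word, which is not what a French reader would expect.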

The C/POSIX locale just uses the C standard library function strcmp, so it compares byte by byte, ignoring encoding entirely. This also avoids issues with locale provider libraries changing. If you store characters outside 7-bit ASCII, the sort order will be completely wacky, so don’t do that.
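A rough Python sketch of byte-by-byte comparison follows. The helper name `strcmp_like` is hypothetical, the sample words are invented, and the text is assumed to be stored as UTF-8; Python's `bytes` comparison is lexicographic on unsigned byte values, which is the same idea as strcmp.

```python
def strcmp_like(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 by comparing raw bytes, in the spirit of C strcmp."""
    return (a > b) - (a < b)

# "\u00c4rger" is "Ärger"; in UTF-8 the "Ä" becomes the bytes 0xC3 0x84.
words = ["zebra", "\u00c4rger", "apple"]

# Sort on the encoded bytes only; the comparison knows nothing about letters.
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(byte_order)
# ['apple', 'zebra', 'Ärger']
```

Every byte of a non-ASCII character is 0x80 or higher, so such words sort after all pure-ASCII text regardless of what the characters mean. Store the same word in a different encoding and the bytes, and possibly the order, change with it.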

An ICU locale provides reasonable collation order for most languages, and is (usually) fast compared to libc locales (except POSIX). The ICU libraries also tend to change in a breaking fashion less often than libc, although either one can change in an unfortunate way.
