In this blog post, and probably one or more following ones, I want to discuss how collations in PostgreSQL work internally. See also this previous post about the work we have done for collations in PostgreSQL 15. And there is even more coming together in PostgreSQL 16 right now, which we will talk about in the future.
Sorting is an important functionality in a database system. Sorting happens internally, for example in B-tree indexes and during merge joins, and externally when the user requests it.
There are two major aspects to sorting. First, the sort algorithm, how you put the values into sorted order in an efficient way, considering time and space requirements. This is a classical computer science problem. I’m not going to talk about that here. Second, how to decide what the sort order should be. For many kinds of data this is obvious, for example for numbers and date/time values. In the context of character strings, this process is called collation and is a bit more complicated.
Of course, there is some kind of base order of A, B, C, … that is mostly(?) agreed upon. But there are a variety of well-known and lesser-known complications: how to handle case differences, accents, punctuation, whitespace, variations in different languages, variations within the same language, data in multiple languages, and more.