#dynamically sized type #german string #optimization #reference counting #shared ownership #short string #small string #umbra string #unsafe
Since I’m studying database systems for my Master’s degree in Germany, the article “Why German Strings are Everywhere” immediately caught my attention. I was pretty excited to learn that it covers the string data structure described in the paper “Umbra: A Disk-Based System with In-Memory Performance”, which I had first come across through another paper about its storage engine, “LeanStore: In-Memory Data Management Beyond Main Memory”. I was even more interested to learn that this string layout has been implemented in many different data systems, such as DuckDB, Apache Arrow, and Polars.
This string implementation also allows for the very important “short string optimization”: A short enough string can be stored “in place,” i.e., we set a specific bit in the capacity field, and the remainder of capacity, as well as size and ptr, become the string itself. This way, we save on allocating a buffer and a pointer dereference each time we access the string. An optimization that’s impossible in Rust, by the way ;).
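To make the idea concrete, here is a minimal sketch of a short string optimization as a custom Rust type. The standard `String` does not store short strings in place, but nothing stops a separate type from doing so. Everything here is hypothetical and for illustration only: the name `SmallStr`, the `INLINE_CAP` threshold of 22 bytes, and the use of an enum tag in place of the "specific bit in the capacity field" from the quote. It is not the layout used by the article, by Umbra, or by any particular library.

```rust
/// Hypothetical inline capacity, chosen only for illustration.
const INLINE_CAP: usize = 22;

/// Hypothetical small-string type: short contents are stored in place,
/// longer contents go to a heap allocation.
enum SmallStr {
    /// The bytes live directly inside the value; no allocation is made.
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// The bytes live in a heap buffer reached through a pointer.
    Heap(Box<str>),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            // Short enough: copy the bytes into the value itself.
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            // Too long: fall back to a heap allocation.
            SmallStr::Heap(s.into())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStr::Inline { len, buf } => {
                // The bytes were copied whole from a valid &str,
                // so this check is expected to always succeed.
                std::str::from_utf8(&buf[..*len as usize])
                    .expect("inline bytes are valid UTF-8")
            }
            SmallStr::Heap(s) => &**s,
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallStr::Inline { .. })
    }
}
```

With this sketch, `SmallStr::new("hello")` stays inline and never touches the allocator, while a string longer than 22 bytes takes the heap path. The safe version pays for a UTF-8 re-check on every `as_str` call for inline strings; a real implementation would likely avoid that with `unsafe`, which is a trade-off this sketch deliberately skips.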