This is part 2 of a 3-part blog series about how we’ve improved the way CockroachDB stores and modifies data in bulk (here is part 1 in case you missed it). We went way down into the deepest layers of our storage system, then up to our SQL schema changes and their transaction timestamps - all without anybody noticing (or at least we hope!)
Bulk ingestions are used to write large amounts of data with high throughput, such as imports, backup restoration, schema changes, and index backfills. These operations use a separate write path that bypasses transaction processing, instead ingesting data directly into the storage engine with highly amortized write and replication costs. However, because these operations bypass regular transaction processing, they have also been able to sidestep our normal mechanisms for preserving MVCC history.
The Pebble storage engine uses an LSM (log-structured merge tree) to store key/value pairs. Briefly, data is stored in 7 levels (L0 to L6), with more recent data in the upper levels (L0 at the top) and older data progressively further down. Each level is roughly 10 times larger than the one above it, and as data ages it moves down the levels through a process known as compaction. Notably, each level is made up of a set of SST (sorted string table) files: immutable files that contain a set of key/value pairs sorted by key. Normally, as keys are written to Pebble, they are buffered in memory and then flushed out to disk as a new SST file in L0. A compaction will eventually read the SSTs in L0, merge and sort their key/value pairs, and write them out to a new SST file in L1, and so on down the levels (this limits how many SST files a read must search across, i.e. it bounds read amplification).