Log-structured file systems: There's one in every SSD

submited by

Style Pass

2024-12-26 03:00:27

What is a log-structured file system? Log-structured file systems, oddly enough, evolved from logging file systems. A logging (or journaling) file system is a normal write-in-place file system in the style of ext2 or FFS, just with a log of write operations bolted on to the side of it. (We'll use the term "journaling file system" in the rest of the paper to avoid confusion between "logging" and "log-structured" file systems.) A journaling file system keeps the on-disk state of the file system consistent by writing a summary of each write operation to the log, stored somewhere non-volatile like disk (or NVRAM if you have the money), before writing the changes directly to their long-term place in the file system. This summary, or log record, contains enough information to repeat the entire operation if the direct write to the file system gets interrupted mid-way through (e.g., by a system crash). This operation is called replaying the log. So, in short, every change to the file system gets written to disk twice: once to the log, and once in the permanent location.

Around 1988, John K. Ousterhout and several collaborators realized that they could skip the second write entirely if they treated the entire file system as one enormous log. Instead of writing the operation to the log and then rewriting the changes in place somewhere else on the disk, it would just write it once to the end of the log (wherever that is) and be done with it. Writes to existing files and inodes are copy-on-write - the old version is marked as free space, and the new version is written at the end of the log. Conceptually, finding the current state of the file system is a matter of replaying the log from beginning to end. In practice, a log-structured file system writes checkpoints to disk periodically; these checkpoints describe the state of the file system at that point in time without requiring any log replay. Any changes to the file system after the checkpoint are recovered by replaying the relatively small number of log entries following the checkpoint.