In database land, most databases open(2) their WAL and data files with O_DIRECT so that write(2)/writev(2)/pwritev(2) perform unbuffered IO, maintain their own page cache, and use fdatasync() for durability. Doing so gives the most control: the database decides exactly what data is kept in its cache, can modify cached pages directly, and O_DIRECT skips the kernel's page cache when reading from or writing to disk. O_SYNC/O_DSYNC make a single write() with O_DIRECT equivalent to a write() followed by an fsync()/fdatasync(). In the Linux world, the existence of O_DIRECT is surprisingly controversial, and Linus has some famous rants on the subject illustrating the OS/DB worldview mismatch.
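As a concrete illustration, here's a minimal sketch of the O_DIRECT pattern: open with O_DIRECT, write from a block-aligned buffer, then fdatasync() for durability. The file name and the 4096-byte alignment are assumptions; in practice the buffer address, length, and file offset must all be aligned to the device's logical block size, which should be queried rather than hard-coded.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT: reads and writes bypass the kernel page cache. */
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* 4096 is an assumed logical block size; O_DIRECT requires the
     * buffer address, length, and offset to all be aligned to it. */
    const size_t blk = 4096;
    void *buf;
    if (posix_memalign(&buf, blk, blk) != 0) return 1;
    memset(buf, 0xAB, blk);

    if (pwrite(fd, buf, blk, 0) != (ssize_t)blk) return 1;

    /* Flush file data (but not metadata like mtime) to stable storage. */
    if (fdatasync(fd) != 0) return 1;

    free(buf);
    return close(fd);
}
```

Opening with O_DIRECT | O_DSYNC instead would fold the fdatasync() into each write, per the equivalence described above.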
There are some notable examples of databases that rely on buffered IO and the kernel page cache (e.g. RocksDB, LMDB). Relying on the kernel's page cache can be polite in the context of an embedded database meant to be used within another application and to co-exist with many other applications on a user's computer. Leaving the caching decisions to the kernel means that more memory for the page cache can be easily granted when the system has memory to spare, and reclaimed when available memory runs low. If using buffered IO, preadv2()/pwritev2()'s flags can be helpful (see the sketch below). pwritev2() has also gained support for multi-block atomic writes, which is conditional on filesystem and drive support[1]. [1]: Drive support means a drive for which Atomic Write Unit Power Fail (awupf) in nvme-cli id-ctrl returns something greater than zero. I've never actually seen a drive support this though.
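As a hedged sketch of two of those flags: RWF_NOWAIT turns preadv2() into a page-cache-only read that fails with EAGAIN instead of blocking on disk, and RWF_DSYNC gives a single pwritev2() O_DSYNC semantics without opening the whole file that way. The fd, offsets, and fallback policy here are illustrative.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

/* Serve a read only if the data is already resident in the page cache. */
ssize_t cached_read(int fd, void *buf, size_t len, off_t off) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n < 0 && errno == EAGAIN) {
        /* Not in the page cache; the caller should fall back to a
         * blocking read, e.g. on a background thread. */
        return -1;
    }
    return n;
}

/* Per-write O_DSYNC semantics: data reaches stable storage before return. */
ssize_t durable_write(int fd, const void *buf, size_t len, off_t off) {
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
}
```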
Directly invoking write() performs synchronous IO. Event-driven non-blocking IO is increasingly popular for concurrency and for thread-per-core architectures, and there are a number of different ways to do asynchronous IO in Linux. Prefer them in the following order: io_uring(7) > aio(7) > epoll(7) > select(2). The set of operations one can issue asynchronously shrinks rapidly the further one gets from io_uring. For example, io_uring supports an async fallocate(2), but aio doesn't; aio supports async fsync(), and epoll doesn't. A library which issues synchronous filesystem calls on background threads, like libeio, will be needed to fill in support where it's missing. For the utmost performance, one can use SPDK, but it is particularly unfriendly to use. Understanding Modern Storage APIs[2] has a nice comparison of SPDK vs io_uring vs aio. [2]: Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio, SPDK, and io_uring. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22), Association for Computing Machinery, New York, NY, USA, 120–127. [scholar]
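To make the io_uring preference concrete, here is a minimal liburing sketch (liburing and the file name are assumptions, not anything prescribed above) that links an async write to an fdatasync, so the sync is only issued once the write completes:

```c
#include <fcntl.h>
#include <liburing.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("wal.log", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    char buf[4096];
    memset(buf, 0xAB, sizeof(buf));

    /* Queue the write; IOSQE_IO_LINK orders the next SQE after it. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_LINK;

    /* Queue an fdatasync-equivalent, to run only after the write. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    io_uring_submit(&ring);

    /* Reap both completions. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) return 1;
        if (cqe->res < 0) return 1;
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

IOSQE_IO_LINK is what provides the WAL-style ordering guarantee here; without it the two operations could complete in either order.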