The Hidden Cost of Data Movement

2024-09-24 16:31:19

Recently, Mark Raasveldt of DuckDB wrote an excellent post about why memory management is crucial for efficient data processing. In his post, he focuses on the cost of having data on disk and moving it to memory. After all, everyone knows that having data in memory is what you want. As Jim Gray famously said in 2006:

Unfortunately not, because compute has come a long way since 2006. RAM locality may have been king in 2006. But the king is dead, long live the king. While memory locality was good enough for 2006, the CPU only performs operations on registers, making register locality the new king in town. The fact that CPUs have megabytes of cache whose sole purpose is to bridge the gap between memory and registers shows how far apart the CPU and memory really are.

So what else do you need to consider when simply having your data in memory is not enough to guarantee efficient access to it? I want to use Mark’s post as motivation to dive deeper into the topic of data movement during data processing and answer this question. We’ll explore this topic for everyday algorithms, as well as what it means for a database system. You can find the code for all experiments in this post at the bottom of this page, or in our examples repository.
