OpenZFS 2.3.0 will be released any day now, and it includes the new “Fast Dedup” feature. My team at Klara spent many months in 2023 and 2024 working on it, and we reckon it’s pretty good: a huge step up from the old dedup, as well as a solid base for further improvements.
I’ve been watching various forums and mailing lists since it was announced, and the thing I kept seeing was people saying something like “it has the same problems as the old dedup; needs too much memory, nukes your performance”. While that was true (ish), and is now significantly less true, the real problem is that this is just repeating the same old non-information that they probably heard from someone else who was repeating it too.
I don’t blame anyone really; it is true that dedup has been extremely challenging to get the best out of, it’s very difficult to find good information about using it well, and “don’t use it” was and remains almost certainly the right answer. But, with this being the first time in almost two decades that dedup has been worth even considering, I want to get some fresh information out there about what dedup is, how it worked traditionally and why it was usually bad, what we changed with fast dedup, and why it’s still probably not the thing you want.
The idea behind dedup is simple: when OpenZFS prepares to write some data to disk, if that data is already on disk, it doesn’t do the write but instead adds a reference to the existing copy.
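To make that concrete, here is a toy sketch of the idea in Python. It is not how OpenZFS implements dedup: the `DedupStore` class is hypothetical, a dictionary stands in for the disk, and using a SHA-256 digest to decide whether two pieces of data are "the same" is an assumption made for the illustration.

```python
import hashlib


class DedupStore:
    """Toy dedup store. Hypothetical; an in-memory stand-in for a disk."""

    def __init__(self):
        self.blocks = {}    # digest -> data actually "on disk"
        self.refcount = {}  # digest -> how many references point at that data
        self.writes = 0     # how many real writes we had to do

    def write(self, data: bytes) -> str:
        # Assumption for this sketch: identical SHA-256 digest means identical data.
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blocks:
            # Already stored: skip the write, just add a reference.
            self.refcount[digest] += 1
        else:
            # New data: do the real write and record the first reference.
            self.blocks[digest] = data
            self.refcount[digest] = 1
            self.writes += 1
        return digest


store = DedupStore()
first = store.write(b"hello world")
second = store.write(b"hello world")    # duplicate: no new write
third = store.write(b"something else")  # new data: real write
```

After these three calls, only two real writes have happened, and the duplicated data carries two references. Even in this toy, every write has to consult the table first, which hints at where the cost of dedup comes from.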