FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication

submitted by
Style Pass
2024-10-19 06:30:03

Wen Xia, Huazhong University of Science and Technology and Sangfor Technologies Co., Ltd.; Yukun Zhou, Huazhong University of Science and Technology; Hong Jiang, University of Texas at Arlington; Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, and Qing Liu, Huazhong University of Science and Technology

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a Fast and efficient CDC approach, that builds and improves on the latest Gear-based CDC approach, one of the fastest CDC methods to our knowledge. The key idea behind FastCDC is the combined use of three key techniques, namely, simplifying and enhancing the hash judgment to address our observed challenges facing Gear-based CDC, skipping sub-minimum chunk cut-point to further speed up CDC, and normalizing the chunk-size distribution in a small specified region to address the problem of the decreased deduplication ratio stemming from the cut-point skipping. Our evaluation results show that, by using a combination of the three techniques, FastCDC is about 10x faster than the best of open-source Rabin-based CDC, and about 3x faster than the state-of-the-art Gear- and AE-based CDC, while achieving nearly the same deduplication ratio as the classic Rabin-based approach.
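
To make the three techniques in the abstract concrete, here is a minimal Python sketch of a Gear-style chunker with sub-minimum skipping and a two-mask normalization step. It is an illustration under stated assumptions, not the authors' implementation: the chunk-size limits, the randomly seeded Gear table, the mask widths (15 and 11 bits, taken from the top of a 64-bit fingerprint), and the names `next_cut` and `chunks` are all hypothetical choices made for this example. The paper's actual mask constants and parameters may differ.

```python
import random

MASK64 = (1 << 64) - 1

# Assumption: a Gear table of 256 random 64-bit values, seeded for reproducibility.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

# Assumption: illustrative chunk-size limits in bytes.
MIN_SIZE, AVG_SIZE, MAX_SIZE = 2 * 1024, 8 * 1024, 64 * 1024


def _top_bits_mask(bits):
    """Mask selecting the top `bits` bits of a 64-bit fingerprint."""
    return ((1 << bits) - 1) << (64 - bits)


# Normalization: a stricter mask before AVG_SIZE makes early cuts rarer,
# and a looser mask after AVG_SIZE makes late cuts likelier.
MASK_HARD = _top_bits_mask(15)
MASK_EASY = _top_bits_mask(11)


def next_cut(data, start):
    """Return the end offset of the chunk that begins at `start`."""
    remaining = len(data) - start
    if remaining <= MIN_SIZE:
        return len(data)
    end = start + min(remaining, MAX_SIZE)
    normal = start + min(remaining, AVG_SIZE)
    fp = 0
    i = start + MIN_SIZE  # skip the sub-minimum region without hashing it
    while i < normal:
        # Gear hash update: one shift, one add, one table lookup per byte.
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if (fp & MASK_HARD) == 0:  # hash judgment
            return i + 1
        i += 1
    while i < end:
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if (fp & MASK_EASY) == 0:
            return i + 1
        i += 1
    return end  # forced cut at MAX_SIZE or at end of input


def chunks(data):
    """Split `data` (bytes) into a list of content-defined chunks."""
    out = []
    start = 0
    while start < len(data):
        cut = next_cut(data, start)
        out.append(data[start:cut])
        start = cut
    return out
```

Calling `chunks(buf)` on a `bytes` object returns pieces that concatenate back to `buf`. Because each cut-point depends on nearby content rather than on absolute offsets, inserting a byte near the start of the input should leave most later chunks unchanged, which is the property deduplication relies on.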

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
