Deduplication vs compression, and the difference between a rolling chunker and a fixed block size

cyrond · 7 May 2022 14:24

Hey guys,

I just wanted to share a survey I did for a piece of software (IPFS) comparing rolling chunking with fixed-size chunking at different bounds/chunk sizes, and comparing both with light compression via zstd -1 --long, for different data types:

Sidenote: Buzhash and Rabin are just two different algorithms for rolling chunking. The default parameters for Buzhash are min: 128 KByte and max: 512 KByte. For Rabin they are min: ~85 KByte, avg: 256 KByte and max: 384 KByte.
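
To make the min/max bounds a bit more concrete, here is a rough Python sketch of a Buzhash-style (cyclic polynomial) rolling chunker. It is not the IPFS implementation: the 32-byte window, the random lookup table and the 17-bit mask are my assumptions for illustration, and only the 128 KByte / 512 KByte defaults come from the numbers above.

```python
import random

MASK32 = 0xFFFFFFFF
WINDOW = 32  # bytes covered by the rolling hash (assumed value)

# Assumed lookup table: 256 pseudo-random 32-bit words from a fixed seed.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(32) for _ in range(256)]


def rotl32(x, r):
    """Rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def buzhash_chunks(data, min_size=128 * 1024, max_size=512 * 1024,
                   mask=(1 << 17) - 1):
    """Yield chunks of data, cutting where the rolling hash hits the mask.

    A cut is only allowed once the chunk has reached min_size, and it is
    forced when the chunk reaches max_size. The mask is an assumption.
    """
    start = 0
    h = 0
    for i in range(len(data)):
        # Slide the window: rotate, then mix in the incoming byte.
        h = rotl32(h, 1) ^ TABLE[data[i]]
        if i >= WINDOW:
            # Cancel the byte that just left the window.
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)
        size = i + 1 - start
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing chunk, may be shorter than min_size
```

Calling something like list(buzhash_chunks(blob, min_size=64, max_size=1024, mask=0xFF)) on a small test blob is a quick way to watch the bounds at work. Pure Python is slow, so the real defaults are only practical on small inputs.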

cyrond · 7 May 2022 14:58

TLDR:

Rolling chunking is great!
Choose a good compromise between the overhead of many small chunks and chunks that are too large to find similarities.
Good options were 12K-24K-48K and 16K-32K-64K, as well as 128K-512K (due to its much lower overhead in areas where no deduplication can occur).
Sufficiently small static chunks can also work, as most data might be aligned to some arbitrary layout like 4K or 8K.
Duplicacy uses much larger chunks and may therefore perform differently.
Compressing several versions of the data together with zstd most likely gives a better result. But deduplication with a rolling chunker can eliminate some redundancies before compression takes place.
Compression in a backup application is usually not capable of finding those redundancies, because the data is large and compression is limited to individual chunks or versions of a file.
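
For anyone who wants to see why rolling chunking copes with shifted data where fixed blocks do not, below is a small Python sketch. It is a toy, not what IPFS or Duplicacy do: I swapped in a simple gear-style rolling hash instead of Buzhash or Rabin to keep it short, and the chunk sizes, the random test data and all function names are made up for the demo.

```python
import hashlib
import random

# Assumed gear table: 256 pseudo-random 32-bit words from a fixed seed.
_rng = random.Random(1)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data, size):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rolling_chunks(data, min_size, max_size, bits):
    """Content-defined chunks: cut where the top `bits` hash bits are zero."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Gear-style update: older bytes shift out of the 32-bit word.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i + 1 - start
        if size >= max_size or (size >= min_size and (h >> (32 - bits)) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_bytes(old_chunks, new_chunks):
    """Bytes in new_chunks whose SHA-256 digest already exists in old_chunks."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(len(c) for c in new_chunks
               if hashlib.sha256(c).digest() in seen)


# Two "versions" of a file: the second has one byte inserted at the front.
rng = random.Random(42)
old = bytes(rng.getrandbits(8) for _ in range(200_000))
new = b"X" + old

fixed_reuse = shared_bytes(fixed_chunks(old, 4096), fixed_chunks(new, 4096))
rolling_reuse = shared_bytes(rolling_chunks(old, 1024, 8192, 11),
                             rolling_chunks(new, 1024, 8192, 11))

print("fixed 4K blocks, reusable bytes: ", fixed_reuse)
print("rolling chunker, reusable bytes: ", rolling_reuse)
```

With the inserted byte every fixed block boundary is off by one, so none of the old blocks can be reused, while the rolling chunker should find the old boundaries again after the first chunk or so. If the change had been an in-place overwrite aligned to 4K, the fixed blocks would do fine as well, which fits the point above about aligned data.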