Deduplication vs. compression, and the difference between a rolling chunker and fixed block sizes

Hey guys,

just wanted to share a survey I did for a piece of software (IPFS) comparing a rolling chunker vs. fixed-size chunking with different bounds/chunk sizes, and also comparing both against light compression via zstd -1 --long for different data types.

Sidenote: Buzhash and Rabin are just two different algorithms for rolling chunking. The default parameters for Buzhash are min: 128 KByte and max: 512 KByte; for Rabin they are min: ~85 KByte, avg: 256 KByte and max: 384 KByte.
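To make the rolling-chunker idea a bit more concrete, here's a minimal Go sketch of content-defined chunking with min/max bounds. It uses a toy polynomial rolling hash instead of the actual Buzhash byte table or Rabin fingerprints that IPFS ships, and the window size and cut mask are assumptions picked for illustration; only the 128 KByte / 512 KByte bounds are the defaults mentioned above.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const (
	minSize = 128 << 10 // 128 KByte: IPFS Buzhash default minimum chunk size
	maxSize = 512 << 10 // 512 KByte: IPFS Buzhash default maximum chunk size
	window  = 32        // sliding-window width in bytes (assumption)
	base    = 31        // polynomial base of the toy rolling hash (assumption)
	mask    = 0x3FFFF   // cut when hash&mask == 0, ~256 KByte average spacing (assumption)
)

// chunk returns cut points so that data[prev:cut] are the resulting chunks.
func chunk(data []byte) []int {
	// base^(window-1), needed to remove the byte that leaves the window
	var pow uint32 = 1
	for i := 0; i < window-1; i++ {
		pow *= base
	}

	var cuts []int
	start := 0
	var hash uint32
	for i, b := range data {
		if i-start >= window {
			hash -= uint32(data[i-window]) * pow // drop the outgoing byte
		}
		hash = hash*base + uint32(b) // add the incoming byte
		size := i - start + 1
		switch {
		case size < minSize:
			// never cut below the minimum chunk size
		case hash&mask == 0, size >= maxSize:
			cuts = append(cuts, i+1)
			start, hash = i+1, 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data)) // trailing partial chunk
	}
	return cuts
}

func main() {
	data := make([]byte, 4<<20) // 4 MByte of random bytes just to exercise the chunker
	rand.Read(data)
	fmt.Printf("%d chunks from %d bytes\n", len(chunk(data)), len(data))
}
```

The point of the min/max bounds is visible in the loop: the hash alone decides where to cut, so the cut positions move with the content, but the bounds keep the chunk size (and thus the per-chunk overhead) within a predictable range.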

TLDR:

  • Rolling chunking is great!
  • Choose a good compromise between the overhead of small chunks and chunks that are too large to still find similarities
  • Good options were 12K-24K-48K and 16K-32K-64K (min-avg-max), as well as 128K-512K (due to the much lower overhead in areas where no deduplication can occur)
  • Sufficiently small static chunks can also work, since most data might be aligned to some underlying layout like 4K or 8K anyway.
  • Duplicacy uses much larger chunks, and thus may perform differently
  • Compression across several versions of the data with zstd is most likely better, but deduplication with a rolling chunker can already eliminate some redundancies before compression takes place (see the sketch after this list).
  • Compression in a backup application is usually not capable of finding those redundancies, because the data is large and the compression is limited to individual chunks or versions of a file.
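Since the last two points are the crux of the argument, here's a rough Go sketch of how chunk-level deduplication is typically measured: every chunk is content-addressed (SHA-256 here), identical chunks across file versions are stored only once, and whatever remains can still be handed to a compressor such as zstd afterwards. The chunker is passed in as a function, and the 16 KByte block size and test payload are assumptions for illustration, not part of the original survey setup.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

// dedupRatio chunks every version, counts identical chunks only once, and
// returns total vs. unique bytes. The chunker is a parameter so the same
// measurement works for fixed-size blocks or a content-defined chunker.
func dedupRatio(versions [][]byte, chunker func([]byte) []int) (total, unique int) {
	seen := make(map[[32]byte]bool)
	for _, data := range versions {
		prev := 0
		for _, cut := range chunker(data) {
			c := data[prev:cut]
			total += len(c)
			id := sha256.Sum256(c) // content address of the chunk
			if !seen[id] {
				seen[id] = true
				unique += len(c) // this chunk actually has to be stored
			}
			prev = cut
		}
	}
	return total, unique
}

// fixedChunk splits data into static 16 KByte blocks for comparison.
func fixedChunk(data []byte) []int {
	var cuts []int
	for i := 16 << 10; i < len(data); i += 16 << 10 {
		cuts = append(cuts, i)
	}
	return append(cuts, len(data))
}

func main() {
	v1 := make([]byte, 4<<20)
	rand.New(rand.NewSource(1)).Read(v1) // deterministic pseudo-random payload

	// version 2: a small insertion near the front shifts every later byte,
	// which breaks fixed-size blocks but not content-defined chunks
	v2 := append([]byte("a small edit that shifts everything after it"), v1...)

	total, unique := dedupRatio([][]byte{v1, v2}, fixedChunk)
	fmt.Printf("fixed 16K blocks: %d of %d bytes need to be stored\n", unique, total)
}
```

With the fixed splitter almost nothing deduplicates once the insertion has shifted the data, while swapping in a content-defined chunker (like the sketch further up) lets the unchanged tail line up again, which is exactly the kind of redundancy a compressor limited to individual chunks or file versions never gets to see.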