Deduplication and Variable vs Fixed Chunk Size

skidvd · 13 July 2018 14:15

Hi all,

As my testing continues, I’m wrestling with chunk size options and would appreciate some advice please. I’ve consulted the chunk-size wiki as well as the other discussions at 334 and system design & performance issues.

As deduplication is one of my primary goals, I’m also following PR 456 with interest - especially in combination with a future advanced prune concept mentioned there.

From these, it seems that choosing a smaller (perhaps 1M) chunk size may tend to increase deduplication (one of my primary goals and one of Duplicacy’s main design goals). It also seems clear that choosing a fixed chunk size (setting the min, max and avg sizes all to the same value) is best for DB files, VMs, etc.

However, in the general (i.e. non DB, VM, etc) file case, I am attempting to learn a bit more about the variable chunk size algorithm and to determine how setting -c 1M alone (without changing min and max as well) would tend to affect deduplication results. In other words, does the variability in chunk size under Duplicacy’s control tend to increase deduplication? Based upon what criteria, or in what conditions, does Duplicacy adjust the chunk size? Generally speaking, I tend to think that -c is a hint to Duplicacy and that it would be best to leave the rest (min and max) under its control, but I’m hoping to better understand whether that is a good assumption or whether others have found better results in practice with fixed-size chunks.

TIA!

towerbr · 13 July 2018 15:27

I’m getting good results with 128kb-1M-4M, but of course this depends heavily on the set of files (size, update frequency, number of files, etc.). My goal is “small uploads”, which in a way is related to the deduplication.

Yep! Likewise, I’m getting good results with 1M-1M-1M for mbox files, Veracrypt volumes and SQL databases.

I’m also following PR 456 with great interest.

As Gilbert said in a post somewhere, Duplicacy tends to create chunks with sizes close to the configured average chunk size. Take a look: Test #9: Test of wide range chunk setup

gchen · 13 July 2018 16:29

Fixed-size chunking is recommended when the following two conditions are met:

1. You only have a small number of files to back up.
2. Large files are only updated in-place (i.e., no insertion or deletion in the middle of a file).

By default Duplicacy packs files first and then splits them into chunks, but when fixed-size chunking is chosen, this approach won’t work, so it will split files directly. As a result, files smaller than the chunk size will be uploaded individually, so you can’t have too many files (condition 1).
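
To put rough numbers on the paragraph above, here is a minimal Python sketch (not Duplicacy's actual code) that counts chunks under the two strategies. The function names and the example of 5,000 files of 10 KiB each are made up for illustration, and the packed case is approximated by dividing the total size by the chunk size, whereas real variable-size chunks only average out near that size.

```python
import math

def chunks_when_packed(file_sizes, chunk_size):
    """Approximate chunk count when files are packed together, then split."""
    total = sum(file_sizes)
    return math.ceil(total / chunk_size) if total else 0

def chunks_when_split_per_file(file_sizes, chunk_size):
    """Chunk count when every file is split on its own (fixed-size mode)."""
    return sum(math.ceil(size / chunk_size) for size in file_sizes if size > 0)

# Hypothetical workload: 5,000 small files of 10 KiB each, 1 MiB chunks.
sizes = [10 * 1024] * 5000
chunk_size = 1024 * 1024

print(chunks_when_packed(sizes, chunk_size))          # 49
print(chunks_when_split_per_file(sizes, chunk_size))  # 5000
```

Under these assumptions, every small file becomes its own upload in the per-file case, which is why a large number of small files is a poor fit for fixed-size chunking.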

Fixed-size chunking is faster and more space-efficient than variable-size chunking. For instance, a one-byte change in a file for fixed-size chunks is guaranteed to generate one new chunk only, whereas for variable-size chunking, in theory it can generate a large number of new chunks (although it rarely happens).
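
The difference between in-place edits and insertions can be tried out with a small Python sketch. This is a toy, not Duplicacy's algorithm: the variable-size chunker below cuts wherever the CRC32 of the last 16 bytes is divisible by the average size, with minimum and maximum sizes assumed to be avg/4 and avg*4. All names, sizes and the random test data are illustrative assumptions.

```python
import hashlib
import random
import zlib

def fixed_chunks(data, size=1024):
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, avg=1024, window=16):
    """Toy content-defined chunker: cut where a checksum of the last
    `window` bytes hits a target value, within assumed min/max bounds."""
    min_size, max_size = avg // 4, avg * 4
    chunks, start = [], 0
    for i in range(1, len(data) + 1):
        length = i - start
        if length < min_size:
            continue
        at_boundary = zlib.crc32(data[i - window:i]) % avg == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

def count_new(old_chunks, new_chunks):
    """Number of chunks in the new backup that the old backup lacks."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() not in seen)

# 64 KiB of reproducible pseudo-random data standing in for a file.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

edited = bytearray(data)
edited[5000] ^= 0xFF                            # one-byte in-place change
edited = bytes(edited)
inserted = data[:5000] + b"X" + data[5000:]     # one-byte insertion

for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
    base = chunker(data)
    print(name,
          "| in-place edit:", count_new(base, chunker(edited)),
          "| insertion:", count_new(base, chunker(inserted)))
```

With fixed-size chunks, the in-place edit should produce exactly one new chunk, while the insertion shifts every later boundary and so invalidates the rest of the file (hence condition 2). The toy variable-size chunker is expected to resynchronize shortly after the insertion, though how many chunks it rewrites depends on the data.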