As my testing continues, I'm wrestling with the chunk size options and would appreciate some advice, please. I've consulted the chunk-size wiki as well as the other discussions at 334 and on system design & performance issues.
As deduplication is one of my primary goals, I'm also following PR 456 with interest, especially in combination with a future advanced prune concept mentioned there.
From these, it seems that choosing a smaller chunk size (perhaps 1M) may tend to increase deduplication, which is both one of my primary goals and one of Duplicacy's main design goals. It also seems clear that a fixed chunk size (setting the min, max, and average sizes all to the same value) is best for database files, VM images, and the like.
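For concreteness, here is how I understand fixed-size chunking would be configured at init time, by setting all three sizes equal (the repository id and storage URL below are just placeholders):

```
# Fixed 1M chunks (min = avg = max), as suggested for DB files / VM images.
duplicacy init -c 1M -min 1M -max 1M my-repo-id b2://my-bucket
```

If I've read the wiki correctly, the sizes must also be powers of two, so 1M should be a valid choice here.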
However, in the general case (i.e. non-DB, non-VM files) I am attempting to learn a bit more about the variable chunk size algorithm, and in particular how setting -c 1M alone (without changing min and max as well) would tend to affect deduplication results. In other words, does the variability in chunk size under Duplicacy's control tend to improve deduplication? Based on what criteria, or under what conditions, does Duplicacy adjust the chunk size? Generally speaking, I tend to think that -c is a hint to Duplicacy and that it is best to leave the rest (min and max) under its control, but I'm hoping to better understand whether that is a good assumption, or whether others have found better results in practice with fixed-size chunks.
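To make my question about the variable-size algorithm more concrete: my current mental model is generic content-defined chunking, where a rolling hash over a sliding window decides the boundaries, the average size sets the boundary probability, and min/max only clamp the result. This is an illustrative sketch of that idea (a toy polynomial rolling hash, not Duplicacy's actual buzhash implementation, and the window width is an arbitrary choice of mine):

```python
def chunk_boundaries(data, min_size, avg_size, max_size, window=48):
    """Yield (start, end) offsets of content-defined chunks over `data`.

    A chunk boundary is declared when the low bits of the rolling hash are
    all ones (expected once every avg_size bytes, so avg_size must be a
    power of two), but never before min_size bytes into the chunk, and it
    is forced at max_size bytes regardless of the hash.
    """
    MOD = 1 << 32
    POW = pow(31, window - 1, MOD)   # coefficient of the byte leaving the window
    mask = avg_size - 1
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start              # 0-based offset of this byte within the chunk
        if pos >= window:
            # slide the window: drop the contribution of the oldest byte
            h = (h - data[i - window] * POW) % MOD
        h = (h * 31 + b) % MOD
        size = pos + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append((start, i + 1))
            start = i + 1
            h = 0                    # restart the hash for the next chunk
    if start < len(data):
        chunks.append((start, len(data)))  # trailing partial chunk
    return chunks
```

Under this model, inserting bytes near the front of a file only shifts the boundaries locally, so most later chunks (and their hashes) are unchanged; that resynchronization is what I understand the variability to buy over fixed-size chunks. What I'm unsure about is how closely Duplicacy's actual behavior, and the default min = avg/4, max = avg*4 relationship, matches this picture.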