Chunk size details

TheBestPessimist · 2 August 2018 07:17

Duplicacy adopts a unique pack-and-split method to split files into chunks. First, all files to be backed up are packed together, in alphabetical order, as if building an imaginary tar file. This is only conceptual, of course; otherwise Duplicacy would quickly run out of memory in most use cases. This imaginary tar file is then split into chunks using a variable size chunking algorithm. The default settings of the init command generate chunks that are 4M bytes on average (although the actual averages may vary), at least 1M bytes and at most 16M bytes.
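The pack-and-split idea can be sketched in a few lines of Python. This is a minimal illustration, not Duplicacy's implementation: the gear-style rolling hash, the function names `pack` and `split`, and the KiB-scale sizes (chosen so the example runs quickly) are all assumptions made for the sketch.

```python
import random

# Assumption: a gear-style rolling hash stands in for Duplicacy's real one.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def pack(files):
    """Concatenate file contents in alphabetical path order (the 'imaginary tar')."""
    return b"".join(files[path] for path in sorted(files))


def split(stream, min_size=1 << 10, avg_size=1 << 12, max_size=1 << 14):
    """Split a byte stream at content-defined boundaries.

    avg_size must be a power of two. A boundary is declared when the low
    bits of the rolling hash are all zero, but never before min_size bytes
    and always once max_size bytes have accumulated.
    """
    boundary_mask = avg_size - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(stream):
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (h & boundary_mask) == 0 or length >= max_size:
            chunks.append(stream[start:i + 1])
            start, h = i + 1, 0
    if start < len(stream):
        chunks.append(stream[start:])  # trailing chunk may be shorter than min_size
    return chunks


# Hypothetical repository: paths map to file contents.
files = {
    "docs/b.txt": bytes(_rng.getrandbits(8) for _ in range(30_000)),
    "docs/a.txt": bytes(_rng.getrandbits(8) for _ in range(50_000)),
}
chunks = split(pack(files))
print(len(chunks), "chunks, sizes from", min(map(len, chunks)), "to", max(map(len, chunks)))
```

Because boundaries depend on the bytes themselves rather than on absolute offsets, an insertion early in the stream tends to disturb only the chunks near the change.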

This pack-and-split method has two implications. First, any files smaller than the minimum chunk size will not individually benefit from deduplication. For instance, any change to a file that is 100K bytes or so will cause the entire file to be uploaded again (as a separate chunk, or as part of a chunk if there are other changed files). On the other hand, when a directory consisting of many small files is moved or renamed, these small files will be packed in the same order, so most of the chunks generated after the move or rename will remain unchanged, except for a few at the beginning and at the end that are likely affected by files in adjacent directories.

Another implication is that chunks do not usually align with file boundaries. As a result, when a file larger than the average chunk size is moved or renamed, the pack-and-split procedure will produce several new chunks at the beginning and the end of the file. At the same time, if the directory where the file originally resided is backed up again (using the -hash option), then the hole left by this file will also cause several new chunks to be generated. There have been lengthy discussions on this topic, such as "Inefficient Storage after Moving or Renaming Files?" and "System design & performance issues".

While there are techniques to achieve 'perfect' deduplication, keep in mind that the overhead from this deduplication inefficiency is not unbounded. Specifically, the overhead is roughly proportional to the chunk size:

    overhead = a * c * chunk_size

where c is the number of changes and a is a small number representing the number of new chunks caused by each change. Therefore, reducing the average chunk size can improve the deduplication ratio to a satisfactory level:

    duplicacy init -c 1M repository_id storage_url

A chunk size smaller than 1M bytes isn't generally recommended, because with small chunks the overhead from transferring each chunk, as well as from the chunk lookup performed before each upload, will start to dominate (although this can be partially alleviated by using multiple uploading threads).
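To put the overhead estimate above into numbers, here is a small worked example. The value a = 2 and the count of 100 changes are assumptions picked purely for illustration; the function name `dedup_overhead` is hypothetical.

```python
MIB = 1024 * 1024


def dedup_overhead(changes, chunk_size, new_chunks_per_change=2):
    """overhead = a * c * chunk_size, with a assumed to be 2 by default."""
    return new_chunks_per_change * changes * chunk_size


# Compare the default 4M average chunk size with a 1M average.
for chunk_size in (4 * MIB, 1 * MIB):
    extra = dedup_overhead(changes=100, chunk_size=chunk_size)
    print(f"{chunk_size // MIB}M chunks: about {extra // MIB} MiB of extra upload")
```

Under these assumptions, going from 4M to 1M chunks cuts the estimated overhead by a factor of four, at the cost of four times as many chunks to look up and transfer.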

Fixed Size Chunking

Certain types of files, such as virtual machine disks, databases, and encrypted containers, are always updated in place and are never subject to insertions and deletions. For files of this kind, the default variable size chunking algorithm in Duplicacy is needlessly complicated, as it incurs the unnecessary overhead of calculating the rolling hash. The recommended configuration is to set all three chunk size parameters in the init or add command to the same value:

    duplicacy init -c 1M -min 1M -max 1M repository_id storage_url

Duplicacy will then switch to the fixed size chunking algorithm, which is faster and leads to higher deduplication ratios. In fact, Vertical Backup, a special edition of Duplicacy built for VMware ESXi to back up virtual machines, uses this configuration by default, and it has proven to work well in practice.

One important thing to note is that, when the fixed size chunking algorithm is chosen, Duplicacy doesn't use the default pack-and-split method. Instead, no packing is performed and each file is split into chunks individually. A file smaller than the chunk size will be uploaded in a single chunk. Therefore, the fixed size chunking algorithm is appropriate only when there are a limited number of files.
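The behaviour described above can be illustrated with a short Python sketch. It is not Duplicacy's code: the helper names, the SHA-256 chunk IDs, and the 1 KiB chunk size are assumptions chosen to keep the example small.

```python
import hashlib
import random


def fixed_chunks(data, chunk_size):
    """Split one file's bytes into fixed-size pieces; the last may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_ids(data, chunk_size):
    """Identify each chunk by a hash of its content (SHA-256 is an assumption)."""
    return [hashlib.sha256(c).hexdigest() for c in fixed_chunks(data, chunk_size)]


CHUNK = 1024
rng = random.Random(0)
image = bytearray(rng.getrandbits(8) for _ in range(16 * CHUNK))  # stand-in for a disk image
before = chunk_ids(bytes(image), CHUNK)

# In-place update: flip one byte at offset 5000, which lies in chunk 5000 // 1024 == 4.
image[5000] ^= 0xFF
after = chunk_ids(bytes(image), CHUNK)
changed = [i for i, (x, y) in enumerate(zip(before, after)) if x != y]
print("chunks changed by in-place update:", changed)

# Insertion: one extra byte at the front shifts every fixed boundary.
shifted = chunk_ids(b"\x00" + bytes(image), CHUNK)
moved = [i for i, (x, y) in enumerate(zip(after, shifted)) if x != y]
print("chunks changed by a 1-byte insertion:", len(moved), "of", len(after))
```

An in-place write touches only the chunk that contains it, while a single inserted byte shifts every later boundary. That is why fixed size chunking suits files that are only ever updated in place.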

Christoph · 25 August 2018 08:56