Lock free deduplication algorithm

gchen · 2 August 2018 07:37

The three elements of lock-free deduplication are:

Use variable-size chunking algorithm to split files into chunks
Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
Apply a two-step fossil collection algorithm to remove chunks that become unreferenced after a backup is deleted

Variable-size chunking algorithm

The variable-size chunking algorithm, also called Content-Defined Chunking, is well-known and has been adopted by many backup tools. The main advantage of the variable-size chunking algorithm over the fixed-size chunking algorithm (as used by rsync) is that in the former the rolling hash is only used to search for boundaries between chunks, after which a far more collision-resistant hash function like MD5 or SHA256 is applied on each chunk. In contrast, in the fixed-size chunking algorithm, for the purpose of detecting inserts or deletions, a lookup in the known hash table is required every time the rolling hash window is shifted by one byte, thus significantly reducing the chunk splitting performance.

What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are not needed any more. Instead, to check if a chunk has already been uploaded before, one can just perform a file lookup via the file storage API using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage.

By eliminating the chunk indexing database, lock-free duplication not only reduces the code complexity but also makes the deduplication less error-prone. Each chunk is saved individually in its own file, and once saved there is no need for modification. Data corruption is therefore less likely to occur because of the immutability of chunk files. Another benefit that comes naturally from lock-free duplication is that when one client creates a new chunk, other clients that happen to have the same original file will notice that the chunk already exist and therefore will not upload the same chunk again. This pushes the deduplication to its highest level – clients without knowledge of each other can share identical chunks with no extra effort.

There is one problem, though.

Deletion of snapshots without an indexing database, when concurrent access is permitted, turns out to be a hard problem. If exclusive access to a file storage by a single client can be guaranteed, the deletion procedure can simply search for chunks not referenced by any backup and delete them. However, if concurrent access is required, an unreferenced chunk can’t be trivially removed, because of the possibility that a backup procedure in progress may reference the same chunk. The ongoing backup procedure, still unknown to the deletion procedure, may have already encountered that chunk during its file scanning phase, but decided not to upload the chunk again since it already exists in the file storage.

Fortunately, there is a solution to address the deletion problem and make lock-free deduplication practical. The solution is a two-step fossil collection algorithm that deletes unreferenced chunks in two steps: identify and collect them in the first step, and then permanently remove them once certain conditions are met.

Two-Step Fossil Collection

Interestingly, the two-step fossil collection algorithm hinges on a basic file operation supported almost universally, file renaming. When the deletion procedure identifies a chunk not referenced by any known snapshots, instead of deleting the chunk file immediately, it changes the name of the chunk file (and possibly moves it to a different directory). A chunk that has been renamed is called a fossil.

The fossil still exists in the file storage. Two rules are enforced regarding the access of fossils:

A restore, list, or check procedure that reads existing backups can read the fossil if the original chunk cannot be found.
A backup procedure does not check the existence of a fossil. That is, it must upload a chunk if it cannot find the chunk, even if an equivalent fossil exists.

In the first step of the deletion procedure, called the fossil collection step, the names of all identified fossils will be saved in a fossil collection file. The deletion procedure then exits without performing further actions. This step has not effectively changed any chunk references due to the first fossil access rule. If a backup procedure references a chunk after it is marked as a fossil, a new chunk will be uploaded because of the second fossil access rule, as shown in Figure 1.

The second step, called the fossil deletion step, will permanently delete fossils, but only when two conditions are met:

For each snapshot id, there is a new snapshot that was not seen by the fossil collection step
The new snapshot must finish after the fossil collection step

The first condition guarantees that if a backup procedure references a chunk before the deletion procedure turns it into a fossil, the reference will be detected in the fossil deletion step which will then turn the fossil back into a normal chunk.

The second condition guarantees that any backup procedure unknown to the fossil deletion step can start only after the fossil collection step finishes. Therefore, if it references a chunk that was identified as fossil in the fossil collection step, it should observe the fossil, not the chunk, so it will upload a new chunk, according to the second fossil access rule.

Therefore, if a backup procedure references a chunk before the chunk is marked a fossil, the fossil deletion step will not delete the chunk until it sees that backup procedure finishes (as indicated by the appearance of a new snapshot file uploaded to the storage). This ensures that scenarios depicted in Figure 2 will never happen.

This post is also found in the Duplicacy github repository at GitHub - gilbertchen/duplicacy: A new generation cloud backup tool
Please also modify the post there when you edit this #how-to : Lock Free Deduplication · gilbertchen/duplicacy Wiki · GitHub

deathbynukes · 11 November 2019 01:08

I’ve come back to this page multiple times throughout the week and I can’t figure out what this section is trying to say. Finding out that “snapshot id” actually means “repository id” helps make it a bit less inane but I still don’t know how lock-free fossil deletion is possible.

Condition 1 seems backwards… I would think that finding out that someone else is simultaneously making a backup would be irrelevant or at least a condition for NOT continuing with deletion.

I can’t tell what “new snapshot” Condition 2 is talking about. If it’s referring to the simultaneous backup by another user in Condition 1, this “condition” isn’t even a real testable condition from the deletion procedure’s perspective, it’s a constraint placed on another user through a lock or some other communications channel.

gchen · 11 November 2019 04:39

A snapshot file is uploaded at the end of the backup, so when you see a new snapshot file in the storage, the corresponding backup is no longer ongoing or simultaneous – it is finished.

Yes, it is. What gives us trouble is a backup that starts before the fossil collection step ends and is still ongoing when the fossil deletion step starts. Conditions 1 and 2 guarantee that such a backup won’t be possible. Condition 1 alone isn’t enough, because if there is a backup that starts after the fossil collection step starts but ends before the fossil collection step ends, then theoretically it is possible that another backup may start before the fossil collection step ends.