Duplicacy paper data loss

Hello,
I am currently working on an implementation of the paper “Duplicacy: A New Generation of Cloud Backup Tool Based on Lock-Free Deduplication” which describes the basic algorithm behind Duplicacy.

The paper is very interesting, but I found two scenarios in which the proposed algorithm can cause data loss. To make sure they are not caused by a misunderstanding on my part, I want to present them here and perhaps find out how Duplicacy handles them.

Scenario 1: Race condition during fossil collection and deletion

The paper claims that fossil collection and fossil deletion remain safe without locks even when performed concurrently by different processes. In the following scenario, however, the algorithm deletes a fossil that is still referenced by a manifest file, which leads to data loss as follows:

  1. Client 3 creates a manifest file called #1 referencing a chunk that is not referenced by any other manifest file.
  2. Client 2 decides to prune manifest #1 and performs a fossil collection step. This leaves the chunk as a fossil.
  3. Client 2 creates a manifest file called #2 referencing the same chunk.
    This causes a new chunk to be uploaded.
  4. All other clients create an unrelated manifest file, allowing Client 2 to proceed with the fossil deletion step.
  5. Client 1 decides to prune manifest #2 and performs a fossil collection step. This leaves the chunk as a fossil (replacing the existing one).
  6. Client 2 performs the fossil deletion step, which causes the fossil to be deleted since it is not referenced by any manifest file.
  7. Client 3 creates a manifest file called #3, which references the chunk, but the check for its existence happened during step 5, before the chunk was renamed into a fossil. This prevents a new chunk from being uploaded and leaves manifest #3 referencing a chunk that no longer exists.

(Diagram attachment: fossil_deletion.drawio)

I think the reason for this behavior is that after step 5, two fossil collections exist that contain the same fossil.
Since fossils from different fossil collections are not kept separate, a fossil can be deleted by any fossil deletion step that processes a collection containing it, even when Policy 3 would still prevent the other fossil collection from proceeding with its own deletion step.
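
To make this concrete, here is a minimal sketch in Go of the state after step 5 under my simplified model of the paper (the names fossilCollection and deleteFossils are my own, not Duplicacy's code): the same fossil is listed in two independent fossil collections, and whichever deletion step runs first removes it from storage for both.

```go
package main

import "fmt"

// Simplified model: a fossil collection is just the set of chunk IDs that
// were renamed to fossils during its collection step, and the deletion step
// removes every fossil in the collection that no manifest created since then
// references. All names here are invented for illustration.
type fossilCollection struct {
	owner   string
	fossils map[string]bool // chunk ID -> turned into a fossil by this collection
}

// deleteFossils mimics the fossil deletion step of one client: any fossil in
// its own collection that is not referenced by a manifest created after the
// collection step is permanently removed from storage.
func deleteFossils(fc fossilCollection, newManifests map[string][]string, storage map[string]bool) {
	referenced := map[string]bool{}
	for _, chunks := range newManifests {
		for _, c := range chunks {
			referenced[c] = true
		}
	}
	for f := range fc.fossils {
		if !referenced[f] {
			delete(storage, f) // gone for every collection, not just this one
		}
	}
}

func main() {
	// chunk-A is currently stored as a fossil.
	storage := map[string]bool{"chunk-A": true}

	// After step 5 the same fossil appears in two independent collections.
	fc2 := fossilCollection{owner: "client 2", fossils: map[string]bool{"chunk-A": true}}
	fc1 := fossilCollection{owner: "client 1", fossils: map[string]bool{"chunk-A": true}}

	// Step 6: client 2's deletion step only looks at its own collection and
	// the manifests created since its collection step, none of which
	// reference chunk-A, so the fossil is deleted...
	deleteFossils(fc2, map[string][]string{"unrelated": {"chunk-B"}}, storage)

	// ...even though client 1's collection still lists it and Policy 3 has
	// not yet allowed client 1 to run its own deletion step.
	fmt.Println("chunk-A still in storage:", storage["chunk-A"], "- fc1 still tracks it:", fc1.fossils["chunk-A"])
}
```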

This scenario can be prevented by limiting the fossil collection and deletion steps to a single client (and allowing only one fossil collection to exist at a time).
I’m wondering whether this problem has been discovered before – at least in the Duplicacy implementation, I see that the prune command should only be run by a single client.
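
As a rough illustration of that restriction (names like startFossilCollection are invented; this is not how Duplicacy actually stores collections), a client could simply refuse to start a new fossil collection while another one is still pending, so the same fossil can never end up in two collections:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errCollectionPending = errors.New("another fossil collection is still pending")

// collectionExists reports whether a pending fossil collection file is
// present in storage (modeled here as a set of file names).
func collectionExists(storage map[string]bool) bool {
	for name := range storage {
		if strings.HasPrefix(name, "fossil-collection-") {
			return true
		}
	}
	return false
}

// startFossilCollection refuses to begin a new collection step while an
// earlier collection has not yet completed its deletion step.
func startFossilCollection(storage map[string]bool, id string) error {
	if collectionExists(storage) {
		return errCollectionPending
	}
	storage["fossil-collection-"+id] = true
	return nil
}

func main() {
	storage := map[string]bool{}
	fmt.Println(startFossilCollection(storage, "1")) // <nil>
	fmt.Println(startFossilCollection(storage, "2")) // another fossil collection is still pending
}
```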

Scenario 2: Manifest files being replaced

While the paper does not explicitly mention clients replacing manifest files, such a scenario can happen if a client does not take precautions against it. In the following scenario, this leads to a new manifest file that is not identified as such, causing a fossil to be deleted instead of being recovered:

  1. Client 1 creates a manifest file called #1.
  2. Client 1 performs a fossil collection step that does not prune #1 but turns a chunk into a fossil.
  3. Client 2 creates another manifest file called #1 that references the chunk turned into a fossil, but the check for its existence happened before it was turned into a fossil.
    This causes no new chunk to be uploaded.
  4. All clients create an unrelated manifest file to allow Client 1 to proceed with the fossil deletion step.
  5. Client 1 performs the fossil deletion step, which ignores manifest #1 (since a manifest with that name was already seen by the fossil collection step) and therefore deletes the referenced fossil.

(Diagram attachment: manifest_replacement.drawio)

In the example above, I assumed that the “diff” function used by the fossil deletion algorithm works on manifest file names and therefore cannot differentiate between different manifest files that share the same name.
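
Here is a minimal sketch in Go of that assumption (the function name newManifestsSince is my own, not Duplicacy's): a diff that compares only manifest names never reports the replaced #1 as new, so its reference to the fossil is never checked before deletion.

```go
package main

import "fmt"

// newManifestsSince returns the manifest names present now that were not
// present at collection time. Comparing by name only is exactly the
// assumption discussed above: a manifest replaced under the same name is
// invisible to this diff.
func newManifestsSince(seenAtCollection, presentNow []string) []string {
	seen := map[string]bool{}
	for _, name := range seenAtCollection {
		seen[name] = true
	}
	var fresh []string
	for _, name := range presentNow {
		if !seen[name] {
			fresh = append(fresh, name)
		}
	}
	return fresh
}

func main() {
	seenAtCollection := []string{"#1"}        // client 1's original manifest
	presentNow := []string{"#1", "unrelated"} // "#1" was silently replaced by client 2

	// Only "unrelated" is reported as new; the replaced "#1" and its
	// reference to the fossil are never examined before deletion.
	fmt.Println(newManifestsSince(seenAtCollection, presentNow)) // [unrelated]
}
```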

I hope the presented scenarios are understandable. Any feedback on whether my observations are right or wrong would be very helpful.

Prune does not typically remove the most recent ‘manifest’ of a snapshot ID. It’s protected from prune unless an out-of-bounds -exclusive prune is run, which bypasses two-step fossilisation. So this scenario would never happen - even if you ignored the advice to only run prune on one client.

Correct. Prune should only be run by one client.

In this scenario, you seem to be using the same snapshot ID for multiple clients, which is a big no-no. Each client (specifically, each backup source repository) should have its own #1 ‘manifest’, or ID. They shouldn’t be shared between repositories, let alone clients.

Edit: Incidentally, some of these corner cases are already mentioned for the prune command:


Thanks for the reply; it has been very helpful.

While this is true, this protection can be sidestepped by Client 3 creating more manifest files (which simply don’t reference that one chunk) between steps 1 and 2, with Client 2 doing the same between steps 3 and 5.
The quoted limitation of initial backups being at risk can also be sidestepped by having every client create at least one manifest (which, again, doesn’t reference that one chunk) at the beginning.
The end result would still be the premature removal of the fossil, since it essentially “moves” between fossil collections.

What I am wondering is whether this problem was explicitly considered when deciding on the limitations of the prune command (since this limitation is not mentioned in the paper), or whether it was solved as a side effect of:

This is true; I initially assumed that such an operation was allowed since the paper did not explicitly disallow it. After thinking about the nature of manifests (they are essentially read-only after creation) and the impact such operations would have on other features, I think disallowing them is a valid solution.