Yes, your understanding is correct.
The problem boils down to the fact that the second backup overwrites the first backup’s revision file.
If that weren’t the case, everything would be fine and nothing would get corrupted.
I believe that the fix is to do the following:
Theoretical side
Fix the PDF’s proof of “Theorem 3”. It only considers 2 of the 6 ways in which backups A and B can interact.
# These two are considered in the PDF
Case A) A finishes before B starts
Backup --------------------------------------------------------------
       |     A     |              |     B     |
       -------------              -------------
Case B) B finishes before A starts
Backup --------------------------------------------------------------
       |     B     |              |     A     |
       -------------              -------------
# These are not considered in the PDF
Case C) B starts while A is running and finishes after A
Backup --------------------------------------------------------------
       |        A        |          |        |
       ------------------|-----------        |
                         |         B         |
                         ---------------------
Case D) A starts while B is running and finishes after B
Backup --------------------------------------------------------------
       |        B        |          |        |
       ------------------|-----------        |
                         |         A         |
                         ---------------------
Case E) B starts and finishes while A is running
Backup --------------------------------------------------------------
       |        A        |                   |          |
       ------------------|-------------------|-----------
                         |         B         |
                         ---------------------
Case F) A starts and finishes while B is running
Backup --------------------------------------------------------------
       |        B        |                   |          |
       ------------------|-------------------|-----------
                         |         A         |
                         ---------------------
If the proof considered the latter 4 cases, it would find that “Theorem 3” fails in at least one of them. The way to fix that is to add a “Policy 4” along these lines:
Policy 4:
- Any backup must try to write a revision file with revision_no one greater than that of the last finished backup it has "seen" # this is what the software does; the paper assumes it but does not state it explicitly
- Any backup must not (over)write a revision file that already exists # this is not what the software does, hence the race condition / corruption; it is unclear whether the paper assumes it
With that additional policy, I believe “Theorem 3” holds and its proof goes through again.
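To make the two rules of Policy 4 concrete, here is a minimal in-memory sketch of two overlapping backups. `RevisionStore`, `last_finished` and `write_revision` are hypothetical names made up for illustration, not anything from Duplicacy or the paper. A real backend would also need the existence check and the write to be a single atomic operation, which this toy model simply assumes.

```python
class RevisionStore:
    """In-memory stand-in for a storage backend (hypothetical, for illustration)."""

    def __init__(self):
        self.revisions = {}  # revision_no -> manifest

    def last_finished(self):
        """Highest revision number currently visible, or 0 if there is none."""
        return max(self.revisions, default=0)

    def write_revision(self, revision_no, manifest):
        """Policy 4, second rule: refuse to (over)write an existing revision."""
        if revision_no in self.revisions:
            return False
        self.revisions[revision_no] = manifest
        return True


store = RevisionStore()
store.write_revision(1, "initial backup")

# Backups A and B overlap: both look at the storage before either finishes,
# so both pick the same revision number (Policy 4, first rule).
rev_a = store.last_finished() + 1
rev_b = store.last_finished() + 1

a_ok = store.write_revision(rev_a, "manifest of A")
b_ok = store.write_revision(rev_b, "manifest of B")

print(rev_a, rev_b, a_ok, b_ok)  # 2 2 True False
print(store.revisions[2])        # manifest of A
```

Without the check in `write_revision`, B's manifest would silently replace A's, which is the overwrite described at the top. What B should do after being refused (retry with the next number, abort, ...) is left open here, as it is in the policy text.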
Engineering side
“I guess the solution here would be to claim the revision with a placeholder file that will be replaced with actual manifest upon successful completion of the backup”
This may work, yes, but let me propose another solution: create the revision file in exclusive mode. In the world of C (C11): fopen("/snapshot/revision", "wx"), or in the POSIX world: open("...", O_WRONLY | O_CREAT | O_EXCL, mode) (O_EXCL only has an effect together with O_CREAT), or in shell: set -C # noclobber.
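As a runnable illustration of exclusive creation, here is a minimal Python sketch. Python's `open(path, "x")` is the counterpart of C's "wx" mode. The function name `create_revision_file` and the temporary directory standing in for /snapshot are assumptions made for the example, not Duplicacy code.

```python
import os
import tempfile


def create_revision_file(path, manifest):
    """Create `path` only if it does not exist yet; return True on success."""
    try:
        # Mode "x" is exclusive creation: it raises FileExistsError
        # instead of truncating a file that is already there.
        with open(path, "x") as f:
            f.write(manifest)
    except FileExistsError:
        return False
    return True


snapshot_dir = tempfile.mkdtemp()  # throwaway directory standing in for /snapshot
revision_path = os.path.join(snapshot_dir, "revision")

first = create_revision_file(revision_path, "manifest of backup A")
second = create_revision_file(revision_path, "manifest of backup B")

with open(revision_path) as f:
    surviving = f.read()

print(first, second, surviving)  # True False manifest of backup A
```

The second call is refused, so the first backup's revision file survives untouched.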
I assume the code is different for each backend (filesystem, S3 API, etc…), but hopefully this is possible in every backend Duplicacy supports.
For example, in the “filesystem backend” the code probably looks something like this:
createRevisionFile() {
    let f = fopen("/snapshot/revision", "w");  // "w" truncates an existing revision file
    fwrite(f, ...);
}
or
createRevisionFile() {
    let f = fopen("/snapshot/revision.tmp", "w");
    fwrite(f, ...);
    rename("/snapshot/revision.tmp", "/snapshot/revision");  // rename() replaces an existing revision file
}
And should be changed to:
createRevisionFile() {
    let f = fopen("/snapshot/revision", "wx");  // "x": fopen fails if the file already exists
    fwrite(f, ...);
}
or
createRevisionFile() {
    let f = fopen("/snapshot/revision.tmp", "wx");
    fwrite(f, ...);
    // Linux-specific: fails with EEXIST instead of replacing the target
    renameat2(..., "/snapshot/revision.tmp", ..., "/snapshot/revision", RENAME_NOREPLACE);
}
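The temp-file variant can be sketched in Python too. Note the swapped-in technique: as far as I know Python's standard library does not expose renameat2, so this sketch publishes the file with a hard link (`os.link`), which also refuses to replace an existing target. It assumes the filesystem supports hard links. `publish_revision` and the per-backup temp name are illustrative choices, not Duplicacy code.

```python
import os
import tempfile


def publish_revision(snapshot_dir, manifest):
    """Write to a private temp file, then publish it without replacing."""
    final_path = os.path.join(snapshot_dir, "revision")
    # A unique temp name per backup, so concurrent backups never share it.
    fd, tmp_path = tempfile.mkstemp(dir=snapshot_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(manifest)
        # os.link raises FileExistsError if final_path already exists,
        # so an existing revision file is never replaced.
        os.link(tmp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        os.unlink(tmp_path)  # the temp name is not needed either way


snapshot_dir = tempfile.mkdtemp()  # throwaway directory standing in for /snapshot
first = publish_revision(snapshot_dir, "manifest of backup A")
second = publish_revision(snapshot_dir, "manifest of backup B")

with open(os.path.join(snapshot_dir, "revision")) as f:
    surviving = f.read()

print(first, second, surviving)  # True False manifest of backup A
```

Compared with creating the final file directly in exclusive mode, readers never observe a half-written revision file, because the name only appears once the content is complete.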
And in “S3 backend”:
createRevisionFile() {
    callS3API("createAndWriteFile", ...);
}
Should be changed to:
createRevisionFile() {
    callS3API("createAndWriteFileButOnlyIfNotExists", ...);
}
I have never used the S3 API, so the API names above are made up, but hopefully they convey the idea.