I recently experienced a failure in both of my backup drives, and I was able to recover most, but not all, of the files stored on them. I wasn’t actually using duplicacy at the time, but the event has led me to review backup software with an eye toward this kind of failure.
I pored over the snapshot file format as described, and I noticed it doesn’t seem very resilient to the loss of a metadata chunk. Granted, that should be rare, since metadata is generally much smaller than data; but then I noticed that metadata doesn’t deduplicate very well. Maybe that’s actually good, since the loss of a metadata chunk is then less likely to affect other snapshots; but I thought I’d bring it up in case you weren’t aware of it.
The issue boils down to keeping the chunk list separate from the file list.
- The loss of any chunk containing the chunk sequence means that snapshot is unrecoverable.
- If large enough changes occur to files early in the file sequence, then every file after that point gets chunk indices (StartChunk/EndChunk) that differ from the previous snapshot, which prevents the chunks encoding the file sequence from deduplicating against the previous snapshot (see the sketch after this list).
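To make the second point concrete, here’s a small illustrative Python sketch. Field names like StartChunk/EndChunk are just stand-ins for whatever the real format uses: the point is that one extra data chunk early in the snapshot shifts every later file’s indices, so the serialized file sequence diverges immediately and its chunks won’t dedup against the previous snapshot.

```python
import json

def file_entry(name, start_chunk, end_chunk):
    # Minimal stand-in for a snapshot file entry; field names are illustrative,
    # not duplicacy's actual on-disk format.
    return {"name": name, "StartChunk": start_chunk, "EndChunk": end_chunk}

# Snapshot 1: 1000 files, one data chunk each, for simplicity.
old = [file_entry(f"file{i:04d}", i, i) for i in range(1000)]

# Snapshot 2: file0000 grew and now spans two chunks, so every later index shifts by one.
new = [file_entry("file0000", 0, 1)] + [
    file_entry(f"file{i:04d}", i + 1, i + 1) for i in range(1, 1000)
]

old_bytes = json.dumps(old).encode()
new_bytes = json.dumps(new).encode()

# The serialized sequences diverge at the first entry and stay different, so
# content-defined chunking over them produces mostly new metadata chunks
# instead of reusing the previous snapshot's chunks.
diverge_at = next(
    (i for i, (a, b) in enumerate(zip(old_bytes, new_bytes)) if a != b),
    min(len(old_bytes), len(new_bytes)),
)
print(f"identical for only {diverge_at} of {len(old_bytes)} bytes")
```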
It isn’t clear that losing the length list should have any effect on recovery (if a data chunk can be decrypted and decompressed, its length is very likely correct). Losing a file sequence chunk probably does mean a snapshot is unrecoverable today, but with a sufficiently aggressive JSON decoder it should be possible to resume recovery at the next file sequence chunk. This would be even easier if file sequences were chunked only on file-entry boundaries rather than arbitrary byte boundaries, though care would have to be taken to preserve the repeatability of content-defined chunking.
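For what it’s worth, here’s a rough sketch of what that “aggressive” recovery could look like, assuming (which I haven’t verified against the real layout) that the decompressed file sequence is just a stream of concatenated JSON objects: skip over the garbled span left by a missing chunk and resume at the next entry that parses.

```python
import json

def recover_entries(damaged_text):
    # Scan forward for the next parseable JSON object, skipping whatever
    # partial entries a missing chunk left behind.
    decoder = json.JSONDecoder()
    entries, pos = [], 0
    while pos < len(damaged_text):
        start = damaged_text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(damaged_text, start)
            entries.append(obj)
            pos = end
        except json.JSONDecodeError:
            pos = start + 1  # not a valid entry here; keep scanning
    return entries

# Simulate a lost chunk in the middle of a stream of six entries.
intact = "".join(json.dumps({"name": f"f{i}", "size": i}) for i in range(6))
damaged = intact[:20] + intact[60:]
print(recover_entries(damaged))  # recovers f3, f4, f5 after the damaged span
```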
Without diving into the code too much, I can guess why the format was chosen the way it was: it lets file entries be emitted to the file sequence before the data chunks containing them have been completed (and thus hashed). It wouldn’t actually take much buffering, though, to make the alternative work; a rough sketch follows.
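Here’s a hand-wavy sketch of the buffering I have in mind (all names hypothetical, not duplicacy’s actual code): only the files whose final data chunk is still open need to be held back, and once that chunk is closed and hashed, the buffered entries can be written out with the chunk reference inline.

```python
import hashlib

class InlineChunkWriter:
    def __init__(self, emit_entry):
        self.emit_entry = emit_entry   # callback that writes a finished file entry
        self.pending = []              # files whose last data chunk isn't closed yet
        self.finished_chunks = []      # hashes of completed data chunks

    def file_done(self, name, end_chunk_index):
        # Called when the packer knows which chunk holds a file's last byte,
        # even though that chunk may still be accumulating data.
        self.pending.append((name, end_chunk_index))
        self._flush()

    def chunk_done(self, chunk_bytes):
        # Called when a data chunk is closed; only now is its hash known.
        self.finished_chunks.append(hashlib.sha256(chunk_bytes).hexdigest())
        self._flush()

    def _flush(self):
        # Emit every buffered file whose final chunk has now been hashed.
        while self.pending and self.pending[0][1] < len(self.finished_chunks):
            name, end = self.pending.pop(0)
            self.emit_entry({"name": name, "EndChunkHash": self.finished_chunks[end]})
```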
Interestingly, I assumed moving the chunk hashes into the file sequence would hurt metadata size after compression, but a quick Python script showed otherwise for my test case: I actually see a 1.7% reduction in size post-LZ4 for a snapshot with around 23k files and 1,200 data chunks (and only an 18% increase pre-compression).
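In case it’s useful, here’s a simplified sketch of that kind of comparison (synthetic file/chunk data rather than my real snapshot, and `lz4.frame` standing in for whatever LZ4 mode duplicacy actually uses):

```python
import json, os, random
import lz4.frame  # pip install lz4

random.seed(0)
chunk_hashes = [os.urandom(32).hex() for _ in range(1200)]

# Synthetic file list: most files fit in one chunk, small files share chunks.
files, c = [], 0
for i in range(23000):
    span = 1 if random.random() < 0.95 else 2
    start = c
    end = min(c + span - 1, len(chunk_hashes) - 1)
    c = min(end if random.random() < 0.9 else end + 1, len(chunk_hashes) - 1)
    files.append({"name": f"dir/file{i:05d}", "start": start, "end": end})

# Layout A: separate chunk list plus file entries holding indices (roughly the current format).
layout_a = json.dumps({"chunks": chunk_hashes, "files": files}).encode()

# Layout B: chunk hashes inlined into each file entry, no separate chunk list.
layout_b = json.dumps(
    [{"name": f["name"], "chunks": chunk_hashes[f["start"]:f["end"] + 1]} for f in files]
).encode()

for label, blob in (("separate", layout_a), ("inline", layout_b)):
    print(label, len(blob), len(lz4.frame.compress(blob)))
```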
In the end I’m not sure the change is worth making, given the need to support both the new and old file formats. But if it were implemented as an old→new filter applied during new backups and during restores of old snapshots, the new code path would get regular exercise, and you’d only have to implement restore of the new format plus backup of the old, which would reduce the code to implement and test.
Thoughts?