I have made a pull request to add -persist
option to duplicacy to continue processing despite encountering missing or corrupt file chunks. This allows a restore
process to complete despite encountering file chunk errors, thus restricting the failure to the actual affected files. I think this is a useful improvement to increase robustness of duplicacy in dealing with missing/corrupt chunks given that data storage is seldom 100% reliable.
One thing I note though is that a point of failures is in the metadata (or snapshot) chunks. Missing/corrupt metadata chunks will still cause a complete failure of the restore process. In addition, it is possible that multiple backup revisions or snapshot ids could reference the same metadata chunks if they are refering to similar directory tree, meaning that some metadata chunks could be essential for multiple snapshots. In my view, it would be useful if robustness around missing/corrupt metadata chunks could be improved.
One intuitive way of doing so is for each backup to have a ‘secondary’ copy of snapshot metadata. This is similar to what is done in various file systems, e.g. the backup MFT (in NTFS) and FAT table (in FAT file systems). In principle, this could be done by preparing backup metadata sequences (e.g. "chunks2"
, "length2"
, "files2"
) and using this when required (perhaps with a -usesecondary
option). One difficulty with preparing a backup snapshot ‘table’ is ensuring that any secondary metadata chunks differ from from the primary metadata chunks (to ensure they are duplicated rather than being deduplicated). I don’t know enough about the chunk rolling hash/splitting algorithm to know how easy it is to create unique chunks. Would a small change at the start of the sequence (e.g. a “copy=2” marker) be enough to cause all downstream chunks to differ?
What are people’s thoughts on this?