Improving robustness of duplicacy to missing/corrupt chunks with a secondary snapshot table

twlee · 7 May 2020 04:27

I have made a pull request to add a -persist option to duplicacy, which lets it continue processing despite encountering missing or corrupt file chunks. This allows a restore to complete despite file chunk errors, restricting the failure to the files that are actually affected. I think this is a useful improvement to the robustness of duplicacy in dealing with missing/corrupt chunks, given that data storage is seldom 100% reliable.
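To make the intended behaviour concrete, here is a minimal sketch of a restore loop with and without such an option. It is not the code from the pull request: fileEntry, restoreAll, restoreFile and the in-memory store map are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// fileEntry is a hypothetical stand-in for a file in a snapshot:
// a path plus the IDs of the chunks that make up its content.
type fileEntry struct {
	Path   string
	Chunks []string
}

// restoreAll tries to restore every file. With persist set, a file with a
// missing chunk is skipped with a warning and the loop continues; without
// it, the first bad chunk aborts the whole restore.
func restoreAll(files []fileEntry, store map[string][]byte, persist bool) (restored, failed []string, err error) {
	for _, f := range files {
		if e := restoreFile(f, store); e != nil {
			if !persist {
				return restored, failed, e
			}
			fmt.Printf("WARN: skipping %s: %v\n", f.Path, e)
			failed = append(failed, f.Path)
			continue
		}
		restored = append(restored, f.Path)
	}
	if len(failed) > 0 {
		// Report the failure once, after everything recoverable is restored.
		err = fmt.Errorf("%d file(s) could not be restored", len(failed))
	}
	return restored, failed, err
}

// restoreFile only checks that every chunk is present; a real restore
// would also verify the chunk contents and write them to disk.
func restoreFile(f fileEntry, store map[string][]byte) error {
	for _, id := range f.Chunks {
		if _, ok := store[id]; !ok {
			return errors.New("missing chunk " + id)
		}
	}
	return nil
}

func main() {
	store := map[string][]byte{"c1": []byte("a"), "c3": []byte("c")}
	files := []fileEntry{
		{"a.txt", []string{"c1"}},
		{"b.txt", []string{"c2"}}, // c2 is missing from the store
		{"c.txt", []string{"c3"}},
	}
	restored, failed, err := restoreAll(files, store, true)
	fmt.Println(restored, failed, err)
}
```

With persist set, only b.txt is reported as failed and the other two files are restored; without it, the loop stops at b.txt.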

One thing I note, though, is that the metadata (or snapshot) chunks remain a point of failure. Missing/corrupt metadata chunks will still cause a complete failure of the restore process. In addition, multiple backup revisions or snapshot ids could reference the same metadata chunks if they refer to a similar directory tree, meaning that some metadata chunks could be essential for multiple snapshots. In my view, it would be useful if robustness around missing/corrupt metadata chunks could be improved.

One intuitive way of doing so is for each backup to have a ‘secondary’ copy of snapshot metadata. This is similar to what is done in various file systems, e.g. the backup MFT (in NTFS) and FAT table (in FAT file systems). In principle, this could be done by preparing backup metadata sequences (e.g. "chunks2", "length2", "files2") and using these when required (perhaps with a -usesecondary option). One difficulty with preparing a backup snapshot ‘table’ is ensuring that any secondary metadata chunks differ from the primary metadata chunks (to ensure they are duplicated rather than being deduplicated). I don’t know enough about the chunk rolling hash/splitting algorithm to know how easy it is to create unique chunks. Would a small change at the start of the sequence (e.g. a “copy=2” marker) be enough to cause all downstream chunks to differ?
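On the marker question, one property of content-defined chunking in general is worth keeping in mind: a cut point depends only on the bytes inside a sliding window, so boundaries tend to resynchronise shortly after an insertion and the downstream chunks come out byte-identical (and would therefore be deduplicated). The sketch below illustrates this with a toy chunker. It is not Duplicacy's chunker: split, sampleData, commonTail, the window size and the additive hash are all assumptions made for the demonstration.

```go
package main

import "fmt"

const (
	window = 32 // bytes covered by the rolling hash
	mask   = 63 // cut when the low 6 bits of the hash are zero (toy-sized chunks)
)

// split cuts data wherever the sum of the last `window` bytes has its low
// bits equal to zero. The additive hash is a toy stand-in for a real
// rolling hash; the point is that a cut depends only on nearby bytes.
func split(data []byte) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window]) // slide the window forward
		}
		if i+1 >= window && sum&mask == 0 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// sampleData returns n deterministic pseudo-random bytes (a simple LCG),
// so the demonstration needs no input files.
func sampleData(n int) []byte {
	out := make([]byte, n)
	state := uint64(1)
	for i := range out {
		state = state*6364136223846793005 + 1442695040888963407
		out[i] = byte(state >> 56)
	}
	return out
}

// commonTail counts how many chunks at the end of a and b are identical.
func commonTail(a, b [][]byte) int {
	n := 0
	for n < len(a) && n < len(b) &&
		string(a[len(a)-1-n]) == string(b[len(b)-1-n]) {
		n++
	}
	return n
}

func main() {
	data := sampleData(1 << 16)
	orig := split(data)
	marked := split(append([]byte("copy=2\n"), data...))
	fmt.Printf("chunks: %d, unchanged after adding the marker: %d\n",
		len(orig), commonTail(orig, marked))
}
```

In this toy setup every chunk except the first is unchanged by the marker. Real chunkers also enforce minimum and maximum chunk sizes, which can delay resynchronisation but usually do not prevent it, so a marker at the start of the sequence is unlikely to be enough on its own if the metadata is split this way.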

What are people’s thoughts on this?

TheBestPessimist · 7 May 2020 06:21

Your addition is very welcome, as we have talked about making (at least restore) not die at every error. (Linking my issue to this one for easier tracking: Add a flag to ignore if a file cannot be restored)

Note that i only skimmed over your PR’s description and code changes, so i can’t say much about it.
One thing I like is that all the warnings are written to the user instead of being swallowed.

I would like to ask if maybe this feature should be flipped: by default it should work in a “restore everything possible no matter if chunks are bad” mode, and the flag should be -stopOnError.

This is because, imo during restore i would not like to stop if anything mundane happens (ie. 1 missing chunk from a 3TB revision), except in possibly very… exceptional cases, where i would use/expect this flag.

This also goes in line with how I expect backup to act: don’t die on me if you cannot access a file. (see what i mean here: Can we skip locked or failing files (without VSS)?)

off topic portion: english got the better of me today.

Your addition is very welcomed

Is this correct? should it be

Your addition is very welcome

instead?

Droolio · 7 May 2020 16:57

Definitely, this should be the preferred behaviour.

In fact, why even have a -stopOnError flag? IMO, it should produce errors/warnings in the output when it encounters such files (to inform the user) and then error at the end when it’s finished restoring the snapshot.

User can always Ctrl-C, but it’s not like they’d not want to run it again until it completes, just to see how much else was corrupt/missing.

Am English. While I can’t precisely explain why, this sounds better.

The other still works, but is more past-tense or sounds like it’s on behalf of others, which I guess also works but we haven’t necessarily consulted.

Droolio · 7 May 2020 17:00

Regarding metadata chunks… You could perhaps add different nonce values to the encryption layer when dealing with such chunks. A simpler method would be to have the same chunk written twice - one with a .bak file extension.

However, I don’t really like the idea of duplicating chunks that were designed to be de-duplicated. With the exception of the config file, since it’s very small.

Missing/corrupt metadata chunks shouldn’t be a huge problem so long as you run regular checks. In fact, unlike non-metadata chunks, a normal check should verify the integrity of such chunks without the -chunks or -files flags - since it has to unpack and read all that metadata in order to get the chunklist. IF a metadata chunk does get corrupted, you’ll find out soon enough and should be able to fix it quickly - maybe with a new backup ID, or comparing cached chunks.

Instead, I’d actually like to see a form of parity protection, covering all backup data.

I know a lot has been said about how the actual storage medium should be responsible for the integrity of your data, but let’s face it, not everyone has ZFS or unRAID or can 100% trust cloud storage. Imagine if you could add, say, 3 parity chunks for every 17 data chunks. Hell, it’d be very tricky to implement - you might have to add a separate process to add and prune this data in an exclusive-only fashion. But you’d have a level of data protection independent of storage medium.
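As a rough illustration of the parity idea, here is a deliberately simplified sketch that uses a single XOR parity block per group, which can rebuild exactly one lost chunk. This is a swapped-in, weaker technique: tolerating 3 losses per 17 data chunks would need an erasure code such as Reed-Solomon. xorParity and recoverChunk are hypothetical helpers, not anything in duplicacy.

```go
package main

import "fmt"

// xorParity returns one parity block for a group of chunks. Shorter chunks
// are treated as if padded with zero bytes up to the longest length.
func xorParity(chunks [][]byte) []byte {
	size := 0
	for _, c := range chunks {
		if len(c) > size {
			size = len(c)
		}
	}
	parity := make([]byte, size)
	for _, c := range chunks {
		for i, b := range c {
			parity[i] ^= b
		}
	}
	return parity
}

// recoverChunk rebuilds the chunk at index missing from the parity block
// and the surviving chunks. length is the original size of the lost chunk,
// which would have to be recorded alongside the parity data.
func recoverChunk(chunks [][]byte, missing int, parity []byte, length int) []byte {
	out := append([]byte(nil), parity...) // copy so the parity block is untouched
	for i, c := range chunks {
		if i == missing {
			continue
		}
		for j, b := range c {
			out[j] ^= b
		}
	}
	return out[:length]
}

func main() {
	group := [][]byte{[]byte("chunk-one"), []byte("two"), []byte("chunk-three!")}
	parity := xorParity(group)

	lost := len(group[1])
	group[1] = nil // simulate a missing chunk
	fmt.Printf("%q\n", recoverChunk(group, 1, parity, lost)) // prints "two"
}
```

Even this simple scheme shows where the bookkeeping cost comes from: the parity block is tied to a fixed group of chunks, so pruning any chunk in the group means the parity has to be recomputed.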

twlee · 8 May 2020 06:40

Happy to have this by default, I left it as optional to minimise differences with the current version.

@Droolio You’re right that regular checks should detect metadata corruption quickly. However, one use case I have is an ‘archived backup’, i.e. backup of old files in which incremental backups and checks are done only very occasionally.

Parity correction would be ideal, but definitely quite a bit of work! My backup snapshot table was a ‘quick-fix’ idea, although you are right that this does make things a bit messier.

Thinking about this a bit further, perhaps a -copymetadata command would be a conceptually cleaner workaround - this would allow a user to make additional copies of a backup’s metadata to additional storage locations. With -bit-identical storage locations, these backup chunks should be able to be mixed with the original backup to reconstruct the snapshot metadata.

twlee · 8 May 2020 12:15

Update: I’ve implemented a -metadata-only parameter for the copy and check commands. The copy -metadata-only command will only copy metadata chunks from one storage to another. check -metadata-only will only check metadata chunks in a storage, allowing check to complete successfully for metadata only copies. This was rather easy to implement as it wasn’t really introducing anything new.

As noted above, this is meant for storage that is -bit-identical as it allows any metadata-backup chunks to be substituted in the primary backup as-is. The chunks in the metadata backup will have the same name but these ‘duplicate’ encrypted chunks created with the copy command will differ from the original at the binary encrypted level as they are encrypted with different nonce values (this is the default behaviour of copy).
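The effect described above can be sketched as follows, assuming AES-GCM with a random nonce per chunk and a chunk name derived from the plaintext. encryptChunk, decryptChunk and the plain SHA-256 name are illustrative assumptions, not duplicacy's actual functions or file format.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24- or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptChunk seals plaintext under a fresh random nonce and returns
// nonce||ciphertext, so every call yields different bytes on disk.
func encryptChunk(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decryptChunk reverses encryptChunk and fails if the data was corrupted.
func decryptChunk(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed chunk too short")
	}
	nonce, body := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, body, nil)
}

func main() {
	key := make([]byte, 32) // all-zero demo key; never use this for real data
	meta := []byte("snapshot metadata")

	primary, err := encryptChunk(key, meta)
	if err != nil {
		panic(err)
	}
	backup, err := encryptChunk(key, meta)
	if err != nil {
		panic(err)
	}

	// Hypothetical chunk name: a hash of the plaintext, identical for both copies.
	fmt.Printf("name: %x\n", sha256.Sum256(meta))
	fmt.Println("same bytes on disk:", string(primary) == string(backup))

	restored, err := decryptChunk(key, backup)
	fmt.Println(string(restored), err)
}
```

Because both copies decrypt to the same plaintext under the same key, either file can stand in for the other under the same name, even though the stored bytes differ.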

For anyone who is interested, this version is available on my fork here: GitHub - twlee79/duplicacy at twldev_copysnapshot

I’m reluctant to submit these changes as a pull request (PR) as I think this new feature is somewhat ‘niche’ and won’t benefit many people (I’m likely to use it myself, however). In addition, there is currently no flag saved in a snapshot or storage to indicate that it is meant to be ‘metadata only’, thus it is up to users to correctly manage this feature. Not really ideal for production code, so such a system would need to be implemented if this was to be taken forward.

bkeeper · 8 May 2020 17:49

Thank you

I still think you should submit the PR.
Even if it is only a starting point and not feature complete.

FR: Auto-backup with multiple copies of duplicacy config and chunk metadata would be an essential feature for me. Even with default cross copy to multiple storage. Local + cloud.

Once we implement a remote api Duplicacy will need a central config to manage all configs, and then autobackup will be more important than ever.

Droolio · 9 May 2020 17:17

Does copy -metadata-only still work with storages that are not -bit-identical?

If so, this would be fine for all types of storages, because you could retrieve the config of any storage and copy -metadata-only again to it. The metadata chunks would revert back to the original and you could use any to fix missing/corrupt chunks.

This is a pretty neat feature actually, and I like that it can be done on a separate storage. You should definitely submit a PR.

Wondering if a storage (through config) or snapshots should be made ‘metadata-only’-aware to cope with things like prune etc.?

dreamflasher · 6 July 2020 14:57

This would be immensely helpful; it would also allow restoring a backup without overwriting existing files. (If -overwrite is not specified it just errors out.) Any chance of getting this in?