Feature Bounty: detect and repair corrupt chunk files

My SFTP restore went horribly wrong:

I’d like to put a “Feature Bounty” in place:
@gchen is Duplicacy connected to a platform like Bountysource? https://www.bountysource.com/
I’d like to have a feature which:

  1. detects corrupted chunks in a backup
  2. repairs them automatically

I know that it is possible to use the check option, but re-downloading all the files is not very efficient, and rebuilding a corrupted chunk is manual work.

1 Like

I’m going to implement erasure coding using this library: GitHub - klauspost/reedsolomon: Reed-Solomon Erasure Coding in Go.
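
Not a committed design, but a minimal sketch of what applying that library to a chunk could look like, assuming arbitrary example shard counts (5 data + 2 parity) rather than whatever parameters Duplicacy actually ends up using:

```go
// Minimal sketch: add Reed-Solomon parity to a chunk before upload.
// The 5+2 shard layout here is an illustrative assumption, not Duplicacy's design.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	chunk := bytes.Repeat([]byte("chunk data "), 100_000) // ~1 MB of stand-in chunk content

	enc, err := reedsolomon.New(5, 2) // 5 data shards + 2 parity shards
	if err != nil {
		log.Fatal(err)
	}

	// Split the chunk into equally sized data shards (plus empty parity shards);
	// the last data shard is zero-padded if needed.
	shards, err := enc.Split(chunk)
	if err != nil {
		log.Fatal(err)
	}

	// Fill in the parity shards. Any 2 of the 7 shards can now be lost and rebuilt.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	ok, err := enc.Verify(shards)
	fmt.Println("parity consistent:", ok, err)
}
```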

3 Likes

This sounds absolutely fantastic, but would this help when, for whatever reason, chunks are seemingly truncated during upload? I think Duplicacy may need some extra verification checks (file size / checksum) when running backups…

I strongly believe this is a terrible idea from a feature-creep perspective. Duplicacy already validates the chunks. Repairing them is not the job of a backup tool, especially at the expense of extra space taken on the destination. The storage that customers use must provide consistency guarantees, along with corruption detection and recovery.

What next? Re-implement ZFS in Duplicacy? Make Duplicacy its own operating system? Where would you draw the line?

I agree that Duplicacy must validate that the chunk was uploaded successfully; in the current state of affairs this means validating the length only, as the content is transmitted over an encrypted connection; the only risk is truncation, and that only if the backend does not support upload failure reporting.
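
To illustrate the kind of check meant here, a rough sketch follows, using a hypothetical ChunkStorage interface (not Duplicacy’s real backend API): upload the chunk, then stat it on the remote side and compare lengths to catch truncation.

```go
// Sketch of a post-upload length check against a hypothetical storage interface.
package storagecheck

import "fmt"

// ChunkStorage is an illustrative abstraction; a real backend would map these
// calls to SFTP, S3, etc.
type ChunkStorage interface {
	UploadChunk(name string, data []byte) error
	RemoteSize(name string) (int64, error)
}

// uploadAndVerify re-reads the remote object's size after upload and fails
// if it does not match what was sent, which is the truncation case above.
func uploadAndVerify(s ChunkStorage, name string, data []byte) error {
	if err := s.UploadChunk(name, data); err != nil {
		return err
	}
	size, err := s.RemoteSize(name)
	if err != nil {
		return err
	}
	if size != int64(len(data)) {
		return fmt.Errorf("chunk %s truncated: sent %d bytes, remote has %d", name, len(data), size)
	}
	return nil
}
```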

1 Like

I strongly disagree. The vast majority of end-users of any backup software will not be using anything like ZFS as a storage backend. They’ll be using local, perhaps external, HDDs, or any cloud provider with unknown storage practices and resilience (Backblaze being an exception).

Now, Duplicacy isn’t just any simple backup software. In order to do what it does, it packs your data, compresses it, encrypts it. Practically locks it away. If a user’s external HDD just has a simple copy of their data, and any of that got corrupted through ‘bit rot’, it isn’t the end of the world. They can at least access it, even the corrupted file. If it’s a media file, you may not even notice. A text file; the same.

Because Duplicacy packs your data in this way, a single bit error renders the whole chunk (and the rest of the restored file) inaccessible. IMO, that’s Duplicacy’s responsibility to safeguard against as much as possible. I’m not saying the user isn’t responsible for having multiple backup copies of their data (Reed-Solomon wouldn’t protect against disk failure anyway). I am saying Duplicacy’s storage format is fragile enough to warrant extra protection. And, quite frankly, when dealing with important data, Erasure Coding is something many people would expect through familiarity with tools such as WinRAR’s Recovery Record feature. I certainly would.

Furthermore, I wouldn’t call that feature creep in the slightest. I don’t care if it’s optional, but the implementation would be straightforward and could be succinctly specified as an important, atomic feature. I don’t understand why such a feature would be a bad thing?! Where is the downside?

3 Likes

In fact, after I posted that, I had time to think it through carefully, and indeed this does seem like a definite net benefit. The storage overhead will likely be negligible, the computational overhead does not matter, and for the bulk of users who back up to a single USB drive this may be life-saving for their data, depending on the type and size of the corruption.

Where is the downside?

The downside of any extra feature is an increase in complexity and new opportunities for bugs, but in this particular case I think it is justified and worth it.

To summarize: I still think it’s not the job of a backup tool to validate storage, but as a courtesy to the users who use crappy storage, this is a useful value-add feature.

In my restore failure it was not a storage issue. It looks like the chunk files were not transmitted successfully.

Right. There are two parts to it:

  1. Uploading the chunk successfully.
  2. Preventing the chunk from rotting at the destination.

Point 1 is definitely Duplicacy’s job. There was a bug in handling backend-reported failures that is going to be fixed, as far as I understand.

Point 2 arguably is not, and the whole discussion that it may be a useful feature to write data with some level of redundancy, to handle somewhat unreliable storage, is a very slippery slope.

The problem with that is that once we move into the realm of “if it is not corrupted too much it will probably be recoverable with some randomly selected redundancy” and away from “this storage guarantees data consistency, so your backup is guaranteed to be recoverable”, your backup stops being ironclad and deterministic, and hence not reliable, and hence useless.

By the way, the most common failure on drives is a URE, where the disk fails to read a sector. If the disk has ERC configured to give up immediately, you get 512 bytes of garbage. Can the suggested algorithm handle that? If not, there is no point in bothering.

And that’s the best case. The worst case is that a desktop drive will keep retrying the read until the cows come home, so all recovery is out the window.

It might, however, help with bit rot, but as soon as Duplicacy claims that it supports recovery, people will start using progressively shittier media and eventually lose data.

I think it’s still better to say: provide reliable storage or you are on your own. Well-defined criteria with a deterministic outcome.

To put it another way: the only way to guarantee backup restorability is to use storage with consistency guarantees (for every level of redundancy implemented, there will be slightly crappier storage that rots beyond the threshold). And if that’s the case, there is no need for redundancy in Duplicacy.

Everything else is various shades of “maybe”, and that is not good enough to safeguard the data. I want “will restore”, not “likely will recover”.

(Same story with supporting Dropbox and OneDrive as backends. I feel they need to be removed; because of the crazy latency and timed-out list transactions they are fairly unusable. It’s better to have stated functionality be ironclad, as opposed to giving vague promises that are impossible to keep.)

The best, and often most likely, case is that you can retrieve (the rest of) the chunk off a disk with bad sectors quite easily. When I encounter UREs, I’m not gonna let Duplicacy keep hacking away at those sectors. I’d switch the system off and start running ddrescue to get as much data off the drive as possible. I’ve done it many times, very successfully. Even Unstoppable Copier would do the job in most situations.

It’s precisely the scenario that would allow Reed-Solomon to fix a corrupted chunk file. With an average 1MB chunk size and a typical ~15% error recovery rate, it’d easily be able to fix even 4K bad-sector errors, several of them per chunk.
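
To put rough numbers on that (an illustrative sketch, not Duplicacy’s actual parameters): with, say, 16 data shards and 3 parity shards per 1MB chunk (about 19% overhead, in the same ballpark as the ~15% above), each shard is 64KB, and any 3 shards can be lost and rebuilt. Note that klauspost/reedsolomon repairs erasures, so the damaged shard would first have to be identified, e.g. via a per-shard checksum, and passed in as missing:

```go
// Sketch: repair a chunk after a simulated bad sector, using made-up 16+3 shard parameters.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	chunk := bytes.Repeat([]byte{0xAB}, 1<<20) // 1 MB stand-in chunk

	enc, err := reedsolomon.New(16, 3) // 16 data + 3 parity shards of 64 KB each
	if err != nil {
		log.Fatal(err)
	}
	shards, err := enc.Split(chunk)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// A bad 4 KB sector lands inside shard 7. A per-shard checksum (not shown)
	// would flag it as unreadable, so drop the whole shard and let parity rebuild it.
	shards[7] = nil

	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("chunk repaired:", ok, err)
}
```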

I’m not saying other practices, such as better storage / file systems and having separate, off-site copies, aren’t of course warranted. But bad-sector situations are common in consumer land, and recovery with tools such as ddrescue is also relatively common and better known about these days.

So if I did have just 512 bytes of bad data, I’d be much more concerned when it’s packed up in chunks, compared to just having a raw copy of the data.

Imagine that one bad sector resides in a chunk that happens to be part of an even bigger, multi-gigabyte mbox file for Thunderbird, say. Compare that with the recovery scenario if I had instead just made a raw-copy backup. Yes, I could resort to my secondary remote backup storage, but knowing that a good portion of my primary local backup is toast because of one bad sector kinda makes the secondary copy rather… precarious.

1 Like