I recently experienced a failure in both of my backup drives, and I was able to recover most, but not all, of the files stored on them. I wasn’t actually using duplicacy at the time, but the event has led me to review backup software with an eye toward this kind of failure.
I pored over the snapshot file format as described, and I noticed it doesn’t seem very resilient to the loss of a metadata chunk. Granted, that should be rare, since metadata is generally much smaller than data; but then I noticed that metadata doesn’t deduplicate very well. Maybe that’s actually good, since the loss of a metadata chunk is then less likely to affect other snapshots; but I thought I’d bring it up in case you weren’t aware of it.
The issue boils down to keeping the chunk list separate from the file list.
- The loss of any chunk containing the chunk sequence means that snapshot is unrecoverable.
- If large enough changes occur to files early in the file sequence, then every file after that point gets chunk indices (StartChunk/EndChunk) that differ from the previous snapshot, which prevents the chunks encoding the file sequence from deduplicating against the previous snapshot (see the sketch after this list).
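To make the second point concrete, here’s a small illustrative Python sketch. Field names like StartChunk/EndChunk are just stand-ins for whatever the real format uses: the point is that one extra data chunk early in the snapshot shifts every later file’s indices, so the serialized file sequence diverges immediately and its chunks won’t dedup against the previous snapshot.

```python
import json

def file_entry(name, start_chunk, end_chunk):
    # Minimal stand-in for a snapshot file entry; field names are illustrative,
    # not duplicacy's actual on-disk format.
    return {"name": name, "StartChunk": start_chunk, "EndChunk": end_chunk}

# Snapshot 1: 1000 files, one data chunk each, for simplicity.
old = [file_entry(f"file{i:04d}", i, i) for i in range(1000)]

# Snapshot 2: file0000 grew and now spans two chunks, so every later index shifts by one.
new = [file_entry("file0000", 0, 1)] + [
    file_entry(f"file{i:04d}", i + 1, i + 1) for i in range(1, 1000)
]

old_bytes = json.dumps(old).encode()
new_bytes = json.dumps(new).encode()

# The serialized sequences diverge at the first entry and stay different, so
# content-defined chunking over them produces mostly new metadata chunks
# instead of reusing the previous snapshot's chunks.
diverge_at = next(
    (i for i, (a, b) in enumerate(zip(old_bytes, new_bytes)) if a != b),
    min(len(old_bytes), len(new_bytes)),
)
print(f"identical for only {diverge_at} of {len(old_bytes)} bytes")
```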
It isn’t clear that losing the length list should have any effect on recovery (if a data chunk can be decrypted and decompressed, its length is very likely correct). Losing a file sequence chunk probably does mean a snapshot is unrecoverable today, but with a sufficiently aggressive JSON decoder it should be possible to resume recovery at the next file sequence chunk. This would be even easier if file sequences were chunked only on file-entry boundaries rather than arbitrary byte boundaries, though care would have to be taken to preserve the repeatability of content-defined chunking.
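For what it’s worth, here’s a rough sketch of what that “aggressive” recovery could look like, assuming (which I haven’t verified against the real layout) that the decompressed file sequence is just a stream of concatenated JSON objects: skip over the garbled span left by a missing chunk and resume at the next entry that parses.

```python
import json

def recover_entries(damaged_text):
    # Scan forward for the next parseable JSON object, skipping whatever
    # partial entries a missing chunk left behind.
    decoder = json.JSONDecoder()
    entries, pos = [], 0
    while pos < len(damaged_text):
        start = damaged_text.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(damaged_text, start)
            entries.append(obj)
            pos = end
        except json.JSONDecodeError:
            pos = start + 1  # not a valid entry here; keep scanning
    return entries

# Simulate a lost chunk in the middle of a stream of six entries.
intact = "".join(json.dumps({"name": f"f{i}", "size": i}) for i in range(6))
damaged = intact[:20] + intact[60:]
print(recover_entries(damaged))  # recovers f3, f4, f5 after the damaged span
```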
Without diving into the code too much, I can guess why the format was chosen the way it was: it lets file entries be emitted to the file sequence before the data chunks containing them have been completed (and thus hashed). It wouldn’t actually take much buffering, though, to make the alternative work; a rough sketch follows.
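Here’s a hand-wavy sketch of the buffering I have in mind (all names hypothetical, not duplicacy’s actual code): only the files whose final data chunk is still open need to be held back, and once that chunk is closed and hashed, the buffered entries can be written out with the chunk reference inline.

```python
import hashlib

class InlineChunkWriter:
    def __init__(self, emit_entry):
        self.emit_entry = emit_entry   # callback that writes a finished file entry
        self.pending = []              # files whose last data chunk isn't closed yet
        self.finished_chunks = []      # hashes of completed data chunks

    def file_done(self, name, end_chunk_index):
        # Called when the packer knows which chunk holds a file's last byte,
        # even though that chunk may still be accumulating data.
        self.pending.append((name, end_chunk_index))
        self._flush()

    def chunk_done(self, chunk_bytes):
        # Called when a data chunk is closed; only now is its hash known.
        self.finished_chunks.append(hashlib.sha256(chunk_bytes).hexdigest())
        self._flush()

    def _flush(self):
        # Emit every buffered file whose final chunk has now been hashed.
        while self.pending and self.pending[0][1] < len(self.finished_chunks):
            name, end = self.pending.pop(0)
            self.emit_entry({"name": name, "EndChunkHash": self.finished_chunks[end]})
```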
Interestingly, I assumed moving the chunk hashes into the file sequence would hurt metadata size after compression, but a quick Python script showed otherwise for my test case: I actually see a 1.7% reduction in size post-LZ4 for a snapshot with around 23k files and 1,200 data chunks (and only an 18% increase pre-compression).
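In case it’s useful, here’s a simplified sketch of that kind of comparison (synthetic file/chunk data rather than my real snapshot, and `lz4.frame` standing in for whatever LZ4 mode duplicacy actually uses):

```python
import json, os, random
import lz4.frame  # pip install lz4

random.seed(0)
chunk_hashes = [os.urandom(32).hex() for _ in range(1200)]

# Synthetic file list: most files fit in one chunk, small files share chunks.
files, c = [], 0
for i in range(23000):
    span = 1 if random.random() < 0.95 else 2
    start = c
    end = min(c + span - 1, len(chunk_hashes) - 1)
    c = min(end if random.random() < 0.9 else end + 1, len(chunk_hashes) - 1)
    files.append({"name": f"dir/file{i:05d}", "start": start, "end": end})

# Layout A: separate chunk list plus file entries holding indices (roughly the current format).
layout_a = json.dumps({"chunks": chunk_hashes, "files": files}).encode()

# Layout B: chunk hashes inlined into each file entry, no separate chunk list.
layout_b = json.dumps(
    [{"name": f["name"], "chunks": chunk_hashes[f["start"]:f["end"] + 1]} for f in files]
).encode()

for label, blob in (("separate", layout_a), ("inline", layout_b)):
    print(label, len(blob), len(lz4.frame.compress(blob)))
```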
In the end I’m not sure the change is worth making, given the need to support both the new and old file formats. But if it were implemented as an old→new filter applied during new backups and during restores of old snapshots, the new code path would get regular exercise, and you’d only have to implement restore of the new format plus backup of the old, which would reduce the code to implement and test.
Thoughts?