Check -chunks doesn't seem to do anything

This isn’t necessarily true. I recently encountered a situation where the storage had run out of space (the prune operation had got stuck for a few days) and Duplicacy was happily writing truncated chunks to the storage. The storage integrity was fine, but bad data was being flushed to disk.

An -exhaustive prune alone didn’t fix the storage, as some snapshots referencing bad chunks were also flushed to disk. So knowing which referenced chunks are corrupted is actually helpful; it doesn’t mean the rest of the storage is toast, and you can recover from such situations.

I didn’t understand.

Was the storage full and was :d: replacing the chunks?

Or was it low on space and :d: wasn’t saving all the necessary chunks?

And it didn’t throw any error messages?

I don’t fully understand how it was able to flush data to the disk (ext4), but it wrote complete snapshots (without error) to disk that referenced chunks that I know were truncated due to lack of disk space.

There were around 8 chunks that were non-0-byte but had datestamps around the time it ran out of space, and which were obviously truncated because a check -chunks said it couldn’t decrypt them. (I had to manually delete these chunks, because a normal check only tests whether the chunks exist and I wanted to know what snapshots were affected.)

What I suspect happened is that it ran out of space, some chunks were partially written to the disk, and the backup failed at that point. A scheduled prune cleared a bit of disk space and a subsequent backup succeeded this time, referencing the bad chunks that had already been written.

:astonished:

Without any error message?

1 Like

This is the important point here: how was that allowed to happen? What storage backend was that? Is it reproducible? This seems like either a Duplicacy bug ignoring a failure returned in this specific corner case, or the storage failing at being storage: not guaranteeing that what’s written can be read.

And while this is such a fundamental failure that it should not have happened with any filesystem, it being ext4 could have been a compounding factor here: ext4 cannot guarantee anything to begin with.

That does not mean you should use check -chunks on it periodically, as that does not eliminate the problem; it just adds burden.

That does mean you should not use it to host backup target in the first place.

In other words, if the failure does turn out to be an ext4 issue, it’s a non-issue, since ext4 is not suitable for long-term data storage anyway, for other reasons.

There were failed backup jobs at the end, obviously, but after clearing up enough disk space, Duplicacy appeared to rely on the presence of those badly-written chunks in subsequent backups.

I used check -chunks to identify bad blocks that would otherwise not have revealed themselves, because as far as Duplicacy was concerned, they 1) existed, and 2) were non-zero in size. A regular check didn’t, and wouldn’t, have picked that up.

It wasn’t an ext4 issue; it may have been compounded by the fs (transaction logs?), but it was the lack of disk space (in this instance) that caused the issue.

One thing I’d like to test is whether similar scenarios could be replicated by killing the process / TCP connections. I know some (most?) storage back-ends are meant to write to .tmp files first, so it remains a bit of a puzzle.

That’s a problem. I would think Duplicacy would upload the chunk to some temp file and rename it after the upload succeeded. I vaguely remember there were some issues supporting that with certain backends (B2 comes to mind), but for all others that support a move operation it should be done that way. @gchen?

That’s my point. It’s not the job of Duplicacy to look for bad blocks on a storage medium. It’s operating on a different level of abstraction. When you get data corruption detectable at the application level, it’s already too late. There is nothing the application can do to recover. Knowing that the backup rotted is non-actionable.

Instead, had you used a redundant array with zfs or btrfs at the target and periodic scrubs, you would have had data integrity guarantees and been immune to bad blocks. They would have been corrected and the data recovered silently from parity and redundancy, either when Duplicacy accesses the data during restore or during a periodic scrub.

In other words, asking Duplicacy to effectively emulate scrub is counterproductive, because the failures it would uncover are not actionable.

So what I understand is that we have two issues here:

  1. ext4/quota manager/whatever returned a failure writing a chunk to the media during backup, and Duplicacy counted that partially written chunk as valid on the next backup and restore. If that’s the case, it’s a Duplicacy bug and must be fixed.

  2. Data was written successfully but onto media without data consistency guarantees and over time rotted, as expected. The solution here is not scrub but moving to a checksumming filesystem, and it is not Duplicacy’s job to verify storage media. What next, run a memory test and check CPU correctness?

I’m not saying the -chunks flag is completely useless; it’s useful to see corruption and serve as a wake-up call to start using proper storage, or to prove that the problem is not with the logic but with the storage backend. It should not, however, be used as part of a healthy backup strategy to verify storage consistency; instead, storage that provides those guarantees inherently shall be used.

I agree that upload+rename would help ensure that only fully uploaded chunks are being included, but afaik some backends don’t support rename, so maybe some other approach to flag the chunk as “ready” should be used. Maybe by adding another file that would indicate that the corresponding chunk is “not ready” and deleting it after the chunk upload succeeds. Sort of a poor man’s journaling.
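
A minimal sketch of that marker-file idea, assuming only a generic backend that can create and delete files (the `Backend` interface and function names below are hypothetical, not Duplicacy’s actual API):

```go
package sketch

import "fmt"

// Backend is a stand-in for a generic storage backend that can only
// create and delete files (no rename/move operation).
type Backend interface {
	UploadFile(path string, data []byte) error
	DeleteFile(path string) error
}

// uploadChunkWithMarker writes a "<chunk>.incomplete" marker before the
// chunk and removes it only after the chunk upload succeeded. A check or
// prune that sees a leftover marker knows the chunk must not be trusted.
func uploadChunkWithMarker(b Backend, chunkPath string, data []byte) error {
	marker := chunkPath + ".incomplete"
	if err := b.UploadFile(marker, []byte{}); err != nil {
		return fmt.Errorf("failed to create marker: %w", err)
	}
	if err := b.UploadFile(chunkPath, data); err != nil {
		// Leave the marker in place so a half-written chunk is never
		// treated as valid by later backups or checks.
		return fmt.Errorf("chunk upload failed: %w", err)
	}
	return b.DeleteFile(marker)
}
```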

But isn’t it exactly part of the point of the -chunks flag to verify that the whole backup is correct end to end? Even with proper storage, there can still be memory or CPU errors, etc. So the only way to check the backup is to essentially restore it one way or another.

Absolutely. In the end the only thing that matters is whether you are able to get your files back and the only way to verify that is to actually try and get your files back.

All the in-between solutions, from that down to never checking anything at all, are of varying cost/reward ratios and involve assumptions of varying degrees of plausibility.

The check command without arguments therefore seems to be low cost / high reward, since for reliable storage it is not a stretch to assume that if a snapshot refers to a chunk and the chunk is present, the storage will guarantee that it is good.

The -files argument is in the same way high cost / high reward: extremely high resource usage but produces ultimate truth answering the only important question: can files be restored?

The -chunks argument is high cost / low reward. It uses almost the same resources as -files but does not provide an answer to the only meaningful question: can files be restored? It validates that the chunk data is still valid in the most inefficient way possible, by downloading all the data, because it does not know (nor should it know) about the underlying storage and what guarantees it provides. (S3, I think, supports server-side consistency validation, but I’m not too familiar with that.)

Hence it’s not productive to do that: either run check with -files or with no arguments at all. And if the storage is reliable, well, maybe run -files once a decade. That’s my IMHO :slight_smile:

1 Like

I think this is wrong on so many levels I don’t know where to start. My case patently proves the opposite: the storage wasn’t aware of the bad data, and it isn’t the storage’s job to verify the integrity of the backup data when only Duplicacy can, say if a bug or memory corruption occurs.

After all, there was no bit-rot here. No amount of redundancy or bit-rot detection was going to help. A different fs, perhaps? Resource monitoring, perhaps? I soon learned of the failure either way.

However, even when disk space was freed up, Duplicacy continued to run backups referencing bad chunks, and it was only running different types of check that saved me.

Again, this is verifiably inaccurate. It wasn’t “too late”, nor did the backup “rot”. A check -chunks was necessary in order to fix the broken backup storage. And I was able to. The filesystem or storage wasn’t going to help with that.

One thing I learned is that leaving referenced or even unreferenced bad chunks in the storage can be a very bad thing™. The best way to deal with that is to run prune -exhaustive to remove them.

Otherwise, subsequent backups might re-reference those chunks, due to the deterministic nature of chunk hashing; i.e. it sees those chunks in the storage, assumes they’re good, and doesn’t bother to re-upload them.
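
Roughly why that happens: in a content-addressed store, the chunk’s name is derived from its content hash, and anything already present under that name is trusted. A simplified sketch of that dedup logic (plain SHA-256 here instead of whatever keyed hash Duplicacy actually uses; `chunkExists`/`uploadChunk` are hypothetical callbacks):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
)

// saveChunk shows the content-addressed dedup idea: the chunk ID is
// derived from its content, and if a file with that ID is already on the
// storage the upload is skipped. A truncated chunk left behind under the
// "right" name therefore keeps getting referenced by later backups
// without ever being re-uploaded or re-read.
func saveChunk(data []byte, chunkExists func(id string) bool, uploadChunk func(id string, data []byte) error) (string, error) {
	sum := sha256.Sum256(data)
	id := hex.EncodeToString(sum[:])
	if chunkExists(id) {
		return id, nil // assumed good; never verified here
	}
	return id, uploadChunk(id, data)
}
```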

IMO you have these the wrong way around. The reason -chunks works better is that it quickly validates the chunks in the storage without downloading them multiple times.

In my case, I had a dozen or so failed and ‘successful’ backups after the storage had run out of space, all referencing truncated chunks. As far as Duplicacy and the storage were concerned, there was nothing wrong. When the latest backup is referencing bad chunks, subsequent backups will most likely not have full integrity, but will still succeed in creating a snapshot.

Thus using -chunks instead of -files saved me a lot of time in fixing the storage and getting back up and running. A later check -files on the last revision only also confirmed that the storage was in good nick, although I might later do a proper restore (as I often do).

I actually think this is somewhat of a design flaw (or hole) in Duplicacy’s design…

Plenty of people here have come across 0-byte chunks/snapshots and, thankfully, a regular check now tests for that (although it annoyingly stops processing further). What about non-0-byte bad chunks? IMO, Duplicacy needs more, and quicker, integrity checks.

Perhaps a verification stage at the end of a backup job that tests that all written chunks exist, have the correct file sizes, and have the correct hash (for storage backends that support remote hashing).
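
As a rough sketch of what such a stage could look like, assuming the backend can at least report sizes and, optionally, a remote hash (the types and functions below are hypothetical, not Duplicacy’s):

```go
package sketch

import "fmt"

// ChunkInfo describes a chunk the backup job just uploaded.
type ChunkInfo struct {
	Path string
	Size int64
	Hash string // expected hash, if the backend can report one
}

// RemoteStat is a stand-in for whatever the backend can tell us about a
// stored file without downloading it.
type RemoteStat struct {
	Exists bool
	Size   int64
	Hash   string // empty if the backend has no remote hashing
}

// verifyUploadedChunks re-checks every chunk written by a backup:
// existence, size, and (where available) the remote hash.
func verifyUploadedChunks(chunks []ChunkInfo, stat func(path string) (RemoteStat, error)) error {
	for _, c := range chunks {
		s, err := stat(c.Path)
		if err != nil {
			return fmt.Errorf("stat %s: %w", c.Path, err)
		}
		if !s.Exists {
			return fmt.Errorf("chunk %s is missing", c.Path)
		}
		if s.Size != c.Size {
			return fmt.Errorf("chunk %s has size %d, expected %d", c.Path, s.Size, c.Size)
		}
		if s.Hash != "" && c.Hash != "" && s.Hash != c.Hash {
			return fmt.Errorf("chunk %s has a mismatched remote hash", c.Path)
		}
	}
	return nil
}
```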

At the end of the day, I strongly believe both the storage and Duplicacy need to work in tandem to ensure backup integrity. Plus, having multiple backup copies and testing regularly (check, -chunks, -files and restore) is essential.

2 Likes

Very interesting discussions and arguments above.

I think the most worrying point is this:

This can cause all revisions using these badly-written chunks to be affected (silently …).

And this is perhaps a good solution proposal:

2 Likes

For local-disk and sftp storage Duplicacy uploads the chunk to a temporary file and then renames it. For cloud storage Duplicacy doesn’t upload to a temporary file first.

B2 is different from other cloud storage because it doesn’t support file renaming. But file renaming is needed for marking a chunk as a fossil, not for uploading chunks. The workaround for B2 is to hide the file using the b2_hide_file API call, which is sufficient for marking the chunks.
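
For the local-disk case, the pattern being described is the usual write-to-temp-then-rename dance; a generic sketch of it (not the actual Duplicacy code) would look roughly like this:

```go
package sketch

import (
	"os"
	"path/filepath"
)

// writeChunkAtomically writes the chunk to a temporary file in the same
// directory, flushes it, and only then renames it into place. A chunk
// file under its final name therefore never exists in a half-written
// state; an interrupted upload leaves only a temp file behind.
func writeChunkAtomically(finalPath string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(finalPath), filepath.Base(finalPath)+".tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the data hit the disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath)
}
```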

I just noticed that check -chunks checks each chunk only once, so it’s not useful for detecting bit rot, which may be why it doesn’t seem to do anything once an initial backup has been checked:

2021-07-10 02:51:33.764 INFO SNAPSHOT_VERIFY Verifying 530239 chunks
2021-07-10 02:51:35.178 INFO SNAPSHOT_VERIFY Skipped 529118 chunks that have already been verified before
...
2021-07-10 02:53:30.869 INFO SNAPSHOT_VERIFY All 530239 chunks have been successfully verified
2021-07-10 02:53:33.602 INFO SNAPSHOT_VERIFY Added 1121 chunks to the list of verified chunks

How many chunks would a woodchuck check if a woodchuck could check chunks?

I think if you delete the local cache it will have to redownload and re-verify them all again.

On a separate note: detecting bit rot is hardly useful. Either it should be correctable (erasure coding), in which case it does not matter that it occurred, or, better yet, the storage should guarantee data consistency. (Knowing that bits rotted does not help recover the data; the data loss has already been allowed to occur, and therefore the whole backup solution needs to be modified to prevent it, e.g. by switching to checksumming redundant storage or commercial clouds that provide consistency guarantees.)

1 Like

The file ~/.duplicacy-web/repositories/localhost/all/.duplicacy/cache/storage/verified_chunks saves the list of chunks that have already been verified. If you want to check all chunks again just delete this file.
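
So the skip behaviour seen in the log above is essentially a persisted set of chunk IDs. A simplified sketch of the idea (the one-ID-per-line file format here is illustrative, not necessarily what Duplicacy actually stores):

```go
package sketch

import (
	"bufio"
	"os"
)

// loadVerifiedChunks reads the cache of already-verified chunk IDs, one
// per line. Chunks in this set are skipped by the next check -chunks run,
// which is why deleting the cache file forces a full re-verification.
func loadVerifiedChunks(path string) (map[string]bool, error) {
	verified := make(map[string]bool)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return verified, nil // no cache yet: everything gets verified
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		verified[scanner.Text()] = true
	}
	return verified, scanner.Err()
}

// chunksToVerify filters out chunks that are already in the verified set,
// matching the "Skipped N chunks that have already been verified" message.
func chunksToVerify(all []string, verified map[string]bool) []string {
	var todo []string
	for _, id := range all {
		if !verified[id] {
			todo = append(todo, id)
		}
	}
	return todo
}
```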

CrashPlan has a “self healing” feature: Archive maintenance - Code42 Support. During the 10 years I used it, “healing” file uploads were reported several times.

If a chunk is detected as missing or corrupted, there’s still a chance the file(s) that included it are still in the repository such that it could be reconstructed and replaced.
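
In principle such a reconstruction could look something like the sketch below, heavily simplified and hypothetical: splitIntoChunks and chunkID stand in for whatever chunking and (keyed) hashing the storage actually uses, and it can only work if the source files haven’t changed since the original backup:

```go
package sketch

// tryRebuildChunk re-chunks the local files that referenced a missing or
// corrupted chunk and re-uploads any piece whose ID matches. The chunk
// boundaries and hashes must come out identical to the original backup,
// so unchanged source data is a precondition.
func tryRebuildChunk(missingID string, localFiles []string,
	splitIntoChunks func(path string) ([][]byte, error), // hypothetical chunker
	chunkID func(data []byte) string, // hypothetical (keyed) chunk hash
	uploadChunk func(id string, data []byte) error) (bool, error) {

	for _, path := range localFiles {
		chunks, err := splitIntoChunks(path)
		if err != nil {
			return false, err
		}
		for _, data := range chunks {
			if chunkID(data) == missingID {
				return true, uploadChunk(missingID, data)
			}
		}
	}
	return false, nil // source data changed or gone: chunk cannot be rebuilt
}
```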

2 Likes

What about an option to automate this, e.g. check -allchunks? And it would be helpful to be able to schedule jobs at monthly intervals with the Web UI.

Oh wow, this is horrifying! I’m also a (former) long-time (at least a decade) CrashPlan user, but I never noticed that (granted, I was not aware that this was a possibility and wasn’t reading every log message). This means their storage is unreliable crap with no redundancy. This is legitimately scary.

A backup solution that relies on users not losing files… Yeah. And then what about version history? It would also be gone, and silently at that.

I don’t know what to say… just wow… I’m terrified in hindsight.

1 Like

I had archives both locally and on CP Central, and I don’t recall which needed “self healing”. But you’ve got to admire CP’s marketing department. :wink:

2 Likes

Not necessarily their storage arrays, we can’t know that, but it wouldn’t surprise me if their reliance on big honking databases is the real cause here. The verification procedure took eons.

Anyway, glad I moved away from CP years ago; I lost all hope in their product after years of promising a native (non-Java) client. The horror stories I’ve heard since, of people losing their whole backups (or anything over X GB) and nothing Code42 could do about it, are scary enough.