Corrupted chunks

I have a duplicacy backup that has 265 increments over the course of several years. I ran:

duplicacy check -r 265 -chunks -threads 32 -persist

and it gave me:

42 out of 1405935 chunks are corrupted

This seems very bad… I have a few questions about this:

  1. How do I figure out which chunks are bad?
  2. It seems the way to fix this going forward is to (1) delete the corrupted chunks; (2) change the repository id; (3) make a new backup of the same files and hope those chunks get re-uploaded; (4) change the repository id back (see the rough sketch after this list) - is this accurate?
  3. How do I figure out what caused this and prevent it from happening in the future? I’ve run check without -chunks many times in the past and no errors were found as all chunks were present.
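
For concreteness, here is the rough sequence I have in mind for question 2, assuming the “repository id” is the snapshot id stored in .duplicacy/preferences and that the bad chunks would be deleted with the b2 CLI (the chunk/file IDs are placeholders, and I’m not sure these are the right commands, hence the question):

# (1) delete each corrupted chunk from the bucket
b2 delete-file-version chunks/[chunkid] [fileid]
# (2) temporarily change the "id" field in .duplicacy/preferences
# (3) back up the same files again, re-reading everything, so the deleted chunks get re-uploaded
duplicacy backup -hash
# (4) change the "id" field in .duplicacy/preferences back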

Edit: I ran the same command (duplicacy check -r 265 -chunks -threads 32 -persist) a few more times and am left with more questions than answers. I ran it 4 more times after the first time. After run #1 (described above), I was told I had 42 corrupted chunks. After run #2, 5 chunks apparently verified correctly and I was left with 37 corrupted chunks. After run #3, it went down another 5 to 32 corrupted chunks. Run #4 didn’t successfully verify any chunks. Run #5 verified one more chunk and its output was:

31 out of 1405935 chunks are corrupted
Added 1 chunks to the list of verified chunks
Added 1 chunks to the list of verified chunks

What could cause this type of behavior where chunks take many many tries to verify successfully? Can I be confident that these chunks are actually not corrupted?

I am using Backblaze B2.

I’m wondering if what is getting corrupted is your local cache.

Can you delete the ~/.duplicacy/cache (or if you use web version the corresponding cache locations) and retry?
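
For the CLI version that’s something like the following (assuming the repository was initialized in your home directory; the cache always lives under the repository’s own .duplicacy folder, so adjust the path accordingly):

rm -rf ~/.duplicacy/cache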


I don’t think the local cache is the problem for two reasons:

  1. duplicacy check -chunks doesn’t seem to cache the chunks. I have >6TB of data backed up in 1.4M chunks, but .duplicacy/cache/ has only 108 GB of stuff.
  2. I was able to figure out how to get a list of corrupt chunks: running the command with -persist repeatedly logs an error for each corrupt chunk while skipping already-verified chunks (see the snippet after this list). However, none of the corrupt chunks can be found in .duplicacy/cache/.
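
For reference, this is roughly how I pulled the list of corrupt chunks out of the logs (the exact wording of the error lines may vary between versions, so treat the grep pattern as a starting point):

# re-run the check, continue past errors, and keep the log
duplicacy check -r 265 -chunks -threads 32 -persist 2>&1 | tee check.log
# extract the 64-hex chunk IDs from the lines that report a problem
grep -iE 'corrupt|fail' check.log | grep -oE '[0-9a-f]{64}' | sort -u > corrupt_chunks.txt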

If a chunk can be verified in one run but not in others, I tend to believe this is a memory corruption problem.

I would run the check again without the -threads option: duplicacy check -r 265 -chunks -persist. You only have 31 chunks to verify now, so there is no need for many threads.

If none of these 31 chunks can be further verified, I would suggest running the check on a different computer. Copy over the .duplicacy/cache/storage/verified_chunks file to focus on these 31 chunks.
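
Something like this should be enough to carry the state over (assuming the repository lives in the home directory on both machines and the storage name is default; adjust the paths if not):

ssh otherhost 'mkdir -p ~/.duplicacy/cache/default'
scp ~/.duplicacy/preferences otherhost:~/.duplicacy/preferences
scp ~/.duplicacy/cache/default/verified_chunks otherhost:~/.duplicacy/cache/default/verified_chunks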

Since my previous reply, I ran duplicacy check -r 265 -chunks -threads 32 -persist 7 more times (total of 12 times), and it brought the corrupted chunk count down to 11.

Upon your suggestion, I ran it a 13th time without the -threads 32 and it verified one more chunk, bringing the corrupt chunk count to 10.

I really doubt memory corruption is the issue here because this machine is running ECC RAM and there has never been any sign of memory corruption on the machine.

But just to follow through, I then copied the .duplicacy/preferences and .duplicacy/cache/default/verified_chunks files to a different physical machine running a different OS (original machine is CentOS 8; this one is Fedora 33) and ran duplicacy check -r 265 -chunks -persist. This got me a bit farther and verified 4 of the remaining 10 corrupted chunks, leaving 6 still marked corrupted.

It seems duplicacy tries 4 times per chunk before deciding a chunk is corrupt; that means some chunks are being marked as verified after 50+ tries, and some are still not verified and I don’t know if they are truly corrupt. This seems very bad.

This is deeply concerning. I wonder if this is a Backblaze API glitch? Or maybe your local filesystem or underlying storage media is screwed (unless you use ZFS/btrfs)?

As an experiment, can you duplicacy copy the revision to local repository (or better yet create a new backup in local repository) and verify that local repository? If you get a constant number of failed chunks (or zero in the case of a new repo), then it’s definitely an issue with the Backblaze API or with how duplicacy uses it.
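
Roughly along these lines; I’m going from memory, so double-check the exact add/copy flags against the docs (the storage name “local”, the snapshot id, and the path are placeholders):

# add a copy-compatible local storage to the same repository
duplicacy add -copy default local [snapshot-id] /mnt/scratch/duplicacy-local
# copy revision 265 from B2 to the local storage
duplicacy copy -from default -to local -r 265
# verify the chunks in the local copy
duplicacy check -storage local -r 265 -chunks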

This is deeply concerning. I wonder if this is a Backblaze API glitch?

I suppose this is possible, but it seems unlikely for such a widely used service which has literally one job. Is there a way I can try to troubleshoot it? Can I download a chunk manually and have duplicacy verify that specific chunk?

Or maybe your local filesystem or underlying storage media is screwed (unless you use ZFS/btrfs)?

I use RAID 6 and have never had any sign of trouble. Besides, in order for storage to be the issue, it would have to be so broken that temporary files (chunks being downloaded and verified) are failing to read back the bytes that were just written moments ago, and this seems very unlikely. Also, as mentioned, the same thing happens on a different machine.

As an experiment, can you duplicacy copy the revision to local repository (or better yet create a new backup in local repository) and verify that local repository?

Sadly, I don’t have enough disk space to be able to create a local copy.

Can you PM me a log file?

I agree that a Backblaze bug is highly unlikely, but a similar one happened to Wasabi before: 'cipher: message authentication failed' when downloading chunks from wasabi · Issue #207 · gilbertchen/duplicacy · GitHub, and we should not exclude any possibilities.

One thing you can try is to select a chunk that was reported as corrupt but has since been verified, download it multiple times to see if downloaded files are different.

Can you PM me a log file?

Yes I’d be happy to, just let me know what log you’re looking for and how to find or generate the log.

One thing you can try is to select a chunk that was reported as corrupt but has since been verified, download it multiple times to see if downloaded files are different.

Very interesting. When using the official B2 CLI with b2 download-file-by-name [bucket-name] chunks/[chunkid], the downloads almost always fail on the chunks in question. I arbitrarily chose 4 chunks that initially failed to verify with duplicacy check but later verified, and ran them in a download loop 100 times each. In those 400 download attempts, I never got a single successful download; every attempt ended in a SHA1 checksum mismatch.
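
Per chunk, the loop was essentially the following (it assumes the b2 CLI exits non-zero when its own SHA1 check fails, which is consistent with the errors shown below):

# try the same chunk 100 times and log whether B2 returned good data
for i in $(seq 1 100); do
    if b2 download-file-by-name [bucket] chunks/[chunkid] /tmp/chunk.$i > /dev/null 2>&1; then
        echo "attempt $i: OK $(sha1sum /tmp/chunk.$i)"
    else
        echo "attempt $i: sha1 checksum mismatch"
    fi
done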

So it seems hammering the same chunk with repeated attempts is not very helpful. However, when I instead ran a single download for each corrupt-but-later-verified chunk (there are 38 of these), 3 of them downloaded successfully with download-file-by-name. I even found an instance of a chunk flip-flopping between good and bad (chunkid refers to one single chunk in the snippet below):

[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 4
4: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 44.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 5
5: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 49.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 6
6: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 47.1MB/s]
File name:    chunks/[chunkid]
File id:      [fileid]
File size:    3267110
Content type: application/octet-stream
Content sha1: [sha1]
checksum matches

So it does in fact seem like B2, or at least my files on B2, are hosed.

But one question for Duplicacy: is it sending SHA1 checksums on upload? That should at least prevent transmission errors.

Can you contact Backblaze support and file a bug report? Also PM me the bug report number so I can ask them to escalate it.

We do send SHA1 checksums when uploading chunks to B2. We just don’t check them on downloads, though, because we use our own hashes for verification.
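
If you want to cross-check a single chunk by hand, you can compare a locally computed SHA1 against the one B2 has on record (on a successful download the b2 CLI prints a “Content sha1:” line, as in your transcript above; [bucket] and [chunkid] are placeholders):

b2 download-file-by-name [bucket] chunks/[chunkid] /tmp/chunk.bin
# compare this against the "Content sha1:" value reported by B2
sha1sum /tmp/chunk.bin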

I submitted a ticket and will PM you.


Providing an update for anyone following this… tl;dr Backblaze B2 is indeed buggy:

  • I contacted Backblaze B2 support, who explained to me that B2 stores each file across 20 shards and, as you might expect, not all shards are needed to reconstruct the file. But there is an issue in how they currently detect corrupt shards: instead of reconstructing the file from the shards that are intact, they sometimes return corrupted data.
  • A fix was promised for later this week.

I will update again once things are working so that worried readers can have faith in B2 once more.


Wow. Thank you for the update. I… don’t know what to say. Seems like core functionality is broken; how come nobody else noticed this?

This also raises a lot of questions – do they not guarantee data consistency? I.e. is this a “best effort” type of deal as opposed to “correct data or nothing”? So many questions…


I was told a few restic users ran into the same issue too: Fatal: number of used blobs is larger than number of available blobs! · Issue #3268 · restic/restic · GitHub

As long as they still have the original files intact it should be fine. I was surprised that their servers don’t verify files by the SHA1 checksums before sending them out.


Wow, indeed. 🤯

Sounds like it’s time to look for a provider that takes data integrity a bit more seriously…

Wow +1 😲

I’m left with the same doubts. I thought their redundancy design was very robust and practically fail-safe. Now I’m not so sure…

I’m here wondering if I should start evaluating others too… I don’t think there’s anywhere to escape to other than the big ones: AWS S3, Google Cloud, and Azure.

Sounds to me like the data is being stored correctly, but there’s a temporary bug in their storage logic that’s occasionally not able to pick out the correct blocks.

I don’t use Backblaze, but I wonder if you’d be able to do a copy from the (presently iffy) B2 to local storage and keep retrying when there are errors. That would show how fatal the issue is: whether it’s bad (data loss) or just intermittent, which they can eventually fix on their end.

I think it’s a bit early to kick B2 to the curb.

In every other interaction or anecdote that I’ve heard, Backblaze has been a top notch provider and responsive to questions. The restic thread even includes responses from a Backblaze developer.

It seems to me that this is likely a bug that was introduced recently, unless this is the first time they’ve ever had data corruption in a way that could affect requested customer data.

For a singular positive anecdote, I just completed a check -chunks for one of my B2 repos. 6700 chunks and 18GB data all came back fine. (As an aside, this was with Cloudflare proxying the data so I wasn’t charged for any of the download. Admittedly another reason why I like using B2. I may have to turn this into a scheduled check. Maybe even for my larger repo.)

In any event, they’re being fairly transparent regarding the issue on the restic thread. I’ll be watching this to see the outcome, but I currently still feel like my data is safe with them.

The issue is still not fixed but I wanted to chime in on a few things:

Thank you for linking this! The posts by nilayp contain more information than what I was given by Backblaze, but still don’t explain some important things like how this even happened or whether it was a recent regression.

Based on Backblaze’s description, permanent data loss would require 4/20 shards to break, which seems pretty unlikely, so data is probably intact.

I am also very surprised that they either do not check checksums on download, or they check it and return corrupted data anyway. Surely there should be some kind of alerting for returning broken data?

Another question is about how broken shards could be selected for downloads at all. This blog post says: “Every shard stored in a Vault has a checksum, so that the software can tell if it has been corrupted.” How can a broken shard pass a checksum validation?

I agree that it’d be premature to conclude that B2 is untrustworthy, as I’d like to hear more details from them first, but note that 6700 chunks is not very much. I had 42 failures out of 1.4M, a 0.003% failure rate; if these were independently random, you would expect 0.2 failed chunks out of 6700.
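
(Back-of-the-envelope arithmetic, in case anyone wants to check the numbers:)

awk 'BEGIN { p = 42/1405935; printf "failure rate: %.4f%%, expected failures in 6700 chunks: %.2f\n", p*100, p*6700 }'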

However, it really boggles my mind that this could even happen. The Restic issue has been open for 21 days, so this has been happening for at least that long. With 10 TB of data, I am nothing to Backblaze, yet even I am hitting this; there must be a HUGE number of mega customers being impacted by this right now, and if those customers aren’t carefully checking their downloads’ integrity, they are in turn using corrupted data, serving it to their customers, etc.


Yeah, that’s fair. I definitely want to see the outcome, and I’d think a follow-up breakdown of what went wrong, how it happened, and what is now in place to prevent it from happening again would be pretty much required. Given their drive stats posts, I’d expect it.