Corrupted chunks

Icydog · 22 February 2021 05:43

This is deeply concerning. I wonder if this is a Backblaze API glitch?

I suppose this is possible, but it seems unlikely for such a widely used service which has literally one job. Is there a way I can try to troubleshoot it? Can I download a chunk manually and have duplicacy verify that specific chunk?

Or maybe your local filesystem is screwed or underlying storage media (unless you use ZFS/btrfs)?

I use RAID 6 and have never had any sign of trouble. Besides, in order for storage to be the issue, it would have to be so broken that temporary files (chunks being downloaded and verified) are failing to read back the bytes that were just written moments ago, and this seems very unlikely. Also, as mentioned, the same thing happens on a different machine.

As an experiment, can you duplicacy copy the revision to local repository (or better yet create a new backup in local repository) and verify that local repository?

Sadly, I don’t have enough disk space to be able to create a local copy.

gchen · 22 February 2021 15:58

Can you PM me a log file?

I agree that a Backblaze bug is highly unlikely, but a similar one happened to Wasabi before: 'cipher: message authentication failed' when downloading chunks from wasabi · Issue #207 · gilbertchen/duplicacy · GitHub, and we should not exclude any possibilities.

One thing you can try is to select a chunk that was reported as corrupt but has since been verified, download it multiple times to see if downloaded files are different.

Icydog · 22 February 2021 19:28

Can you PM me a log file?

Yes I’d be happy to, just let me know what log you’re looking for and how to find or generate the log.

One thing you can try is to select a chunk that was reported as corrupt but has since been verified, download it multiple times to see if downloaded files are different.

Very interesting. When using the official B2 CLI with b2 download-file-by-name [bucket-name] chunks/[chunkid], the downloads seem to almost always fail on the chunks in question. I arbitrarily chose 4 chunks that initially failed to validate with duplicacy check but later verified, and ran them in a download loop 100 times each. In those 400 download attempts, I was never able to get a single successful download; every one ended in an SHA1 checksum mismatch.

It seems many repeated attempts are not very helpful. However I ran downloads again once per corrupt-but-later-verified chunk (there are 38 of these), and I was able to successfully get 3 to download with download-file-by-name. I even found an instance of a chunk flip-flopping between good and bad (chunkid refers to one single chunk in the snippet below):

[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 4
4: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 44.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 5
5: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 49.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket] chunks/[chunkid] 6
6: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 47.1MB/s]
File name:    chunks/[chunkid]
File id:      [fileid]
File size:    3267110
Content type: application/octet-stream
Content sha1: [sha1]
checksum matches

So it does in fact seem like B2, or at least my files on B2, are hosed.
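For anyone who wants to repeat this comparison independently of the CLI's own check, here is a minimal Python sketch that hashes several downloaded copies of one chunk and tallies the digests. The function names are made up for illustration, and the expected digest is assumed to be the "Content sha1" value that B2 reports for the file.

```python
import hashlib
from collections import Counter


def sha1_of_file(path, block_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it block by block."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def tally_downloads(paths, expected_sha1=None):
    """Count how many downloaded copies share each digest.

    expected_sha1 is optional; when given, the second return value is the
    number of copies whose digest matches it, otherwise it is None.
    """
    counts = Counter(sha1_of_file(path) for path in paths)
    matching = counts.get(expected_sha1, 0) if expected_sha1 else None
    return counts, matching
```

Calling tally_downloads(["4", "5", "6"], expected_sha1="...") on the three local copies from the session above would show whether the bad copies are all corrupted the same way (one wrong digest repeated) or differently each time (several distinct wrong digests).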

But one question for Duplicacy: is it sending SHA1 checksums on upload? That should at least prevent transmission errors.

gchen · 22 February 2021 20:56

Can you contact Backblaze support and file a bug report? Also PM me the bug report number so I can ask them to escalate it.

We do send SHA1 checksums when uploading chunks to B2. We just don’t check them on downloads though, because we use our own hashes for checksum.

Icydog · 22 February 2021 21:02

I submitted a ticket and will PM you.

Icydog · 24 February 2021 00:08

Providing an update for anyone following this… tl;dr Backblaze B2 is indeed buggy:

I contacted Backblaze B2 support, who explained to me that B2 stores each file across 20 shards, and as you might expect, not all shards are necessary to construct the file. But there is an issue in how they are currently detecting corrupt shards and instead of returning data from shards that are working, they are sometimes returning corrupted data.
A fix was promised for later this week.

I will update again once things are working so that worried readers can have faith in B2 once more.

saspus · 24 February 2021 01:54

Wow. Thank you for the update. I… don’t know what to say. It seems like core functionality is broken. How come nobody else noticed this?

This also raises a lot of questions – do they not guarantee data consistency? I.e. is this a “best effort” type of deal as opposed to “correct data or nothing”? So many questions…

gchen · 24 February 2021 03:39

I was told a few restic users ran into the same issue too: Fatal: number of used blobs is larger than number of available blobs! · Issue #3268 · restic/restic · GitHub

As long as they still have the original files intact it should be fine. I was surprised that their servers don’t verify files by the SHA1 checksums before sending them out.

tangofan · 24 February 2021 19:43

Wow, indeed.

Sounds like it’s time to look for a provider that takes data integrity a bit more seriously…

towerbr · 25 February 2021 10:54

Wow +1

I was left with the same doubt. I thought their redundancy design was very robust and practically fail-safe. Now I’m in doubt…

I’m here wondering if I should start evaluating others too … but I don’t think there are many alternatives short of moving “up” to the big providers: AWS S3, Google Cloud and Azure.

Droolio · 26 February 2021 14:59

Sounds to me like the data is being stored correctly but there’s a temporary bug in their storage logic that’s occasionally not able to pick out the correct blocks.

I don’t use Backblaze but I wonder if you’d be able to do a copy from the (presently iffy) B2 to local storage and keep retrying when there are errors. That would show how fatal the issue is: whether it’s bad (data loss) or just intermittent, which they can eventually fix on their end.

arno · 26 February 2021 15:24

I think it’s a bit early to kick B2 to the curb.

In every other interaction or anecdote that I’ve heard, Backblaze has been a top notch provider and responsive to questions. The restic thread even includes responses from a Backblaze developer.

It seems to me that this is likely a bug that was introduced recently, unless this is the first time they’ve ever had data corruption in a way that could affect requested customer data.

For a singular positive anecdote, I just completed a check -chunks for one of my B2 repos. 6700 chunks and 18 GB of data all came back fine. (As an aside, this was with Cloudflare proxying the data so I wasn’t charged for any of the download. Admittedly another reason why I like using B2. I may have to turn this into a scheduled check. Maybe even for my larger repo.)

In any event, they’re being fairly transparent regarding the issue on the restic thread. I’ll be watching this to see the outcome, but I currently still feel like my data is safe with them.

Icydog · 27 February 2021 07:53

The issue is still not fixed but I wanted to chime in on a few things:

Thank you for linking this! The posts by nilayp contain more information than what I was given by Backblaze, but still don’t explain some important things like how this even happened or whether it was a recent regression.

Based on Backblaze’s description, permanent data loss would require 4/20 shards to break, which seems pretty unlikely, so data is probably intact.

I am also very surprised that they either do not check checksums on download, or they check them and return corrupted data anyway. Surely there should be some kind of alerting for returning broken data?

Another question is about how broken shards could be selected for downloads at all. This blog post says: “Every shard stored in a Vault has a checksum, so that the software can tell if it has been corrupted.” How can a broken shard pass a checksum validation?

I agree that it’d be premature to conclude that B2 is untrustworthy, as I’d like to hear more details from them first, but note that 6700 chunks is not very much. I had 42 failures out of 1.4M, a 0.003% failure rate; if these were independently random, you would expect 0.2 failed chunks out of 6700.
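The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: it takes the 1.4M figure as exact and assumes failures are independent, which is what allows the Poisson approximation for the chance of a clean run.

```python
import math

failed_chunks = 42          # chunks reported as corrupt in the large repository
total_chunks = 1_400_000    # approximate total chunk count (1.4M)
sample_chunks = 6700        # size of the smaller repository that checked clean

failure_rate = failed_chunks / total_chunks
expected_failures = failure_rate * sample_chunks

# Poisson approximation (assumes independent failures): probability that a
# check of sample_chunks chunks sees no failure at all.
p_no_failures = math.exp(-expected_failures)

print(f"failure rate: {failure_rate:.4%}")
print(f"expected failures in {sample_chunks} chunks: {expected_failures:.2f}")
print(f"chance of a clean check: {p_no_failures:.0%}")
```

Under those assumptions a clean check of 6700 chunks is the likely outcome (roughly four times in five) even at the observed failure rate, so it says little either way.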

However it really boggles my mind that this could even happen. The Restic issue has been open for 21 days, so this has been happening for at least that long. With 10 TB of data, I am nothing to Backblaze yet even I am hitting this; there must be a HUGE number of mega customers being impacted by this right now, and if those customers aren’t carefully checking their downloads’ integrity, they are in turn using corrupted data, serving it to their customers, etc.

arno · 27 February 2021 13:59

Yeah, that’s fair, and I definitely want to see the outcome. A follow-up breakdown of what went wrong, how, and what is now in place to prevent it from happening again would be pretty much required, I’d think. Given their drive stats posts, I’d expect it.

arno · 1 March 2021 03:27

The restic issue has been updated with a report that the issue was fixed on 26 Feb.

Any updates to the Backblaze issues or correspondence that anyone here started?

Icydog · 1 March 2021 04:11

Very interesting. The issue is definitely not fixed for me. I just tried using the B2 CLI to download the 42 chunks that Duplicacy reported as corrupted, and the results were 3 successful downloads and 39 SHA1 checksum mismatches. I’ll follow up on my Backblaze support ticket.

Icydog · 2 March 2021 01:50

The response was:

Our engineers have been looking into your account and it appears that after resolving the first issue plaguing your bucket, a second issue had occurred. However, they have informed me that another change went live around noon today, Pacific time.

My level of confidence in B2 keeps dropping with each interaction… two separate data corruption bugs? However, my B2 account seems to be working now and I’m no longer seeing download corruption. I asked for more details on what actually happened here. If they respond, I’ll update this thread.

arno · 2 March 2021 05:29

Thanks for keeping us updated. At least it sounds like no data was actually lost?

Icydog · 3 March 2021 06:52

Right, I don’t think any of my data was lost (I haven’t yet run another full verify due to the cost). It seems what happened is:

Rather than verifying checksums on download, Backblaze relied on an ongoing async job that would look for corruption.
A bad batch of hard drives recently caused a bunch of corruption that made this job run much more slowly than usual, causing it to take a long time to find the broken shards I and the Restic users were downloading.
The issue was affecting at least two vaults. The Restic users and I have our data on different vaults.
A fix which verifies checksums on download was written and deployed on the vault with the Restic users’ data first, but didn’t roll out to every vault until later, which is why I still saw broken files for a while.
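To illustrate the difference between relying on a background scan and verifying on download, here is a rough Python sketch of the second approach. It is not Backblaze's implementation: the shard layout (a dict mapping shard index to data and stored checksum), the function names, and the use of SHA1 per shard are all assumptions made for the example, and the actual erasure-code reconstruction is left out.

```python
import hashlib


def sha1_hex(data):
    """Hex SHA1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


def select_verified_shards(shards, needed):
    """Pick `needed` shards whose data matches its stored checksum.

    shards: dict mapping shard index -> (data, stored_sha1). Hypothetical layout.
    Returns a list of (index, data) pairs, in index order, for shards that
    verified. Raises ValueError instead of ever returning unverified data.
    """
    verified = [
        (index, data)
        for index, (data, stored_sha1) in sorted(shards.items())
        if sha1_hex(data) == stored_sha1
    ]
    if len(verified) < needed:
        raise ValueError(
            f"only {len(verified)} shards verified, {needed} required"
        )
    return verified[:needed]
```

With 20 shards and needed=17 (the numbers implied by the "4/20 shards" description earlier in the thread), up to three corrupt shards are skipped silently and the download still succeeds; a fourth makes the read fail loudly, which is the "correct data or nothing" behaviour asked about above.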

My takeaway from this is that though it’s still a little surprising that downloads were never verified until now, the whole response to this seems pretty quick and effective, and I’m personally reassured enough to continue trusting Backblaze with my data.

arno · 3 March 2021 14:16

Thanks for the update and summary. And yeah, the response seems to have been reasonably quick.

Regarding your verify cost, it’s fairly easy to get free downloads by sending your data through Cloudflare. At that point you’re only going to be charged for the API access. I suppose with a large repository that could still end up costing more than you’d want.

Thanks again.