Feature Request: Backblaze B2 backend could/should use contentSha1 during check

In the b2_list_file_names API response there is a contentSha1 field containing the SHA1 hash of the chunk as computed by B2. If I understand the check code correctly, this hash is not used during the check, and I think using it would be a worthwhile improvement. To me it seems like a very good “middle way” between the current “check” (which only verifies the list of chunks?) and “check -files” (which downloads and compares each file, and is therefore rather expensive).

(BTW, there is also a contentLength field; do you compare file sizes during the check?)
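
For reference, this is the sort of thing a check could read out of a b2_list_file_names entry; a minimal Go sketch, assuming only the field names documented in the B2 API (fileName, contentLength, contentSha1):

```go
// Sketch only: the subset of a b2_list_file_names entry that such a
// check would care about. Field names follow the B2 API documentation;
// the struct itself is just illustrative.
type b2FileInfo struct {
	FileName      string `json:"fileName"`
	ContentLength int64  `json:"contentLength"`
	ContentSha1   string `json:"contentSha1"`
}
```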

Such a check would at least make me sleep better :slight_smile: Thanks a lot for an awesome piece of software!

Is B2 calculating those SHA1 hashes on the fly? Or is it storing them as object metadata? If it’s just reading the SHA1 it calculated during object creation, then I don’t think adding another hash to the check process actually tells you anything more about whether the storage provider did their job in protecting the integrity of every single chunk object.

Related discussion in this thread:

Duplicacy already sends the SHA1 during upload, so really this would just be to verify the integrity of the data the storage provider holds, rather than just the Duplicacy snapshots.

Well, if B2 actually uses these hashes to verify the chunk uploads (which I would do ;)), then it would be useful for this kind of check.

(What I’m after is a more waterproof check than just verifying that the local and remote lists of chunks are equal. Of course it involves some level of “trust” of the provider, but this kind of check is way cheaper than “check -files”. Hm, sorry, I’m repeating myself :))

Duplicacy already uses SHA1 with B2 to verify uploads. After the upload, if the hash that B2 calculates doesn’t match what Duplicacy calculates, then there’s an error. So for all of the blocks in storage this check is already being done.
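
To make that mechanism concrete, here is a rough Go sketch of upload-time verification as the B2 API documents it (this is not Duplicacy’s actual code; uploadURL and authToken would come from a prior b2_get_upload_url call): the client computes the SHA1 itself and sends it in the X-Bz-Content-Sha1 header, and B2 fails the upload if its own computation over the received bytes disagrees.

```go
package b2sketch

import (
	"bytes"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/http"
)

// uploadChunk is an illustrative sketch of upload-time SHA1 verification,
// not Duplicacy's implementation.
func uploadChunk(uploadURL, authToken, name string, chunk []byte) error {
	sum := sha1.Sum(chunk)

	req, err := http.NewRequest("POST", uploadURL, bytes.NewReader(chunk))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("X-Bz-File-Name", name)
	req.Header.Set("Content-Type", "application/octet-stream")
	// B2 recomputes the SHA1 of the bytes it receives and rejects the
	// upload if it does not match this header value.
	req.Header.Set("X-Bz-Content-Sha1", hex.EncodeToString(sum[:]))
	req.ContentLength = int64(len(chunk))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("upload rejected: %s", resp.Status)
	}
	return nil
}
```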

If by “this kind of check” you’re referring to a check long after the block is uploaded (not during upload), then I don’t think this is a given.

Duplicacy already stores the SHA256 hash for each chunk in the filename/path for the chunk. In order to use the SHA1 hash while checking the storage, Duplicacy would have to start storing the SHA1 hash somewhere so that it has something to compare against the values B2 returns.
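
To illustrate that point (the two-character directory split below is my assumption, not necessarily Duplicacy’s exact layout, and with encryption the chunk ID isn’t a plain SHA256 of the content), the chunk’s own name already encodes the hash it has to match:

```go
package b2sketch

import (
	"crypto/sha256"
	"encoding/hex"
)

// chunkPath shows how a chunk's hash can double as its storage path, so
// the expected value is recoverable from the file name alone. This is an
// illustration, not Duplicacy's exact naming scheme.
func chunkPath(chunk []byte) string {
	id := sha256.Sum256(chunk)
	hexID := hex.EncodeToString(id[:])
	return "chunks/" + hexID[:2] + "/" + hexID[2:]
}
```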

The only thing (that I can think of) that storing the SHA1 hashes would get us is validation that B2 didn’t silently corrupt the files after upload (since the hash is already checked during upload). However, that only holds if Backblaze calculates a fresh SHA1 checksum for every b2_list_file_names API request, which seems really inefficient; if it’s just reading back the SHA1 it calculated once when the block was uploaded, then I don’t see how comparing against the SHA1 that B2 returns adds any reassurance over what is already there.
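
For clarity, this is roughly what the proposed check would have to look like if Duplicacy kept its own SHA1 record per chunk. listChunkFiles and recordedSha1 are hypothetical helpers, not existing Duplicacy or B2 client functions:

```go
// checkRemoteSha1s is a hypothetical sketch: compare the contentSha1 that
// B2 reports for each chunk file against a SHA1 Duplicacy would have had
// to record at upload time.
func checkRemoteSha1s() error {
	files, err := listChunkFiles() // hypothetical wrapper around b2_list_file_names
	if err != nil {
		return err
	}
	for _, f := range files {
		want, ok := recordedSha1(f.FileName) // hypothetical local lookup
		if !ok {
			continue // not a chunk we track
		}
		if f.ContentSha1 != want {
			return fmt.Errorf("%s: B2 reports SHA1 %s, expected %s",
				f.FileName, f.ContentSha1, want)
		}
	}
	return nil
}
```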


Thanks a lot for your comments, leerspace!

With “this kind of check” I meant that duplicacy downloads the chunk list (which contains the contentSha1 field), then walks through the files I have locally, computes their checksums, and compares. So it wouldn’t have to store the hash locally, since it could compute it from the local files themselves.

Ah, hm. That last sentence made me realize that I might be misunderstanding something. Such a check would only be useful immediately after a backup (or at least as long as no local file was changed). But, isn’t the same thing true for “check -files”?

I have to think :slight_smile:

I don’t think this makes any sense; or at least, I don’t understand how this could possibly work. From B2’s perspective, it is storing (possibly encrypted) chunks of data in objects. And the chunks B2 has SHA1 hashes for could contain multiple files, parts of files, or sometimes only one file. On the client, duplicacy could calculate the SHA1 hashes of the files it has locally, but it doesn’t make sense to me to compare SHA1 hashes from chunks in B2 against SHA1 hashes of local files since there’s no way to map between the two. And even if there were, the local files likely changed for legitimate reasons and don’t have any bearing on the integrity of the snapshot and chunks in storage.

I don’t think this is true. The reason is that with check -files duplicacy downloads each chunk from the storage provider and re-hashes it to make sure it matches the expected SHA256 hash. check -files actually validates the integrity of the snapshots and chunks as they currently exist in storage, rather than just checking that the chunks all exist (which is what happens without the -files option).
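
As a rough sketch of that difference (downloadChunk is a hypothetical helper, and real chunk verification involves decryption and decompression before hashing, so the plain SHA256 below is a simplification):

```go
// verifyChunks illustrates the check -files idea: download each chunk a
// snapshot references, hash the bytes actually stored, and compare
// against the chunk ID the snapshot expects.
func verifyChunks(ids []string) error {
	for _, id := range ids {
		data, err := downloadChunk(id) // hypothetical helper
		if err != nil {
			return err
		}
		sum := sha256.Sum256(data)
		if hex.EncodeToString(sum[:]) != id {
			return fmt.Errorf("chunk %s does not match its expected hash", id)
		}
	}
	return nil
}
```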

I think I’ve said all I can without repeating myself on this thread, so I’ll give others a chance to weigh in! :slight_smile: