Feature Suggestion: Possibility of verifying hashes of chunk files using external tools

Hello,
Duplicati 2 has an “upload-verification-file” feature that uploads a .json file containing the hashes of all files (chunks) in the remote storage.

Description of this feature:
Use this option to upload a verification file after changing the remote storage. The file is not encrypted and contains the size and SHA256 hashes of all the remote files and can be used to verify the integrity of the files.

Users with access to the remote storage can then use Python or PowerShell scripts to verify the backup files remotely.
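To make the idea concrete, here is a minimal sketch of such a verification script, meant to be run on a machine that can reach the storage directly. The manifest layout used here (a JSON list of name/size/sha256 entries) is an assumption made for the example, not Duplicati’s or Duplicacy’s actual format.

```python
# Minimal sketch: check files on the storage server against a manifest of
# expected sizes and SHA256 hashes. The manifest format is an illustrative
# assumption, not the actual verification-file format of any tool.
import hashlib
import json
import os
import sys

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

def verify(manifest_path, storage_root):
    with open(manifest_path) as f:
        manifest = json.load(f)
    errors = 0
    for entry in manifest["files"]:
        path = os.path.join(storage_root, entry["name"])
        if not os.path.exists(path):
            print(f"MISSING  {entry['name']}")
            errors += 1
        elif os.path.getsize(path) != entry["size"]:
            print(f"BAD SIZE {entry['name']}")
            errors += 1
        elif sha256_of(path) != entry["sha256"]:
            print(f"BAD HASH {entry['name']}")
            errors += 1
    print(f"{errors} problem(s) found")
    return errors

if __name__ == "__main__":
    # usage: verify.py <manifest.json> <path-to-storage>
    sys.exit(1 if verify(sys.argv[1], sys.argv[2]) else 0)
```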

I think this might be useful for Duplicacy because:

  1. The risk of damage to backup data is more serious in Duplicacy, because it cannot recover from a missing or damaged chunk, and all snapshots referencing that chunk are affected/damaged.

  2. Running “check -files” is a very slow process.

  3. To run a check on a remote server, the user needs the Duplicacy binary and has to manually set up a repository with the storage configured as a local one.

  4. The storage password is needed to verify the storage on the remote server – so it can’t be done securely by someone else.

The main reason for checking file hashes is that it can serve as protection against “bit rot”, or detection of bad blocks etc. on one’s own servers.

It is out of the scope of Duplicacy’s duties to verify the reliability of the storage backend.

Duplicacy, like most other tools, works on the assumption that the storage is reliable: once a file is uploaded, it will not change.

If the user is concerned that files at rest on the remote storage could get corrupted, he/she is free to create such a verification file himself/herself – there is no need for Duplicacy’s involvement, it’s just a file – and check it from time to time. Personally, I would instead change storage providers if that became a concern.

I disagree that verifying the reliability of a storage is outside the scope of Duplicacy – it already has check and check -files for that purpose – however, I don’t see how actual protection against bit rot is necessarily a job for Duplicacy (unless it would be super-easy to implement some kind of optional recovery-record chunks?), or how it would be served by another kind of check. You’re not going to recover from errors merely by knowing a few chunks got corrupted, though regular checking is still good to do so you can act on it…

But I see the point – check -files is slow, and there is a way to speed up the process. Many cloud storages, and even SFTP with shell access, let you hash files remotely without downloading them; rclone uses this to determine matching files.

Now, Duplicacy could assist here and produce hashes of all generated chunks, but the complication of different cloud providers using different hash types would make the implementation a bit messy.

Instead, and in theory, a third-party tool could remotely retrieve hashes in whatever type it wanted and dump them to a file. You could regularly run this on a Duplicacy storage, processing only new files (chunks and snapshots) for efficiency. In fact, I wonder if rclone might already be able to do this?
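As a rough sketch of what such a tool could look like, the example below uses rclone’s md5sum command; for backends that store an MD5 (and for SFTP servers with an md5sum binary), rclone can return hashes without downloading the data. The remote name, the baseline file, and the assumption that your backend can hash server-side are all placeholders.

```python
# Sketch of a third-party out-of-band check built on "rclone md5sum".
import subprocess

REMOTE = "myremote:duplicacy-bucket"   # placeholder rclone remote + path
BASELINE = "chunk-hashes.txt"          # previously recorded "hash  path" lines

def fetch_remote_hashes(remote):
    out = subprocess.run(
        ["rclone", "md5sum", remote],
        check=True, capture_output=True, text=True,
    ).stdout
    hashes = {}
    for line in out.splitlines():
        digest, _, name = line.partition("  ")  # md5sum-style "<hash>  <path>"
        if digest.strip():                      # skip files the backend cannot hash
            hashes[name] = digest
    return hashes

def load_baseline(path):
    baseline = {}
    try:
        with open(path) as f:
            for line in f:
                digest, _, name = line.rstrip("\n").partition("  ")
                baseline[name] = digest
    except FileNotFoundError:
        pass                                    # first run: nothing recorded yet
    return baseline

def main():
    current = fetch_remote_hashes(REMOTE)
    baseline = load_baseline(BASELINE)
    for name, digest in sorted(current.items()):
        if name in baseline and baseline[name] != digest:
            print(f"CHANGED  {name}")   # chunks are immutable, so this means corruption
    for name in baseline:
        if name not in current:
            print(f"MISSING  {name}")   # lost, or legitimately removed by prune
    with open(BASELINE, "w") as f:      # record the current state for the next run
        for name, digest in sorted(current.items()):
            f.write(f"{digest}  {name}\n")

if __name__ == "__main__":
    main()
```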

Yes, not sure if you are agreeing or disagreeing, but to clarify what I meant:

There are two parts to this:

  1. Verifying the consistency of Duplicacy’s “datastore” and the restorability of all/any file(s) (assuming the storage is reliable).
  2. Detecting data rot (verifying the assumption above).

The first item absolutely is Duplicacy’s responsibility, and it is already implemented. This helps detect datastore corruption due to logical bugs (I use “datastore” here in the general sense of “database”; in Duplicacy’s case the implementation allows for quick checks that boil down to reading the snapshot files and then verifying that the specific chunk files they reference merely exist).

Anything beyond that is indistinguishable from “restore everything into /dev/null”: the only way to verify that files are restorable from flaky storage is to try to restore them, touching all the bits involved along the way. But this is very expensive to do from the client (in both the technical and financial senses) and therefore is not realistically suitable for periodic checks on most storages. (Most storage APIs operate on the assumption that the storage is reliable and therefore do not even provide facilities to force remote verification in the first place.)

Maintaining the integrity of the chunk files (against bit rot) therefore should be (and traditionally is) done on the server, and is usually delegated to the underlying storage (e.g. the built-in checksum facilities of filesystems such as ZFS and BTRFS, or the service-level guarantees of S3-type storage).

If the storage in use does not provide any of those facilities – then yes, I absolutely agree that out-of-band checks should be implemented by the user for periodic re-hashing, for example via the remote checksum support in rclone that you mentioned, or other server-side means.

There is a third part:

  3. Recovery from data rot.

But I don’t think Duplicacy – like most other backup tools – was designed with that in mind; it would require storage redundancy, and therefore inflated storage costs, to take over what is essentially the filesystem’s job.

Instead, and in theory, a third-party tool could remotely retrieve hashes in whatever type it wanted and dump them to a file. You could regularly run this on a Duplicacy storage, processing only new files (chunks and snapshots) for efficiency. In fact, I wonder if rclone might already be able to do this?

This third-party tool would have to work server-side; otherwise it is indistinguishable from downloading the entire datastore every time.

Well, it is recommended to have at least two backups, and Duplicacy has the copy feature for that. So there should be an easy recovery process without extra cost (or perhaps minimal transfer costs) if at least two backups exist and they are copy-compatible, no?

BTW: you can use the quote-feature multiple times in a post: just navigate to whichever post you want to quote, select text, and hit quote.


Yep! That’s another great approach – it’s more efficient to have two less-reliable backups than to invest tons of resources into improving the reliability of a single one (RAID operates on similar principles, after all).

Ah, there is a huge Quote button popping up… I did not realize it was a button… I used to copy/paste the markup instead, but on mobile it’s too much work. Awesome!

I think this is actually a big problem. @gchen and I have spoken about it in the past on (and off) this forum. It is a big hole in the methodology, in my humble opinion, and I actually worry about it quite a bit. Crashplan did server-side verification and healing… of course we are all a little upset with Crashplan these days, but nevertheless, they understood that anything can happen to the data as it is being uploaded or while it sits on the server.

@gchen had some reasons why he couldn’t easily address this but I forget what they were. Perhaps he can chime in again.

You can’t really compare duplicacy to Crashplan, though. Crashplan had to run both client and server-side. Duplicacy only runs client side.


Totally agree… it is nice to only run a client-side app… but nevertheless, I feel like this is a hole in an otherwise very robust backup scheme. Perhaps I am just being too paranoid :slight_smile:


What @kevinvinv asked for is Adding a feature to help with server side data integrity verification · Issue #330 · gilbertchen/duplicacy · GitHub, which is in my plan, but I wanted to tackle it when I have a chance to upgrade the encryption scheme to support per-user encryption (so with the master key the administrator can verify all chunks without knowing the per-user passwords).

The hashes of chunk files will enable us to find file corruption, but not file deletion (missing chunks). This looks like a simple feature to implement, but I wonder if this alternative would work without adding code to Duplicacy: you can periodically scan the storage and compute the hashes of new chunks and store them in a database. The only difference is that the file may become corrupted after Duplicacy uploads it and before you see it for the first time, but I guess if you do it frequently this would be less of a problem.
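A rough sketch of that out-of-band approach, assuming the storage is reachable as a filesystem path (local disk, SFTP mount, etc.); the paths and the one-table SQLite schema are illustrative assumptions:

```python
# Periodically scan the storage, hash chunks never seen before, and re-check
# previously recorded ones. A hash that changes after it was first recorded
# indicates corruption (chunks are immutable once uploaded).
import hashlib
import os
import sqlite3

STORAGE = "/mnt/duplicacy-storage"     # placeholder path to the storage
DB = "chunk-hashes.sqlite"

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

def main():
    db = sqlite3.connect(DB)
    db.execute("CREATE TABLE IF NOT EXISTS hashes (path TEXT PRIMARY KEY, sha256 TEXT)")
    known = dict(db.execute("SELECT path, sha256 FROM hashes"))
    for root, _, files in os.walk(os.path.join(STORAGE, "chunks")):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, STORAGE)
            digest = sha256_of(full)
            if rel not in known:
                db.execute("INSERT INTO hashes VALUES (?, ?)", (rel, digest))
            elif known[rel] != digest:
                print(f"CORRUPTED: {rel}")   # hash changed since it was first recorded
    db.commit()

if __name__ == "__main__":
    main()
```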

Hi @gchen, thanks for the input.

I personally, sorry to say, would want the hashes computed prior to uploading – this eliminates any possibility of a corrupted upload going unnoticed. :slight_smile:

Your suggestion of checking the chunk hashes over time is a good one too, but not as good as doing it prior to the heavy lifting of the transfer, etc. :slight_smile:

I’ll take this opportunity to promote a PR I put together a couple of weeks ago:
Snapshot uploaded chunk lengths by adapt0 · Pull Request #500 · gilbertchen/duplicacy · GitHub

It includes the final compressed chunk sizes in snapshots, allowing an additional chunk consistency check to be performed: checking both the existence and the size of the chunks on the destination. This can optionally be engaged during the client backup process to self-heal chunks that are of the incorrect size.

While this does not validate the actual server-side bit contents of the files, I have used it to self-heal storage where several chunk files had become corrupted (zero-sized) during the backup process – correcting not only the latest snapshot, but also several previous snapshots that referenced the same corrupted chunk.
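For illustration only (this is not the PR’s actual code or data format), the kind of check this enables boils down to comparing the chunks a snapshot references, with their recorded compressed sizes, against what exists on the destination. The expected_chunks mapping, the storage path, and the chunk IDs below are stand-ins:

```python
# Simplified sketch of an existence + size check against recorded chunk sizes.
import os

STORAGE = "/mnt/duplicacy-storage"     # placeholder path to the storage

# hypothetical {chunk_id: expected_compressed_size_in_bytes} read from a snapshot
expected_chunks = {
    "4e0c2f...": 1048331,
    "9ab1d3...": 2471005,
}

def chunk_path(storage, chunk_id):
    # typical file-storage layout nests chunks by hash prefix; other backends may differ
    return os.path.join(storage, "chunks", chunk_id[:2], chunk_id[2:])

for chunk_id, size in expected_chunks.items():
    path = chunk_path(STORAGE, chunk_id)
    if not os.path.exists(path):
        print(f"MISSING   {chunk_id}")   # candidate for re-upload / self-heal
    elif os.path.getsize(path) != size:
        print(f"BAD SIZE  {chunk_id} ({os.path.getsize(path)} != {size})")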

I don’t mean to derail this discussion, as I do believe that being able to verify the bit integrity of the remote storage chunks is super important. Just wanted to mention my PR, as it can help address some of these concerns.


That’s pretty cool. How easy would it be to add MD5 hashes (and optionally SHA1 or other hash types) instead of, or in addition to, chunk sizes? That way, you could remotely verify chunks from most cloud services - without re-downloading them - either as you upload them during a backup, or later during a check.
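For what it’s worth, computing several digests in a single pass over each chunk before upload is cheap; a rough Python illustration of the idea (Duplicacy itself is written in Go, so this is purely to show the concept):

```python
# Compute several digests of a chunk in one pass so that whichever hash the
# backend exposes (MD5, SHA1, etc.) can later be compared remotely.
import hashlib

def digest_chunk(data: bytes) -> dict:
    return {
        "md5": hashlib.md5(data).hexdigest(),      # e.g. S3 ETag for simple uploads
        "sha1": hashlib.sha1(data).hexdigest(),    # e.g. Backblaze B2 content SHA1
        "sha256": hashlib.sha256(data).hexdigest(),
    }

# These values could be stored in the snapshot next to the chunk size, and a
# later check could compare them against the hashes reported by the backend.
```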


It’s fairly easy to extend the snapshots with additional fields, along with extending the self-healing logic to compare against retrieved hashes. We’d need to look at all of the various storage backends to find out which hashes are best to store and retrieve for this, as well as how to get each backend to calculate the relevant hash…

For example, if we take a quick look at the backends I’ve been using:
