Very fast check of files totalling 228 GB


#1

Based on this, can I be sure that everything is fine with the backup? The check takes less than two minutes.
Duplicati takes very long to check the very same archive. I understand that Duplicati works on a different principle, but I want to be sure that when the archive is needed, it will actually work.

 Running check command from C:\Users\arty/.duplicacy-web/repositories/localhost/all
    Options: [-log check -storage test_local -a -tabular]
    2019-06-11 01:00:54.112 INFO STORAGE_SET Storage set to D:/TEST/DUPLICACY_WEB/TEST_BACKUP
    2019-06-11 01:00:54.245 INFO SNAPSHOT_CHECK Listing all chunks
    2019-06-11 01:02:14.091 INFO SNAPSHOT_CHECK 1 snapshots and 4 revisions
    2019-06-11 01:02:14.092 INFO SNAPSHOT_CHECK Total chunk size is 228,022M in 49988 chunks
    2019-06-11 01:02:14.652 INFO SNAPSHOT_CHECK All chunks referenced by snapshot test_backup at revision 1 exist
    2019-06-11 01:02:15.015 INFO SNAPSHOT_CHECK All chunks referenced by snapshot test_backup at revision 2 exist
    2019-06-11 01:02:15.357 INFO SNAPSHOT_CHECK All chunks referenced by snapshot test_backup at revision 3 exist
    2019-06-11 01:02:15.994 INFO SNAPSHOT_CHECK All chunks referenced by snapshot test_backup at revision 4 exist
    2019-06-11 01:02:18.795 INFO SNAPSHOT_CHECK 
                snap | rev |                          |  files |    bytes | chunks |    bytes |  uniq |    bytes |   new |    bytes |
         test_backup |   1 | @ 2019-06-09 12:54 -hash | 173994 | 282,483M |  49936 | 227,922M |    16 |  23,178K | 49936 | 227,922M |
         test_backup |   2 | @ 2019-06-09 21:47       | 173930 | 282,484M |  49937 | 227,922M |     0 |        0 |    17 |  23,442K |
         test_backup |   3 | @ 2019-06-10 01:00       | 173930 | 282,484M |  49937 | 227,922M |     0 |        0 |     0 |        0 |
         test_backup |   4 | @ 2019-06-11 01:00       | 174051 | 282,531M |  49956 | 227,977M |    35 |  79,072K |    35 |  79,072K |
         test_backup | all |                          |        |          |  49988 | 228,022M | 49988 | 228,022M |       |          |

#2

Yep, this means that everything is OK with your backups!


#3

Yes, and maybe.

Yes in the sense that it verified that all chunks necessary to reconstruct the files are present in the storage. But it does not verify that the data in the chunks is valid/original/uncorrupted.

It is assumed that once a file is written to the storage, it is immutable. This is implied, but not always guaranteed.

Verifying that would require downloading the entire dataset from the remote destination, since Duplicacy runs on the client. And then, it's not really the job of backup software to question storage integrity: what are you going to doubt next? CPU correctness? The space-time continuum?

That is why it is important to use bit-rot-aware storage as a destination (such as ZFS or Btrfs), not just a plain old hard drive with FAT32.


#4

I use another local drive for a local backup (for faster restores), a 4-bay Synology for a backup stored at another location, and a third, cloud backup (in case there is suddenly a fire and everything burns down).
And I want to understand how best to organize the checks so as not to wait very long while, on the other hand, being sure that the archives are really reliable.


#5

Textbook perfect!

> And I want to understand how best to organize the checks so as not to wait very long while, on the other hand, being sure that the archives are really reliable.

With the local storage, you can pass the -files parameter to the check command and Duplicacy will validate the integrity of all files.

For the Synology-hosted storage: assuming you use Btrfs there and run a periodic (at least annual) scrub, you are fine; Synology will take care of integrity.

For the cloud backup: you pay them money to store data for you, so they had better not corrupt it. Most storage providers have various tiers of guarantees; make sure you choose one that guarantees data integrity.


#6

IMO, Duplicacy could potentially help out here, in the same way that rclone check can verify the integrity of remote files without having to download them: instead, it could use the cloud APIs to fetch the hashes.

The only issue is where to store the hashes. I see two ways…

  1. The snapshot file; a hashes key/sequence, alongside chunks and lengths. This has the benefit of storing this additional metadata inside de-duplicated chunks, making it scale well.

  2. Store the hash as part of the chunk filename, e.g. <id>.<hash>. Dunno how much extra processing Duplicacy would have to do in parsing filenames.

Then you could run duplicacy check -hash and be extra confident that your data is in good nick.

Different cloud providers support different hash types (MD5, SHA-1, etc.), so Duplicacy would need to pick the best one and handle things when copying between storages (even manually), or when the hash isn't present (legacy).
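To make the "pick the best available hash" idea concrete, here is a rough Python sketch. The function name, the return values, and the preference order are mine, not Duplicacy's; `remote_hashes` stands in for whatever hash metadata a given cloud API reports for a chunk:

```python
import hashlib

# Hypothetical preference order, strongest first. Which of these a
# given backend actually reports varies by provider.
PREFERRED = ["sha256", "sha1", "md5"]

def verify_remote_chunk(local_bytes: bytes, remote_hashes: dict[str, str]) -> str:
    """Compare our own hash of the chunk against whatever hash the
    remote API reports. Returns 'ok', 'mismatch', or 'unverifiable'
    (a legacy chunk with no stored hash at all)."""
    for algo in PREFERRED:
        if algo in remote_hashes:
            local = hashlib.new(algo, local_bytes).hexdigest()
            return "ok" if local == remote_hashes[algo].lower() else "mismatch"
    return "unverifiable"
```

The 'unverifiable' case is the legacy fallback mentioned above: a check command built on this could warn rather than fail when no hash exists for a chunk.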

I know it’s been talked about before on here but I thought I’d throw this idea out there again as a worthwhile feature request. :slight_smile:


#7

This would help guard against someone else (maybe even Duplicacy itself, or the user via other tools) messing with and corrupting the chunk files through the cloud service API, which I agree would be a useful feature to have.

However, it would not remove the need to select a cloud service that guarantees consistency: otherwise they may just be optimizing on their end and storing the hash instead of computing it per request, so they would return the correct hash as of the last API call and yet the stored data could have long since rotted.


#8

You raise a very good point, actually, and one I did think about. I wonder how each cloud provider handles it; that would be useful to research!

But remember, you can have local and SFTP storages too; the latter, with shell access, can also hash on the remote side, and that is definitely done in real time.


#9

So SFTP can do the hashing on any remote server (with SFTP access, of course)? Interesting. So assuming that all cloud providers provide hashes via their API, the only backend where this wouldn’t work is WebDAV, right?

Not exactly what you’re looking for but because I stumbled upon it:

https://doc.owncloud.org/desktop/2.4/architecture.html#upload

I have a slight suspicion that this is what most cloud storage providers do: calculate the hash upon upload and store it with the file… Not good.


#10

According to the rclone docs, yes - provided there’s shell access to md5sum or sha1sum and echo.

It appears WebDAV can do remote hashing, but only if hosted on Nextcloud.

Damn, that’s a shame.

In that case, it would only serve a useful purpose if Duplicacy checks against the hash on each upload. Storing hashes would then be fruitless, but doing the extra check on the fly might still be beneficial. In fact, I wonder, does it already do that?
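The on-the-fly check being wondered about could look something like this sketch. The `upload` callback and the choice of MD5 are assumptions here (which hash, if any, a backend reports on receipt varies by provider, e.g. an S3-style MD5 ETag); this is not how Duplicacy's backends are actually structured:

```python
import hashlib

def upload_with_verification(chunk: bytes, upload) -> bool:
    """Upload a chunk, then compare the hash the backend reports
    against one computed locally before sending. `upload` stands in
    for a backend call that returns the hash the service computed on
    receipt; None means the backend reported no hash at all."""
    expected = hashlib.md5(chunk).hexdigest()
    reported = upload(chunk)
    return reported is not None and reported.lower() == expected
```

If the reported hash mismatches, the client could simply re-upload the chunk — catching in-transit corruption immediately, without any stored hashes.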