File Integrity Check - Verifying Snapshot is Identical to Repository

Hello, I am interested in Duplicacy and have been testing its features. Does it back up the repository by hashing chunks? I see there isn't a mount command. I'd like to periodically test that the files in the backup match those in the repository, without restoring, using hashes along with ctime/mtime.

What is the best way to do a file integrity check of the latest snapshot against the local repository, without restoring all the files to a location and using rsync to compare? I don't have the extra storage to double the data. Thanks!

duplicacy check -chunks will check the integrity of the chunks in the storage. To avoid egress, run it on the target server. But you should not need to do that if your storage provides data integrity guarantees. If it does not, use different storage.
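For example, a minimal sketch of running that verification on the storage host (the revision number below is hypothetical; without -r, all revisions are checked):

```sh
# Run from an initialized repository directory on the machine hosting the storage.
# -chunks downloads every referenced chunk and verifies its hash.
duplicacy check -chunks

# Or limit the verification to a single revision (revision number is made up).
duplicacy check -chunks -r 42
```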

The PDF here describes the design: Duplicacy paper accepted by IEEE Transactions on Cloud Computing

Do you mean finding the difference between local files and the ones backed up?

Thanks for the reply! To make sure I understand the duplicacy check command: I was under the impression it did not compare hashes of the files in the repository to the files in the snapshot. I want to occasionally verify that the files in the latest snapshot match the files in the repository.

Is there a command to do this, or is there a mount command that would allow comparison between the repository and the snapshot?

Yes, I want to compare hashes of the local files to the ones backed up.

As described at the link above, add -hash to the diff command.

But the more important question is — why do you want to do that?

Thanks, I'd like to know that the backed-up files match the ones in the local filesystem. There could be bit rot or something, and I want to be made aware of it, whether it is in the local filesystem or in the backup.

If I don't test, how would I know whether a file has been silently altered in the filesystem or in the backup by a hash collision or bad hardware?

From reading about duplicacy diff, I thought it compared snapshot revisions against each other in the storage. I wasn't aware I could compare a snapshot revision to the local filesystem.

On the same page:

If only one revision is given by -r , the right hand side of the comparison will be the on-disk file.
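A minimal sketch of that comparison, run from the repository root (the revision number is hypothetical; duplicacy list shows which revisions exist):

```sh
# List the available snapshot revisions for this repository.
duplicacy list

# Compare revision 42 (made up) against the files currently on disk,
# recomputing file hashes rather than relying on sizes and timestamps.
duplicacy diff -r 42 -hash
```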

Bit rot on the local filesystem does not matter: you have version history to restore to a good copy when it becomes a problem. And if local bit rot cannot be tolerated at all, use a file system that guarantees data integrity, so it won't be an issue to begin with.

Bit rot on the target shall not be allowed to happen. Pick storage that guarantees data integrity. For example: an ext3 disk is bad; btrfs or zfs with checksumming enabled is good. If the target media rots, you won't be able to decrypt the chunks.

Backup history takes care of this. You don't need to worry about source data rotting, because you have an intact version backed up as well as the rotted one; you can restore from either. The same goes for bad hardware. As long as your target is reliable, it does not matter if your source was encrypted by ransomware five backups ago: you can still restore your data. Hence, the most important thing here is to pick reliable target storage.

I would not worry about hash collision.

In other words, it's not the job of a backup program to combat bit rot. Duplicacy supports erasure coding, and you can enable it to somewhat mitigate the risk of losing data to bit rot, but this is just risk reduction, not elimination.
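If you do want that extra layer, erasure coding is enabled when a storage is initialized. A sketch, assuming a 5 data / 2 parity shard configuration (the snapshot ID and storage URL are placeholders):

```sh
# Initialize a storage with erasure coding: each chunk is stored as
# 5 data shards plus 2 parity shards, so it can survive partial corruption.
duplicacy init -erasure-coding 5:2 my-documents b2://my-backup-bucket
```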


This is great, thanks for pointing this out: "If only one revision is given by -r, the right hand side of the comparison will be the on-disk file." I appreciate all the information.

Since our previous conversation I have thought more about integrity and your post. I appreciate the discussion and your insight. One question I have is: how do you handle version history with duplicacy? Is there a way to log, during backup, all files that changed, were added, or were deleted? I haven't found a command-line switch that outputs deleted files other than using diffs, and I don't know of a way to automate diff checks after backups.

Duplicacy logs which files it picked up, i.e. changed and new files. It does not log deleted files, because figuring those out would require extra, unnecessary work: duplicacy is designed so that each backup behaves like an independent snapshot. This means duplicacy does not need to track what has been deleted or changed compared to the previous snapshot. It creates a new backup with no assumptions: it collects all files into a sausage, slices it up at cleverly determined boundaries into chunks, and then uploads those chunks to the destination. If a chunk file is already in the storage, the upload is skipped. Chunk file names are derived from their content (see Content Addressable Storage), so this operation is superfast and relies only on basic file system primitives.

This results in only the changed data being uploaded, with no special tracking required of which files, if any, were changed or deleted: if a file is deleted, there is nothing to do. The deleted file will not be picked up into a sausage, will not end up in a chunk, and won't be present in the snapshot, and that's the end of it.
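A minimal sketch of the content-addressable idea, for illustration only (Duplicacy's real scheme derives chunk IDs from a keyed hash of the chunk content, not a plain sha256sum, and the paths here are made up):

```sh
# Name a chunk after the hash of its content, then upload it only if a
# chunk with that name is not already present in the storage.
chunk="/tmp/chunk.bin"
name=$(sha256sum "$chunk" | awk '{print $1}')
dest="/mnt/backup-storage/chunks/$name"
[ -e "$dest" ] || cp "$chunk" "$dest"
```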

If you want to determine which files have been deleted, or otherwise compare snapshots, you can use the diff command.
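If you want to automate that after each backup, something along these lines could work. This is only a sketch: the way it extracts the last two revision numbers from duplicacy list output is an assumption about the log format, not a documented interface.

```sh
#!/bin/sh
# Back up, then diff the two most recent revisions to see what changed,
# including files that disappeared since the previous snapshot.
duplicacy backup -stats

# Pull the last two revision numbers out of the list output (format assumed).
revs=$(duplicacy list | grep -o 'revision [0-9]*' | awk '{print $2}' | tail -n 2)
prev=$(echo "$revs" | head -n 1)
last=$(echo "$revs" | tail -n 1)

duplicacy diff -r "$prev" -r "$last"
```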

Further reading:


This was the best non-technical explanation I’ve seen on how Duplicacy works :grinning_face_with_smiling_eyes:

I always say: simple is always better (KISS), and this simple phrase perfectly explains how the software works.

It deserves a bookmark :wink:
