"-hash" option on existing backup picks up all files for reupload

tangofan · 25 April 2020 22:09

I’m currently running Duplicacy-Web 1.3.0 inside a docker container on a Synology NAS. My original backup was started on March 15th with Duplicacy-Web 1.2.0. I normally do not use the “-hash” option for my backups, but was trying it out in a dry-run. (FWIW even the original backup did not use the -hash option either according to the logs.)

To my great surprise the log of this dry-run with the “-hash” option showed that the backup actually picked up every single file in my backup source for backup. Shouldn’t Duplicacy keep track of the hash values (even when no “-hash” option is specified) and so only pick up those files whose hash value has changed? Certainly the extra effort to calculate the hash value of changed files would be negligible.

Droolio · 25 April 2020 23:21

While the hash value of the files may be the same, the hash value of the chunks when they were first uploaded may be different, since they may have been packed together with other files which have since been pruned. Duplicacy should still attempt to skip most chunks if they’re already there, thus saving on bandwidth and disk space.

The effect of -hash means some fresh chunks may get uploaded - particularly at the boundaries between the start and end of files - where new files are added and old files are deleted since they were first packed.
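A toy sketch of that boundary effect, assuming files are packed into one stream and split by a content-defined chunker. The rolling-sum rule, the chunk sizes and every name here are made up for illustration and are not Duplicacy's actual algorithm:

```python
import hashlib
import random


def chunk_stream(data, window=16, divisor=64, min_size=32, max_size=256):
    """Cut a chunk where the sum of the last `window` bytes is divisible by `divisor`."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        rolling = sum(data[i - window + 1:i + 1])
        if rolling % divisor == 0 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def chunk_ids(files):
    """Pack all files into one stream, chunk it, and name each chunk by its hash."""
    stream = b"".join(files)
    return [hashlib.sha256(c).hexdigest() for c in chunk_stream(stream)]


def make_files(count=4, size=1000, seed=0):
    """Deterministic pseudo-random file contents (illustrative data only)."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(size)) for _ in range(count)]


files = make_files()
before = chunk_ids(files)                   # chunks from the original backup
after = chunk_ids(files[:1] + files[2:])    # same files, second one deleted
fresh = set(after) - set(before)            # chunks the storage has not seen
print(f"{len(after)} chunks after the deletion, {len(fresh)} not in storage yet")
```

Chunks that lie entirely inside an untouched file tend to come out identical, so they can be skipped; the chunk that spans the spot where the deleted file used to sit gets a new hash and would be uploaded.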

tangofan · 26 April 2020 02:13

@Droolio,

So if I understand you correctly, Duplicacy doesn’t actually calculate and remember any hashes for individual files, but instead just packs all files into chunks again, and only those chunks that don’t exist on the storage (as determined by the chunk’s hash value/name) are actually uploaded.

Is that correct?

Droolio · 26 April 2020 16:36

No, it does remember file hashes as well.

One reason is verifying data integrity when doing a restore or with check -files. But when asking it to backup with -hash, it recomputes hashes on the chunk level and then checks if the chunk exists on the destination before uploading.

Most chunks should already exist and will be skipped due to de-duplication. Because some old chunks will have parts of files that were long deleted or moved about since they were first packed, those hashes may not line up with some of your new hashes. Effectively, this makes the storage more efficient since it clears out remnant data.

Without -hash, Duplicacy doesn’t look at the contents of files whose metadata (i.e. size and modification timestamp) hasn’t changed.
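A minimal sketch of such a metadata-only check, and of the blind spot it leaves. `Entry`, `record` and `looks_unchanged` are hypothetical names, not Duplicacy code:

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """What a previous snapshot is assumed to remember about a file."""
    size: int
    mtime_ns: int


def record(path):
    st = os.stat(path)
    return Entry(st.st_size, st.st_mtime_ns)


def looks_unchanged(path, previous: Optional[Entry]) -> bool:
    """True when size and modification time match the previous snapshot."""
    return previous is not None and record(path) == previous


def demo_blind_spot() -> bool:
    """Rewrite a file with same-size content and restore its timestamp."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "report.xls")
        with open(path, "wb") as fh:
            fh.write(b"AAAA")
        previous = record(path)
        st = os.stat(path)
        with open(path, "wb") as fh:
            fh.write(b"BBBB")  # different content, same size
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))  # put the old mtime back
        return looks_unchanged(path, previous)
```

`demo_blind_spot()` returns True: the content changed, but a size-and-timestamp comparison cannot see it. Only reading the file contents would catch that case.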

tangofan · 26 April 2020 20:18

Perhaps I’m a bit dense, but the current backup behavior doesn’t make any sense to me, if Duplicacy already knows the file hashes.

Since the backup will get the list of files with timestamps, it could presumably get the file hashes as well at that point.
And then all it needs to do is to hash the local file and compare it to the known file hash. If the hash matches, the file can be skipped immediately (just like a matching timestamp, when the hash option isn’t used).
And if the hash doesn’t match, then the file should be processed further.

So there wouldn’t be any need to build all those chunks, just to discard most of them. And the logs would be much cleaner, since it would only list files whose current hash actually is different from the known hash in the backup set.

Pinging @gchen on this one…

Droolio · 26 April 2020 21:43

This wouldn’t clean up the remnants of old files still stuck in partial chunks. If you’ve moved a lot of files about, deleted stuff, added new files, a backup with -hash will ensure these are repacked. Though most chunks shouldn’t require re-uploading, it still assumes all files have been touched.

Duplicacy does this in one pass. Hashing a file (maybe a big file) and then having to go back and chunk and hash those chunks again if the file doesn’t match… which file? It could compare the previous hash of the file with the same pathname. What if that file moved location? Or was renamed?

tangofan · 26 April 2020 23:34

Well, so what if it doesn’t? Is the purpose of -hash to “clean up” your chunks or to catch files that haven’t been backed up because their content changed without the timestamp changing? I had always assumed the latter.

Yes, it should compare against the file with the same file- and pathname. If that file was moved or renamed, then it would proceed as it does now: Put the moved/renamed files into chunks and - if they don’t exist already - upload those chunks.

So what I am proposing does not remove any steps from the current algorithm, it just adds an additional step to skip files in place whose hash value hasn’t changed. And I’d say that would be the vast majority of cases. And even if it weren’t, then nothing would be lost, since the extra computation (if any) is minimal.

Droolio · 27 April 2020 12:21

I’d say the side-effect of -hash is that it cleans up remnant data. All it does is to treat every file as ‘new’.

Why? Duplicacy doesn’t care about moved files and renames - that’s how the de-duplication works.

Hmmm. What you’re proposing is a special case whereby at best it hashes each file in one pass, or at worst in two passes. Potentially very costly with very large files. Versus the current algorithm which does it all in one pass and gets rid of remnant parts of files (which could be many).
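A rough sketch of the one-pass versus two-pass cost, counting how many bytes each approach reads. It simplifies heavily: fixed-size chunks, one file at a time, and `CountingReader`, `chunk_and_hash` and `precheck_then_chunk` are illustrative names rather than anything in Duplicacy:

```python
import hashlib
import io


class CountingReader:
    """Wraps file content and counts every byte handed out."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, n):
        block = self._buf.read(n)
        self.bytes_read += len(block)
        return block

    def rewind(self):
        self._buf.seek(0)


def chunk_and_hash(reader, chunk_size=4):
    """Single pass: update the file hash and hash each chunk as it is read."""
    file_hash = hashlib.sha256()
    chunk_hashes = []
    while block := reader.read(chunk_size):
        file_hash.update(block)
        chunk_hashes.append(hashlib.sha256(block).hexdigest())
    return file_hash.hexdigest(), chunk_hashes


def precheck_then_chunk(reader, known_file_hash, chunk_size=4):
    """Hash the whole file first; chunk it in a second pass only on a mismatch."""
    file_hash = hashlib.sha256()
    while block := reader.read(chunk_size):
        file_hash.update(block)
    if file_hash.hexdigest() == known_file_hash:
        return None  # unchanged: skipped after one full read
    reader.rewind()
    return chunk_and_hash(reader, chunk_size)
```

With this model an unchanged file costs one full read either way, while a changed file costs two full reads with the pre-check and one without. Whether the pre-check pays off then depends on how many files really changed and how large they are.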

tangofan · 29 April 2020 07:16

@Droolio,

Before continuing with the discussion I’d just like to say that I appreciate very much that you continue to engage with me on this topic.

Indeed you are correct. That’s why I think that the name “-hash” is a particularly bad one. The option should have been named something like “-forceall” to indicate that it forces the selection of all files for backup and packs them into chunks (and only the deduplication step will weed out already existing chunks).

That’s how the deduplication step works. However, the previous step, the file selection step, does NOT work that way, at least not for timestamp comparisons. There (I presume) it compares timestamps from the storage against the corresponding file with the same file- and path name.

Actually my option would not incur any extra cost. As you yourself said, Duplicacy already calculates the file hash for backup checks, so it has to do that anyway. And it would avoid the extra cost of packing all those files into chunks only to discard a lot of them.

Now I could see that there is actually a place for the current behavior, so here is what I am proposing:

Option: -forceall [The current “-hash” behavior (as I understand it)]
Duplicacy selects all files (from the file selection set) for backup and packs them into chunks. Only those chunks that don’t exist yet will be saved/uploaded to the storage.
Option: -filehash (just to give a slightly different name)
Duplicacy selects all files (from the file selection set) for backup whose content hash does not match the hash value in the storage. All selected files are packed into chunks. Only those chunks that don’t exist yet will be saved/uploaded to the storage.
Neither of the two: (Timestamp selection, as I understand the current process)
Duplicacy selects all files (from the file selection set) for backup whose timestamp does not match the timestamp in the storage. All selected files are packed into chunks. Only those chunks that don’t exist yet will be saved/uploaded to the storage.
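The three selection modes above could be sketched roughly like this. Note that “forceall” and “filehash” are only the names proposed in this post, not existing Duplicacy options, and the data structures are invented for the example. The timestamp mode here also compares size, following the description earlier in the thread:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalFile:
    size: int
    mtime: int
    content: bytes


@dataclass(frozen=True)
class SnapshotEntry:
    size: int
    mtime: int
    file_hash: str


MODES = ("forceall", "filehash", "timestamp")  # hypothetical mode names


def select_files(local, previous, mode="timestamp"):
    """Return the paths that would go on to be packed into chunks."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    selected = []
    for path in sorted(local):
        current, old = local[path], previous.get(path)
        if old is None or mode == "forceall":
            changed = True  # new file, or everything is forced
        elif mode == "filehash":
            changed = hashlib.sha256(current.content).hexdigest() != old.file_hash
        else:  # timestamp
            changed = (current.size, current.mtime) != (old.size, old.mtime)
        if changed:
            selected.append(path)
    return selected
```

In all three modes the later steps stay the same: selected files are packed into chunks and only chunks missing from the storage are uploaded. The modes differ only in which files reach that point.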

I think that would make the behavior more transparent, keep the current behavior (ideally with a new option name and the current “-hash” option name being slowly deprecated) and also add the file hash behavior that I was expecting.

Droolio · 29 April 2020 13:44

Aside from the benefits, or otherwise, of an alternative ‘new-file’ detection system, I would think it highly unwise to rename a flag that is already well documented and quite frankly perfectly descriptive of what it’s doing… hashing contents rather than relying on metadata.

Secondly, your alternative system still has a potentially big penalty associated with it. Duplicacy would have to hash the entire file before deciding if it’s new. If it’s new, it has to go back and then chunk and hash again.

It also wouldn’t clean up old remnant chunks, because it’s interrupting the normal chunk and pack process - treating the repository as one big tar, with potentially many small files occupying a chunk. Some of those files may no longer be packed together as they should - losing even more efficiency.

The source is on GitHub, you’re welcome to add a pre-file-hash check and test it out, but I honestly think it’ll perform much worse than you think. And it would depend highly on the types of data and how it changes over time. Personally, my main motivation for using -hash (which is only occasional anyway) is making sure no files have been missed (e.g. Excel files), plus to clean out old remnant bits of files.

If you only want to save on CPU time, there’s probably a better solution. Also store the hashes of the chunks as they were before encryption, along with the metadata. As it’s chunking and hashing files, it can skip the compress-and-encrypt stage when it sees a pre-encryption hash it already knows from an earlier backup.
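A minimal sketch of that idea, assuming chunks can be recognised by a hash of their plain content. The XOR “encryption” is only a stand-in and is NOT secure, and `store_chunks` and the dict-based storage are invented for illustration:

```python
import hashlib
import zlib


def toy_encrypt(data, key=0x5A):
    """Placeholder for real encryption: XOR every byte. Not secure."""
    return bytes(b ^ key for b in data)


def store_chunks(chunks, known_plain_hashes, storage):
    """Compress and encrypt only chunks whose plain hash has not been seen before."""
    stats = {"skipped": 0, "processed": 0}
    for chunk in chunks:
        plain_hash = hashlib.sha256(chunk).hexdigest()
        if plain_hash in known_plain_hashes:
            stats["skipped"] += 1  # no compression or encryption needed
            continue
        storage[plain_hash] = toy_encrypt(zlib.compress(chunk))
        known_plain_hashes.add(plain_hash)
        stats["processed"] += 1
    return stats
```

Running the same chunks through `store_chunks` a second time with the same `known_plain_hashes` set processes nothing: every chunk is recognised from its plain hash, so the expensive stage is skipped.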