Why do -hash backups show every single file being uploaded?

dtownsend · 19 February 2019 02:04

I’ve switched to using -hash backups since it seems more accurate and it doesn’t take much time for me. One thing that confuses me though is that in the log I see “INFO UPLOAD_FILE” for every single file, even ones that definitely haven’t changed since the last backup. In the summary it does say that only a few new chunks were uploaded, but it also says that every file was new. Is this some logging issue or what is going on here?

TheBestPessimist · 19 February 2019 06:03

What -hash does is take each file and read it from disk. This means Duplicacy doesn’t just skip over files that haven’t been modified on disk since the latest backup, but actually opens each file and reads all its contents!

That’s the reason you see the info message: Duplicacy opens the file, and the logger tells you what it is doing (albeit with the message “UPLOAD_FILE”).

Why you don’t see many uploaded files is also pretty simple: the files haven’t been modified so their chunks already exist on the storage – so there’s no need to reupload anything.
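
To make that concrete, here is a minimal Go sketch of content-addressed chunk storage. It is not Duplicacy’s actual code: `chunkStore`, `put` and `backupFile` are made-up names, SHA-256 is just a convenient hash, and fixed-size chunks are an assumption to keep it short.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// chunkStore is a hypothetical stand-in for the backup storage: it maps a
// chunk's content hash to the chunk's bytes.
type chunkStore map[string][]byte

// put stores a chunk under its SHA-256 hash. It returns true only when the
// chunk was not already present, i.e. when an upload was really needed.
func (s chunkStore) put(chunk []byte) bool {
	sum := sha256.Sum256(chunk)
	id := hex.EncodeToString(sum[:])
	if _, exists := s[id]; exists {
		return false
	}
	s[id] = append([]byte(nil), chunk...) // copy so the caller can reuse its buffer
	return true
}

// backupFile reads all of data in fixed-size chunks (an assumption made to
// keep the sketch short) and returns how many chunks were read and how many
// had to be uploaded.
func backupFile(s chunkStore, data []byte, chunkSize int) (read, uploaded int) {
	if chunkSize <= 0 {
		return 0, 0
	}
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		read++
		if s.put(data[start:end]) {
			uploaded++
		}
	}
	return read, uploaded
}
```

Calling `backupFile` twice with the same bytes reads every chunk both times, but the second call uploads nothing because every hash is already in the store.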

My suggestion is that you shouldn’t use -hash, since in most cases all it does is slow down your backups and use the HDD more than necessary. If there are programs which don’t update the Last Modified timestamp, then those programs should be shamed and you should complain to their creators.

dtownsend · 19 February 2019 15:40

That’s nice in theory, but when you say that in most cases -hash doesn’t add any benefit, you’re also saying that in some cases it does. I have needed backups for disaster recovery a bunch of times. I rely on any backup tool I use to actually back up all my data, so I’m not going to configure it in a way that makes that less likely.

I guess I’ll just fix duplicacy to do the right thing here.

Droolio · 19 February 2019 23:24

Not using -hash doesn’t make it less likely to include files that have already been backed up. Those files, if the modified date hasn’t changed or if the file hasn’t been deleted, will always be automatically included in the next backup…

I believe the only gotcha is when files are actually modified but the timestamp and size don’t change. AFAIK, only old versions of Excel do this (maybe there are a few other bad programs out there that do, but I don’t know of any). Conversely, if a file somehow got corrupted over time, doing backups without -hash turns out to be better because the backed-up copy remains unaffected.
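
A small Go sketch of that gotcha, assuming the quick check compares only size and modification time. `snapshotEntry` and the helper names are hypothetical, and the timestamp is a plain integer to keep the example self-contained.

```go
package main

import "crypto/sha256"

// snapshotEntry is a hypothetical record of what the previous backup saw.
type snapshotEntry struct {
	Size    int64
	ModTime int64 // modification time as Unix nanoseconds
	Hash    [sha256.Size]byte
}

// record builds the entry a backup would store for the given file state.
func record(data []byte, modTime int64) snapshotEntry {
	return snapshotEntry{Size: int64(len(data)), ModTime: modTime, Hash: sha256.Sum256(data)}
}

// looksUnchanged is the cheap check: same size and same modification time.
// It never opens the file.
func looksUnchanged(prev snapshotEntry, size, modTime int64) bool {
	return prev.Size == size && prev.ModTime == modTime
}

// contentChanged is the expensive check: it needs the file's full contents.
func contentChanged(prev snapshotEntry, data []byte) bool {
	return sha256.Sum256(data) != prev.Hash
}
```

A file rewritten in place with the same length and an untouched timestamp passes `looksUnchanged` even though `contentChanged` reports a difference, which is the case a full read is meant to catch.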

Anyway, I agree there’s probably an issue with the log levels. Does it say “Uploaded blah” in the UPLOAD_FILE line, or does it say “Uploaded file blah”? I’ve been meaning to turn on -hash for my own backups, but only once a week, and the rest without, which I think is a good compromise.

towerbr · 20 February 2019 01:55

For information only, Veracrypt and some other programs that create encrypted containers have the option of not updating the timestamp. In fact, if I remember correctly, this is the default option in Veracrypt.

dtownsend · 20 February 2019 02:50

I’ve seen this crop up a few times over my time working with computers. Yes it is rare but regardless I try to defend against the worst case scenario when setting up my backups.

It says INFO UPLOAD_FILE Uploaded <filename> for every single file in the repository.

It also ends with, for example:

INFO BACKUP_END Backup for /Users/dave at revision 21 completed
INFO BACKUP_STATS Files: 157639 total, 10,467M bytes; 157639 new, 10,467M bytes
INFO BACKUP_STATS File chunks: 2093 total, 10,467M bytes; 40 new, 222,141K bytes, 124,430K bytes uploaded
INFO BACKUP_STATS Metadata chunks: 13 total, 54,485K bytes; 4 new, 14,533K bytes, 5,449K bytes uploaded
INFO BACKUP_STATS All chunks: 2106 total, 10,520M bytes; 44 new, 236,675K bytes, 129,880K bytes uploaded
INFO BACKUP_STATS Total running time: 00:02:06

You can see that the count and size of new files is wrong.

I think I can figure out enough Go to fix this myself though.

Droolio · 20 February 2019 13:20

Hmmm, you’re right. The ‘157639 new’ is obviously wrong there - seems like it’s using a different definition of ‘new’. Could well be a bug.

Anyway, the section of code that spits out those UPLOAD_FILE lines suggests this only happens when -stats is used? If so, I’m not sure why this extra debugging is even needed:

https://github.com/gilbertchen/duplicacy/blob/bebd7c4b7767a09c6906240e0d196268cbcdc4f2/src/duplicacy_backupmanager.go#L630-L634


	if showStatistics && !RunInBackground {
		for _, entry := range uploadedEntries {
			LOG_INFO("UPLOAD_FILE", "Uploaded %s (%d)", entry.Path, entry.Size)
		}
	}

Christoph · 20 February 2019 22:14

Is your complaint the same as mine?

So, I guess what I’m asking is: is this really specific to the -hash option?

dtownsend · 20 February 2019 22:49

I only see my issue on -hash backups.

dtownsend · 21 February 2019 06:31

This seems to do the trick: “Only log changed files when backing up with -hash.” by Mossop (Pull Request #545, gilbertchen/duplicacy on GitHub)

gchen · 21 February 2019 19:29

-hash doesn’t mean file comparison by hashes. Rather, it means that all files are treated as new files. So Duplicacy would just blindly split and upload every file. In this sense those log messages and stats are correct.

Note that to do hash comparison you need to scan local files to get local hashes. In the worst case, you’ll need to scan the file once, only to find out that the hash has changed, so you’ll need another pass to split the file into chunks. In the current implementation only one pass is needed.
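
A rough Go sketch of the single-pass idea, under stated assumptions: fixed-size chunks, SHA-256, and the hypothetical names `splitAndHash` and `hashInOnePass`. It is an illustration only, not the code in Duplicacy.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
)

// splitAndHash reads r exactly once. Every chunk is passed to emit (where a
// backup tool would upload it if it is missing from storage) and is also fed
// into a running whole-file hash, so no second pass is needed.
func splitAndHash(r io.Reader, chunkSize int, emit func(chunk []byte)) (string, error) {
	if chunkSize <= 0 {
		return "", errors.New("chunkSize must be positive")
	}
	h := sha256.New()
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			h.Write(buf[:n]) // hash.Hash.Write never returns an error
			emit(buf[:n])    // emit must not keep buf; it is reused
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return hex.EncodeToString(h.Sum(nil)), nil
		}
		if err != nil {
			return "", err
		}
	}
}

// hashInOnePass is a small helper for trying the sketch on in-memory data.
// It returns the whole-file hash and the number of chunks produced.
func hashInOnePass(data []byte, chunkSize int) (string, int, error) {
	chunks := 0
	sum, err := splitAndHash(bytes.NewReader(data), chunkSize, func([]byte) { chunks++ })
	return sum, chunks, err
}
```

The whole-file hash is only known once the last chunk has been read, which is why whether the file changed can only be decided after the pass has finished.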

dtownsend · 21 February 2019 22:19

I disagree. Sure, that is how the code works, but to me the words “Uploaded” and “new” have meanings that diverge from this. To me “Uploaded” means that some or all of the file’s content was transferred to the backup. But for a file that hasn’t changed since the last upload, nothing will get transferred, as the file’s chunks will already be present in the backup. A “new” file is one that has never been backed up before.

Droolio · 22 February 2019 03:31

I think this comes down to a matter of definitions.

I haven’t followed the code very closely, but it seems your PR is superfluous in the sense that the current implementation is doing all the proper things, and only the stats summary at the end conflicts with your definition of ‘new’ (in the first line at least).

Even if we agree these numbers feel wrong, why not address just the counts and sizes in the stats? Your code seems to go beyond that.

After seeing gchen’s explanation of how Duplicacy works with -hash, it makes sense that, in order to determine if something is ‘new’, it has to treat it as a potentially new file regardless. For performance, it does this in one pass, uploading chunks along the way if necessary. Only after it has fully read the file will it know for certain whether any of it has changed. This already seems to be counted and accurately shown on subsequent stats lines in the log.

But for sure, seeing 157639 new kinda seems unintuitive.

So, for a -hash job only, maybe instead of ‘new’ it should instead say ‘read’ or ‘hashed’? Or maybe omit new completely since this should(?) always be the same as the prior mentioned file count and size. Though perhaps there’s some added reassurance to see these numbers are the same.

In fact, why not change ‘new’ on the first stats line to ‘read’ or ‘hashed’ in both -hash and non-hash runs? This would solve the problem in one line of code.

dtownsend · 22 February 2019 04:29

Other than correcting a few typos in comments that I found along the way, all my code does is address the counts and sizes in the stats and the number of Uploaded file lines it prints.

Yes, once I read the code and understood that, I agreed it was the best way to do it and didn’t want to change it. Which is why my PR only counts the stats after that first pass is complete, so there should be no performance cost.

After the pass we can know for sure whether the file was changed or not, but the existing code doesn’t seem to check that. It just adds every file, whether unchanged, new, or modified, to uploadedEntries and considers those to be the new files when printing out the stats. This is actually true even when -hash isn’t used: if a file’s date changes but its contents don’t change in any way, it is still listed as “new” in the final stats.
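
The kind of accounting being described could look roughly like the Go sketch below. This is not the code from the PR or from Duplicacy: `entryResult`, `fileStats` and `summarize` are hypothetical, and it assumes the backup pass can tell how many previously unseen chunks each file produced. If chunks can span file boundaries, that attribution would only be approximate.

```go
package main

// entryResult is a hypothetical per-file record produced by the backup pass.
type entryResult struct {
	Path      string
	Size      int64
	NewChunks int // chunks from this file that were not already in storage
}

// fileStats separates "files read" from "files that contributed new data".
type fileStats struct {
	ReadFiles    int
	ReadBytes    int64
	ChangedFiles int
	ChangedBytes int64
}

// summarize counts every file as read, but only files with at least one new
// chunk as changed. It also returns the paths of the changed files, which is
// what a per-file log line would be printed for.
func summarize(results []entryResult) (fileStats, []string) {
	var stats fileStats
	var changed []string
	for _, r := range results {
		stats.ReadFiles++
		stats.ReadBytes += r.Size
		if r.NewChunks > 0 {
			stats.ChangedFiles++
			stats.ChangedBytes += r.Size
			changed = append(changed, r.Path)
		}
	}
	return stats, changed
}
```

Because the tally runs over results collected after the read pass, it adds no extra file reads.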

That would be good to do I think, it would certainly make the logs make more sense to me.

Unfortunately though that would not actually solve the core problem that I’m trying to fix, which is to understand how much new data was found during the backup and what files it came from.

Anyway, it sounds like I have different expectations for Duplicacy here so I’ll just have to figure out if there is an alternate way to do what I need here.

towerbr · 22 February 2019 12:12

I understand that this does not solve the whole problem pointed out by @dtownsend , but I think this little change makes sense and, more than that, it is necessary for the log stats to be correct, right?

Some of the doubts here in this topic look like my doubts in this old topic:

Christoph · 22 February 2019 21:04

Sorry to bring my old quibble up again, but this sounds exactly like what I said in the previously mentioned other topic:

So even if your concern is only with the specific case of the -hash option, is this not a more general problem?