Very many chunks in copy command - am I doing something wrong?

I’m pretty new to Duplicacy and - mostly - seem to have things figured out. However, I’m really confused about the copy command and how many chunks it wants to send.

My setup (for the purposes of this) is the classic 3-2-1. Onsite backups are performed to a minio instance on a local server. Once those are complete I run a copy command to onedrive. These are bit-identical copies (same encryption key, etc).
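For what it’s worth, the onedrive storage was added as a copy-compatible, bit-identical copy of the minio one, roughly like this (the storage names, snapshot id, and path here are placeholders for my real ones):

```
duplicacy add -e -copy minio -bit-identical onedrive my-snapshot-id one://Duplicacy
```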

For the system in question there were a couple of prior backups (revisions 1 and 2) with about 92500 chunks total. Those took a while to copy because the copy kept getting interrupted by network issues (retries would be really nice :slight_smile:).

It’s the third revision and copy that I’m using as the example. The third backup concludes this way:

2020-05-04 00:44:33.950 INFO BACKUP_END Backup for /Users/g at revision 3 completed
2020-05-04 00:44:33.950 INFO BACKUP_STATS Files: 2644683 total, 538,537M bytes; 1612620 new, 80,842M bytes
2020-05-04 00:44:33.950 INFO BACKUP_STATS File chunks: 111463 total, 538,837M bytes; 13627 new, 65,667M bytes, 51,430M bytes uploaded
2020-05-04 00:44:33.950 INFO BACKUP_STATS Metadata chunks: 465 total, 2,291M bytes; 299 new, 1,496M bytes, 605,098K bytes uploaded
2020-05-04 00:44:33.950 INFO BACKUP_STATS All chunks: 111928 total, 541,128M bytes; 13926 new, 67,163M bytes, 52,021M bytes uploaded
2020-05-04 00:44:33.950 INFO BACKUP_STATS Total running time: 03:27:26

So the backup has 111928 total chunks and 13926 were new. Fair enough (it seems high, really, but there are a lot of garbage files in there that have yet to be weeded out, like the Chrome cache, and this is on a Mac, which seems to love sqlite databases for everything, and that puffs up the changed bytes quite a lot).

Then we get to the copy. I intentionally specified the snapshot and revision on the copy to limit things. The result ends this way:
Copy complete, 88855 total chunks, 13926 chunks copied, 74929 skipped
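For reference, the copy invocation was along these lines (again, the storage and snapshot names stand in for my real ones):

```
duplicacy copy -id my-snapshot-id -r 3 -from minio -to onedrive
```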

Chunks copied looks good - check! But why did it need to try 88855 chunks, skipping 74929, to do this? That took a considerable amount of time. Previous backups were all complete; shouldn’t it know the chunks are there? And how did we get to 88855, which doesn’t seem related to anything about the backup revision in question?

Thanks for help in understanding this - I tried to dig into the forums but didn’t see anything that shed much light on things.

The copy command has to make sure that all chunks to be copied exist on the destination storage. It can’t rely on any local information, because you may have deleted a chunk on the destination storage by other means, e.g. from a different computer.

So it will try to copy all relevant chunks and skip those that already exist.
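In other words, for the revision(s) being copied the logic is roughly shaped like this sketch (just an illustration, not the actual implementation; the chunk IDs are made up):

```go
package main

import "fmt"

// chunkExists stands in for a per-chunk existence check against the
// destination storage (one lookup/request per chunk).
func chunkExists(dest map[string]bool, id string) bool {
	return dest[id]
}

func main() {
	// Hypothetical chunk IDs; in reality these come from the revision's chunk list.
	needed := []string{"aa01", "bb02", "cc03"}
	destination := map[string]bool{"aa01": true} // already present on the destination

	copied, skipped := 0, 0
	for _, id := range needed {
		if chunkExists(destination, id) {
			skipped++ // chunk is already there, skip it
			continue
		}
		// ...upload the chunk here...
		copied++
	}
	fmt.Printf("Copy complete, %d total chunks, %d chunks copied, %d skipped\n",
		len(needed), copied, skipped)
}
```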

I still can’t quite see how the count is 88855; the revision in question had 111928 chunks, so why check 74929 already-copied chunks but ignore the other 23073? As for the rest: on the one hand, checking the storage against externally deleted chunks is certainly reasonable, but then one could make an equally reasonable case that it can’t really trust that the chunks on the destination storage are intact, so perhaps it should checksum them too? There was another thread I did read which essentially said that Duplicacy needs to trust that the destination storage meets the expectations placed on storage, i.e. that it doesn’t delete or corrupt things.

So again, it seems like there’s a lot of wasted effort double-checking the storage to a certain level, while at the same time about 1/5 of the chunks aren’t being rechecked at all. I can see that a certain level of destination-storage paranoia might make one trust the backup that much more, but paranoia with what appear to be blind spots isn’t reassuring.

It’d be nice if the copy command had, say, a “trust-destination-storage” flag that could be turned off every so often for a double-check. In my particular case I know that no other computer will be deleting chunks (because all operations to Onedrive originate on the same intermediate system, the local backup server; none of the actual backup systems touch Onedrive), and I’m willing to take for granted that Microsoft won’t be clobbering them on a whim either.

I will also add that, with a decently fast internet connection (for the US at least), the time to push one client with modest revisions out to Onedrive is running 6 hours or so, most of which is spent checking on the existence of already-copied chunks. Since Duplicacy doesn’t seem to retry even simple network glitches, if that transfer is interrupted it has to start over, re-rechecking the chunks, and simple network glitches do seem to happen.

A few glitches or a few more clients and you can’t update external storage in a day. In other words, being paranoid about external storage has a real cost in delaying off-site replication; by being strict about controlling one type of risk, it introduces another type of risk through that delay.

As a middle ground (and yes, perhaps it would be really hard on some backends, but it’s at least a starting point for discussion), if Duplicacy fetched an entire listing of chunks in one go it would likely enormously speed up the skipping process while still being reasonably certain all the chunks were present.
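Conceptually I’m picturing something like this sketch (obviously not Duplicacy’s actual code, and the chunk IDs are made up): build the “already there” set from one listing, then decide copy/skip locally.

```go
package main

import "fmt"

// listAllChunks stands in for a single recursive listing of the chunks/
// directory on the destination storage.
func listAllChunks() []string {
	return []string{"aa01", "cc03"} // chunk IDs already on the destination
}

func main() {
	// One listing up front builds an in-memory set of what's already there.
	present := make(map[string]bool)
	for _, id := range listAllChunks() {
		present[id] = true
	}

	// After that, copy/skip decisions are local lookups, not network round trips.
	needed := []string{"aa01", "bb02", "cc03"} // chunks referenced by the revision
	copied, skipped := 0, 0
	for _, id := range needed {
		if present[id] {
			skipped++
			continue
		}
		// ...upload the missing chunk here...
		copied++
	}
	fmt.Printf("%d chunks copied, %d skipped\n", copied, skipped)
}
```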

Executing the check command with the -tabular option can give you a better view of the number of chunks (total and exclusive) and files in each revision.
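For example (use whatever name you gave your offsite storage; “onedrive” here is just a placeholder):

```
duplicacy check -tabular -storage onedrive
```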

I always use this option when I run a check command; it shows the “evolution” of revisions very clearly.

Thanks @towerbr - that pinned down the 88855 number (though of course in turn that makes me question the output of the backup command and why the numbers there don’t line up).

My other concerns remain though - it seems like there should be an option to trust the storage / trust that no other computer/source is removing chunks, and if that’s not feasible, pulling a full listing of chunks seems like it would speed things up enormously. My concern is that it won’t take much extra stress on backups to cause the offsite copy to take an unacceptably long time, especially with the copy stopping on any network glitch.

The OneDrive backend should list all chunks in the storage at the beginning of a copy command, so normally it won’t check the existence of each chunk one by one, unless OneDrive doesn’t return the full list.