Duplicacy check taking unreasonably long time

new_to_dupli · 11 June 2019 03:38

I have a hierarchy of about 270k files, taking about 20 GB (many small files).
On Linux (Ubuntu 19.04, ext4), it takes 7 minutes to back them up and 6 minutes to restore them. That’s very fast.
But checking the storage with duplicacy check --files --stats takes 2 hours! For only one revision. Without --files it takes only a tenth of a second.

What could --files be doing that takes so long? My understanding is that it reads the contents of each file from the archive, computes its hash, and checks that against the hash stored for the file in the archive. How could that possibly take so long, when the backup operation itself, which also fully reads the contents of each file and calculates the hash in order to store it in the archive, takes only 7 minutes? Granted, backup reads the file contents from the file system rather than from the duplicacy archive, but that shouldn’t make such a big difference, since the restore operation, which does read file contents from the archive, also takes very little time.

For the same files, stored on NTFS and running duplicacy on Windows, I get 18 mins for the backup, 32 mins to restore, but over 2 hours to check. I’m fine with the backup and restore times being a bit longer than on Linux, but again the check is unjustifiably slow.

Also, for the same files, on Linux, borg check with --verify-data (the slowest option) takes only 3.5 minutes (and similar backup/restore times to duplicacy).

This is all backing up from an SSD onto another local drive, so no network/internet download/upload involved.

Here is the output on Linux:

meubuntu@meubuntu-Virtual-Machine:~/mine/Me$ time duplicacy check --files --stats
		Storage set to /home/meubuntu/mine/duplicacy_backup1_native
		Listing all chunks
		1 snapshots and 1 revisions
		Total chunk size is 10,783M in 3125 chunks
		All files in snapshot e_me_native at revision 1 have been successfully verified
		Snapshot e_me_native at revision 1: 270090 files (19,553M bytes), 10,783M total chunk bytes, 0 unique chunk bytes
		Snapshot e_me_native all revisions: 10,783M total chunk bytes, 10,783M unique chunk bytes

		real	120m31.079s
		user	132m53.507s
		sys	3m15.797s


meubuntu@meubuntu-Virtual-Machine:~/mine/Me$ time duplicacy check --stats
		Storage set to /home/meubuntu/mine/duplicacy_backup1_native
		Listing all chunks
		1 snapshots and 1 revisions
		Total chunk size is 10,783M in 3125 chunks
		All chunks referenced by snapshot e_me_native at revision 1 exist
		Snapshot e_me_native at revision 1: 270090 files (19,553M bytes), 10,783M total chunk bytes, 10,783M unique chunk bytes
		Snapshot e_me_native all revisions: 10,783M total chunk bytes, 10,783M unique chunk bytes

		real	0m0.147s
		user	0m0.063s
		sys	0m0.039s

new_to_dupli · 11 June 2019 20:19

I repeated this with a folder of only 7 big files, also totalling about 20 GB as in the previous experiment. Check with --files took 5 minutes (backup took 6 mins). So it definitely has to do with the number of files. You would think that restore would also be affected by the number of files, but nope, it runs very fast.
Is there a check of the hash for each file performed when the restore command is run? Cause in that case one could take the restore routine, comment out the code that actually creates the file, and voila, you have a check command that takes under 6 minutes, instead of 2 hours.

For what it’s worth the resulting duplicacy repo has 3,146 files (mostly the chunks) for the case with the many small files, and 4,779 files for the 7 big files case. There was some duplication in the first case. But either way it looks like the slowness in the case of many files doesn’t have to do with the number of chunks having to be read, but with the files stored in them.

leerspace · 11 June 2019 22:54

I think this issue from GitHub might be related, especially since you have a lot of files.

This one might be related if you have a bunch of revisions.

new_to_dupli · 12 June 2019 05:09

Thank you, they definitely seem related, even though my backups are done on a local drive. In my case, downloading the same chunk repeatedly would amount to loading it into RAM repeatedly.
For the experiments that I posted, I only had one revision, but I did make a test with two revisions, where the second only has one or two changes, and in one check command for the whole storage, still each revision is checked independently and each takes 2 hours.

But this I think is a different issue from the one where chunks are loaded repeatedly inside the checking of one revision.

gchen · 13 June 2019 18:57

There is a bug in check -files that caused one chunk to be downloaded multiple times, if the chunk is shared by multiple small files. I just checked in a commit to fix it: Check -files may download a chunk multple times · gilbertchen/duplicacy@4da7f7b · GitHub

new_to_dupli · 13 June 2019 23:10

Thanks for looking at it and fixing it! I tested with the current master and the check now takes 2 minutes for one revision, complete with the --files option.
I looked a bit at the code and it seems that the implementation depends on the files being checked in the order in which their contents are stored in the chunks. If they were checked in some other order, a chunk would again be loaded over and over whenever a different chunk had been loaded in the meantime. It seems a little brittle, so hopefully this doesn’t break with later code changes.

If I have two revisions, the second being an incremental with zero changes, it takes another 2 minutes. So the problem of the same file being checked once per revision, instead of just once overall, hasn’t been fixed. Should I post a new forum thread, or is the ticket on GitHub enough to report that?

Also, is it worth documenting whether a restore operation checks each file’s hash and warns the user if it doesn’t match, or is it up to the user to run a check --files on the archive before doing a restore?

TheBestPessimist · 14 June 2019 04:57

I guess the optimal solution here would be to keep in memory a “visited map” with key = chunk and value = true/false representing “the chunk exists”. Not sure about the memory usage, but I would expect 1_000_000_000_000 strings + bool to not consume that much memory (maybe I remember wrong, but that chunk list is already created in the “Listing all chunks” phase).

gchen · 14 June 2019 13:57

No, the file list is always sorted by file names so files will always be checked in the same order.

Once a revision has passed check -files, you won’t need to check it again, so in most cases you don’t need to run check -files on multiple revisions at once. If you do, the most efficient method is to add a local storage, copy the revisions to be checked to the local storage, and then run the check on the local storage. This is guaranteed to download the minimum number of chunks.

Yes, a restore operation will error out if the file hashes do not match. No need to run check -files beforehand.