Check command - is this execution time typical/expected?

I am OK if the answer is yes; I just want to know whether I have run this correctly. I am running the check command from the WebUI.

6.5 TB on Google Cloud Drive takes over 4 hours to check:

2020-05-28 07:00:01.508 INFO STORAGE_SET Storage set to gcd://backup
2020-05-28 07:00:04.260 INFO SNAPSHOT_CHECK Listing all chunks
2020-05-28 11:07:19.576 INFO SNAPSHOT_CHECK 6 snapshots and 31 revisions
2020-05-28 11:07:19.629 INFO SNAPSHOT_CHECK Total chunk size is 6475G in 1350877 chunks

Can you add -d as a global option to turn on debug logging? This will tell us whether Google is rate limiting the API requests.
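For the CLI, that would look something like the line below (the same options the WebUI passes, as shown in the log in the next post; the storage name is whatever yours is called):

duplicacy -log -d check -storage duplicacy -a -tabular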

It ran this morning with -d. It looks like nearly all of the time is spent in the "Listing all chunks" step; once that completes, the rest of the process finishes in a couple of minutes:

Running check command from /cache/localhost/all
Options: [-log -d check -storage duplicacy -a -tabular]
2020-05-29 07:00:01.262 INFO STORAGE_SET Storage set to gcd://backup
2020-05-29 07:00:01.349 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_DUPLICACY_GCD_TOKEN
2020-05-29 07:00:03.571 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_DUPLICACY_GCD_TOKEN
2020-05-29 07:00:03.571 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_DUPLICACY_PASSWORD
2020-05-29 07:00:04.415 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2020-05-29 07:00:04.610 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2020-05-29 07:00:04.620 INFO CONFIG_INFO Compression level: 100
2020-05-29 07:00:04.620 INFO CONFIG_INFO Average chunk size: 4194304
2020-05-29 07:00:04.620 INFO CONFIG_INFO Maximum chunk size: 16777216
2020-05-29 07:00:04.620 INFO CONFIG_INFO Minimum chunk size: 1048576
2020-05-29 07:00:04.620 INFO CONFIG_INFO Chunk seed: <DELETED>
2020-05-29 07:00:04.620 TRACE CONFIG_INFO Hash key: <DELETED>
2020-05-29 07:00:04.620 TRACE CONFIG_INFO ID key: <DELETED>
2020-05-29 07:00:04.620 TRACE CONFIG_INFO File chunks are encrypted
2020-05-29 07:00:04.620 TRACE CONFIG_INFO Metadata chunks are encrypted
2020-05-29 07:00:04.620 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_DUPLICACY_PASSWORD
2020-05-29 07:00:04.631 DEBUG LIST_PARAMETERS id: , revisions: [], tag: , showStatistics: false, showTabular: true, checkFiles: false, searchFossils: false, resurrect: false
2020-05-29 07:00:04.631 INFO SNAPSHOT_CHECK Listing all chunks
2020-05-29 07:00:04.631 TRACE LIST_FILES Listing chunks/
2020-05-29 10:56:58.042 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
2020-05-29 10:56:58.281 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot tv
2020-05-29 10:56:58.467 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot movies
2020-05-29 10:56:58.916 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/1 from the snapshot cache
2020-05-29 10:56:59.097 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/2 from the snapshot cache
2020-05-29 10:56:59.342 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/3 from the snapshot cache
2020-05-29 10:56:59.544 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/4 from the snapshot cache
2020-05-29 10:56:59.723 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/5 from the snapshot cache
2020-05-29 10:56:59.938 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/movies/6 from the snapshot cache
2020-05-29 10:56:59.938 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot audiobooks
2020-05-29 10:57:00.653 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/1 from the snapshot cache
2020-05-29 10:57:00.829 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/2 from the snapshot cache
2020-05-29 10:57:01.022 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/3 from the snapshot cache
2020-05-29 10:57:01.196 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/4 from the snapshot cache
2020-05-29 10:57:01.498 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/5 from the snapshot cache
2020-05-29 10:57:01.815 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/6 from the snapshot cache
2020-05-29 10:57:02.135 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/7 from the snapshot cache
2020-05-29 10:57:02.329 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/8 from the snapshot cache
2020-05-29 10:57:02.595 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/9 from the snapshot cache
2020-05-29 10:57:02.889 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/10 from the snapshot cache
2020-05-29 10:57:03.141 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/audiobooks/11 from the snapshot cache
2020-05-29 10:57:03.141 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot bootcamp
2020-05-29 10:57:07.788 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/bootcamp/1 from the snapshot cache
2020-05-29 10:57:07.788 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot wedding
2020-05-29 10:57:08.089 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/wedding/1 from the snapshot cache
2020-05-29 10:57:08.090 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot docker
2020-05-29 10:57:08.478 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/docker/4 from the snapshot cache
2020-05-29 10:57:08.660 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/docker/5 from 
... LOTS MORE OF THIS ...
2020-05-29 10:59:34.226 INFO SNAPSHOT_CHECK

I’m afraid that’s typical for Google Cloud Drive. My chunk enumeration for a 50 GB backup takes 5.5 minutes; the same copy on Backblaze B2 only takes 11s.

Realistically GCD isn’t designed to support this kind of usage (as a backup destination). I dislike it for a 50GB backup and I’d hate it for a 6TB backup.

Perhaps G-Suite would be a better service…

This is using my G-Suite …

Same deal with G Suite, both paid and grandfathered free, at least in my experience. Those services are optimized for storing user documents, not million-file data stores, so the design is either not optimized for that use case or made deliberately slow to discourage it.

I’m wondering: what if you try Google File Stream and back up to the local virtual drive? I’d expect the metadata to be synced once, and after that it should be as fast as a local drive. Worth a shot.

I thought of that, but I am running the Docker container on my Synology NAS, and I don’t think I can connect Google File Stream there. I tried sharing the drive from my Windows machine, but I didn’t see the backup folder from the NAS.

I’ll try to find out if there is a way to speed it up.

OK, thanks. I think I have hit some kind of breaking point here: 8.8 TB in 1.8 million chunks is taking over 8 hours to check.

2020-06-02 07:00:01.443 INFO STORAGE_SET Storage set to gcd://backup
2020-06-02 07:00:05.972 INFO SNAPSHOT_CHECK Listing all chunks
2020-06-02 15:22:48.743 INFO SNAPSHOT_CHECK 6 snapshots and 34 revisions
2020-06-02 15:22:48.988 INFO SNAPSHOT_CHECK Total chunk size is 8803G in 1837337 chunks

Is there any way to multi-thread the ListAllFiles() bit, not just for Google Drive but also for other storages (especially SFTP)? This has always been the slowest part of Duplicacy for me when doing things like check and prune, but also copy.

This commit uses multiple threads to list the chunks directory in the Google Drive backend:

For the check command, the number of threads is taken from the -threads option.
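For anyone curious how that kind of fan-out works in general, here is a rough, hypothetical Go sketch, not the actual commit: it assumes the one-level chunk layout visible in the log above ("Chunk read levels: [1]"), where chunks/ holds a set of subdirectories, and hands each subdirectory to a pool of worker goroutines. listDir is a stand-in for whatever per-directory listing call the backend actually makes.

// Hypothetical sketch: fan out listing of the chunks/ subdirectories across
// `threads` worker goroutines. listDir is a placeholder for the backend's
// single-directory listing call; it is NOT Duplicacy's actual API.
package main

import (
	"fmt"
	"sync"
)

func listChunksParallel(subdirs []string, threads int,
	listDir func(dir string) ([]string, error)) ([]string, error) {

	jobs := make(chan string)
	results := make(chan []string)
	errs := make(chan error, len(subdirs))

	var wg sync.WaitGroup
	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for dir := range jobs {
				files, err := listDir(dir)
				if err != nil {
					errs <- err
					continue
				}
				results <- files
			}
		}()
	}

	// Feed the work queue, then close the results channel once all workers finish.
	go func() {
		for _, dir := range subdirs {
			jobs <- dir
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	var all []string
	for files := range results {
		all = append(all, files...)
	}
	select {
	case err := <-errs:
		return nil, err
	default:
	}
	return all, nil
}

func main() {
	// Fake backend for illustration: each "directory" contains two files.
	listDir := func(dir string) ([]string, error) {
		return []string{dir + "/aaaa", dir + "/bbbb"}, nil
	}
	subdirs := []string{"chunks/00", "chunks/01", "chunks/ff"}
	files, err := listChunksParallel(subdirs, 4, listDir)
	fmt.Println(len(files), err)
}

With the commit applied, a run like duplicacy -log check -a -tabular -threads 16 should then use 16 listing threads, since (per the note above) check takes the thread count from -threads.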

I ran a test on a storage containing 988380 chunks. Here are the results showing the speedups:

| Threads | Listing time |
|---------|--------------|
| 1       | 48m46s       |
| 4       | 9m22s        |
| 8       | 4m20s        |
| 16      | 2m16s        |

It looks like Google doesn’t rate limit the listing API.

Wow, that’s pretty impressive. Thanks very much for that commit; I hope to test it out soon!

I am very excited to try this! I tried to build it locally from the latest source, but I got an error:

@:~/go/src/github.com/gilbertchen/duplicacy$ go build duplicacy/duplicacy_main.go
# github.com/minio/blake2b-simd
../../minio/blake2b-simd/compress_amd64.go:21:5: error: reference to undefined name ‘avx2’
  if avx2 {
     ^
../../minio/blake2b-simd/compress_amd64.go:23:12: error: reference to undefined name ‘avx’
  } else if avx {
            ^
../../minio/blake2b-simd/compress_amd64.go:25:12: error: reference to undefined name ‘ssse3’
  } else if ssse3 {
            ^

I am using saspus’s Docker image for the WebUI. Once this makes it in there, I will test it out.

Wow, I have the same problem with Wasabi.
Check operations take the longest to complete: almost 1h to check 500 GB.
@gchen, could we implement multiple threads for checking S3 storage as well? :heart_eyes:

The S3 backend can list the chunks directory recursively, and that is already very efficient, so there is no need to use multiple threads.
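As a rough illustration of why a flat prefix listing is cheap, here is a minimal Go sketch (not Duplicacy's actual code; the bucket name is a placeholder and the Wasabi endpoint is an assumption): a single paginated ListObjectsV2 call with the chunks/ prefix returns every chunk regardless of nesting, up to 1000 keys per page, so ~48K chunks is only about 48 requests.

// Illustrative sketch using the AWS SDK for Go (v1) against an
// S3-compatible endpoint; bucket name and endpoint are placeholders.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		Endpoint: aws.String("https://s3.wasabisys.com"), // assumed Wasabi endpoint
	}))
	svc := s3.New(sess)

	count := 0
	// One recursive listing under the chunks/ prefix, paginated by the SDK.
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket: aws.String("my-backup-bucket"), // placeholder
		Prefix: aws.String("chunks/"),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		count += len(page.Contents)
		return true // keep fetching pages
	})
	if err != nil {
		fmt.Println("listing failed:", err)
		return
	}
	fmt.Println("chunks listed:", count)
}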

On my 235G Wasabi storage it only takes 23 seconds to list 48K chunks:

2020-06-15 21:00:08.600 TRACE LIST_FILES Listing chunks/
2020-06-15 21:00:31.444 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
...
2020-06-15 21:00:35.822 INFO SNAPSHOT_CHECK Total chunk size is 235,363M in 47917 chunks

That’s strange. For me, with about 500 GB, it takes 30 minutes to list 98451 chunks and another 30 minutes to check all the snapshots and revisions, so a total of 1 hour.
What could be wrong, @gchen?

Just ran check -d:

2020-06-17 18:46:55.515 TRACE LIST_FILES Listing chunks/
2020-06-17 18:47:39.364 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
2020-06-17 19:18:40.592 INFO SNAPSHOT_CHECK Total chunk size is 427,554M in 105613 chunks

It takes 31 minutes for 432 GB.

What am I doing wrong?

Maybe nothing. Which region are you using?

As another Wasabi data point (us-west-1 region, -log check -fossils -resurrect -a -tabular), listing all chunks takes me a little more than 4 minutes for ~370000 chunks (~1.6 TB). The rest of the check operation takes ~8 minutes.

Hmm, the time between ‘Listing chunks/’ and ‘Listing all snapshot ids’ in your log is 44 seconds.

How many snapshots does your storage have? Because my recent GCD check, with -d, has a line in there about this, before the ‘Total chunk size’ bit:

2020-06-16 15:48:30.272 INFO SNAPSHOT_CHECK Listing all chunks
2020-06-16 16:41:00.007 INFO SNAPSHOT_CHECK 13 snapshots and 1910 revisions
2020-06-16 16:41:00.011 INFO SNAPSHOT_CHECK Total chunk size is 1251G in 320761 chunks

I’m using Wasabi us-east-1.