[perf] GCD `Listing all chunks` step is very very slow


#1

Later edit: This is a known issue: Duplicacy 'check' command can be very slow · Issue #397 · gilbertchen/duplicacy · GitHub


I have a GCD storage with around 5TB of data.

Each time i run backup, check or prune duplicacy hangs at the step Listing all chunks for more than 1 hour.
CPU utilisation is 0, network usage is 0, only the RAM usage increases as more chunks are listed from the storage.

I don’t know if all the storages have this issue or it’s just gdrive but is there a way to speedup this step? Maybe (if it’s not used) use batching?

Here’s the first backup of a new repository for reference:

It's a long debug output
./z.exe -d -log backup -stats -threads 20
2019-01-07 12:05:30.124 INFO STORAGE_SET Storage set to gcd://backups/duplicacy
2019-01-07 12:05:30.124 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 12:05:30.125 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 12:05:33.072 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 12:05:33.073 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 12:05:33.073 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 12:05:33.073 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 12:05:33.880 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2019-01-07 12:05:34.498 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2019-01-07 12:05:34.525 INFO CONFIG_INFO Compression level: 100
2019-01-07 12:05:34.526 INFO CONFIG_INFO Average chunk size: 4194304
2019-01-07 12:05:34.526 INFO CONFIG_INFO Maximum chunk size: 16777216
2019-01-07 12:05:34.526 INFO CONFIG_INFO Minimum chunk size: 1048576
2019-01-07 12:05:34.526 INFO CONFIG_INFO Chunk seed: hue
2019-01-07 12:05:34.526 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 12:05:34.527 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 12:05:34.527 DEBUG BACKUP_PARAMETERS top: /root/duplicacy-repositories/chat.tbp.land, quick: true, tag:
2019-01-07 12:05:34.527 TRACE SNAPSHOT_DOWNLOAD_LATEST Downloading latest revision for snapshot chat.tbp.land
2019-01-07 12:05:34.527 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot chat.tbp.land
2019-01-07 12:05:35.261 INFO BACKUP_START No previous backup found
2019-01-07 12:05:35.262 INFO BACKUP_INDEXING Indexing /root/duplicacy-repositories/chat.tbp.land
2019-01-07 12:05:35.262 DEBUG LIST_ENTRIES Listing
2019-01-07 12:05:35.263 DEBUG INCOMPLETE_LOCATE Failed to locate incomplete snapshot: open /root/duplicacy-repositories/chat.tbp.land/.duplicacy/incomplete: no such file or directory
2019-01-07 12:05:35.263 INFO BACKUP_LIST Listing all chunks
2019-01-07 12:05:35.263 TRACE LIST_FILES Listing chunks/
[...............................................waiting for 1 hour..........................................]
2019-01-07 13:09:58.267 TRACE PACK_START Packing z.exe
2019-01-07 13:09:58.268 INFO BACKUP_THREADS Use 20 uploading threads
2019-01-07 13:09:58.901 INFO PACK_END Packed z.exe (26327201)
2019-01-07 13:10:00.537 DEBUG CHUNK_UPLOAD Chunk 67033008952f94ab4931bb16f5dfa49ce7e76cceb689d77ab0391ac44545ae58 has been uploaded
2019-01-07 13:10:00.537 INFO UPLOAD_PROGRESS Uploaded chunk 2 size 1929475, 942KB/s 00:00:26 7.3%
2019-01-07 13:10:01.725 DEBUG CHUNK_UPLOAD Chunk 997f6590e8072553ba3ddbc27a5425afad5627c712d426bc459d5ad3b0ce6d33 has been uploaded
2019-01-07 13:10:01.726 INFO UPLOAD_PROGRESS Uploaded chunk 3 size 12246480, 4.51MB/s 00:00:03 53.8%
2019-01-07 13:10:01.726 DEBUG CHUNK_UPLOAD Chunk 85b900d5eb6ec8fdd1f040d869b839b57172fa45c4d7ed62742204636063a200 has been uploaded
2019-01-07 13:10:01.730 INFO UPLOAD_PROGRESS Uploaded chunk 1 size 3235212, 5.53MB/s 00:00:02 66.1%
2019-01-07 13:10:01.726 DEBUG CHUNK_UPLOAD Chunk 887469c3e5c66253bdcece8c4b8c4a274ca00973f03e6c7c0c85520fd05b37b6 has been uploaded
2019-01-07 13:10:01.731 INFO UPLOAD_PROGRESS Uploaded chunk 5 size 3675831, 6.70MB/s 00:00:01 80.0%
2019-01-07 13:10:02.030 DEBUG CHUNK_UPLOAD Chunk 4582dde83a9d818e952ea17051475a8d936d2f58483560ec6e4a751dea9eb8d2 has been uploaded
2019-01-07 13:10:02.030 INFO UPLOAD_PROGRESS Uploaded chunk 6 size 2924559, 5.72MB/s 00:00:01 91.2%
2019-01-07 13:10:02.031 DEBUG CHUNK_UPLOAD Chunk acd8795ae1d7214f7b89a8f06138adff320313b08a29521b7badbc2b5477d212 has been uploaded
2019-01-07 13:10:02.031 INFO UPLOAD_PROGRESS Uploaded chunk 4 size 2315644, 6.28MB/s 00:00:01 100.0%
2019-01-07 13:10:02.093 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been saved to the snapshot cache
2019-01-07 13:10:02.094 DEBUG CHUNK_CACHE Chunk 09277b99e21165a5d240796cf6f5b7e785f482bfdb4a3f25aa51c0c5f78b3532 has been saved to the snapshot cache
2019-01-07 13:10:02.095 DEBUG CHUNK_CACHE Chunk d1ac0118ae56f8a7fc75e7e2dfd6d0df48180e36a7452db8ebfecaa7822699ad has been saved to the snapshot cache
2019-01-07 13:10:03.807 DEBUG CHUNK_UPLOAD Chunk 09277b99e21165a5d240796cf6f5b7e785f482bfdb4a3f25aa51c0c5f78b3532 has been uploaded
2019-01-07 13:10:04.192 DEBUG CHUNK_UPLOAD Chunk d1ac0118ae56f8a7fc75e7e2dfd6d0df48180e36a7452db8ebfecaa7822699ad has been uploaded
2019-01-07 13:10:04.193 DEBUG CHUNK_UPLOAD Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been uploaded
2019-01-07 13:10:04.197 DEBUG UPLOAD_FILE_CACHE Saved file snapshots/chat.tbp.land/1 to the snapshot cache
2019-01-07 13:10:05.615 DEBUG UPLOAD_FILE Uploaded file snapshots/chat.tbp.land/1
2019-01-07 13:10:05.615 INFO UPLOAD_FILE Uploaded z.exe (26327201)
2019-01-07 13:10:05.616 TRACE LIST_FILES Listing snapshots/
2019-01-07 13:10:05.617 TRACE LIST_FILES Listing snapshots/chat.tbp.land/
2019-01-07 13:10:05.618 DEBUG SNAPSHOT_CLEAN Snapshot chat.tbp.land revision 1 needs chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/d1/
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/72/
2019-01-07 13:10:05.620 TRACE LIST_FILES Listing chunks/09/
2019-01-07 13:10:05.620 INFO BACKUP_END Backup for /root/duplicacy-repositories/chat.tbp.land at revision 1 completed
2019-01-07 13:10:05.620 INFO BACKUP_STATS Files: 1 total, 25,710K bytes; 1 new, 25,710K bytes
2019-01-07 13:10:05.620 INFO BACKUP_STATS File chunks: 6 total, 25,710K bytes; 6 new, 25,710K bytes, 10,295K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS Metadata chunks: 3 total, 635 bytes; 3 new, 635 bytes, 1K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS All chunks: 9 total, 25,710K bytes; 9 new, 25,710K bytes, 10,297K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS Total running time: 01:04:31

And here’s the check after that first backup:

Again full debug output
./z.exe -d -log check -tabular
2019-01-07 13:17:45.731 INFO STORAGE_SET Storage set to gcd://backups/duplicacy
2019-01-07 13:17:45.732 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 13:17:45.733 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 13:17:48.406 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 13:17:48.406 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 13:17:48.407 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 13:17:48.407 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 13:17:49.400 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2019-01-07 13:17:49.684 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2019-01-07 13:17:49.709 INFO CONFIG_INFO Compression level: 100
2019-01-07 13:17:49.710 INFO CONFIG_INFO Average chunk size: 4194304
2019-01-07 13:17:49.710 INFO CONFIG_INFO Maximum chunk size: 16777216
2019-01-07 13:17:49.710 INFO CONFIG_INFO Minimum chunk size: 1048576
2019-01-07 13:17:49.710 INFO CONFIG_INFO Chunk seed: hue
2019-01-07 13:17:49.711 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 13:17:49.711 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 13:17:49.711 DEBUG LIST_PARAMETERS id: chat.tbp.land, revisions: [], tag: , showStatistics: false, checkFiles: false, searchFossils: false, resurrect: false
2019-01-07 13:17:49.711 INFO SNAPSHOT_CHECK Listing all chunks
2019-01-07 13:17:49.711 TRACE LIST_FILES Listing chunks/
[...................................................]
2019-01-07 13:36:01.154 DEBUG GCD_RETRY [0] Get https://www.googleapis.com/drive/v3/files?alt=json&fields=nextPageToken%2Cfiles%28name%2C+mimeType%2C+id%2C+size%29&pageSize=1000&pageToken=hue&q=%271HdxxmkHiA87dhSUq08VTw3iUqcjK442N%27+in+parents+and+trashed+%3D+false+: unexpected EOF; retrying after 2.92 seconds (backoff: 2, attempts: 1)
[...................................................]
2019-01-07 14:02:12.045 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot chat.tbp.land
2019-01-07 14:02:12.822 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/chat.tbp.land/1 from the snapshot cache
2019-01-07 14:02:12.824 INFO SNAPSHOT_CHECK 1 snapshots and 1 revisions
2019-01-07 14:02:12.846 INFO SNAPSHOT_CHECK Total chunk size is 2439G in 522249 chunks
2019-01-07 14:02:12.850 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been loaded from the snapshot cache
2019-01-07 14:02:12.851 INFO SNAPSHOT_CHECK All chunks referenced by snapshot chat.tbp.land at revision 1 exist
2019-01-07 14:02:12.852 DEBUG DOWNLOAD_FETCH Fetching chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e
2019-01-07 14:02:12.852 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been loaded from the snapshot cache
2019-01-07 14:02:12.853 INFO SNAPSHOT_CHECK
          snap | rev |                          | files |   bytes | chunks |   bytes | uniq |   bytes | new |   bytes |
 chat.tbp.land |   1 | @ 2019-01-07 12:05 -hash |     1 | 25,710K |      9 | 10,297K |    9 | 10,297K |   9 | 10,297K |
 chat.tbp.land | all |                          |       |         |      9 | 10,297K |    9 | 10,297K |     |         |

#2

Out of curiosity I’ve tried to lookup QPSs and QPDs for GCS and GCD and white there are a lot of specs posted for GCS there is not much for GCD and my impression is that the REST API Google provides for GCD is intended for very lightweight infrequent use – such as upload a picture or sync a document in the application and therefore the quotas and rates are designed to support this case and prevent abuse – which using GCD as a backup backend probably is. (Therefore there is no need to specify them in terms of requests per seconds type of metric – for the intended use that metric does not make much sense and probably that’s why there no information in the quotas and rates for this service)

This would be totally reasonable provided GCS is available with much more lax and fully published limits and performance to suit this type of use.

I had the same crap experience with other backup tools that support OneDrive and DropBox destinations as well (Synology’s HyperBackup being one of them). Performance ether sucked or requests (including and especially List/Enumerate) would time out; it was absolutely unusable with either of those services on anything but trivially sized datasets. And yet, the same software did not have any issues with Azure and S3 type of services.

Therefore while advertising support for those services there should be a disclaimer somewhere that those should not be use with “large” datasets. But people will ignore that anyway and then complain about poor performance.

In my opinion duplicacy (and all other backup tools for that matter) shall consider dropping support (Or at least hiding it under Advanced/Beware/YouOnYourOwnNow settings) for OneDrive, Google Drive and DropBox, (and perhaps few other “Drive services” (Amazon Drive being the exception, which seems to be ridiculously fast, but is now closed for third party integrations (wonder why :)) and also forever broken for duplicacy unfortunately) because (again, in my personal opinion) every feature should either provide ideal user experience or not be offered at all, especially those that are likely to be (ab)used by users – every one has GD or DropBox so that is what people would try first… I did…


#3

This is a known issue: Duplicacy 'check' command can be very slow · Issue #397 · gilbertchen/duplicacy · GitHub

Last time I checked, it looked like GCD didn’t rate limit the listing API call by returning 403 errors (Handling API Errors  |  Drive REST API  |  Google Developers); instead, they just throttled the connections to intentionally slow down the responses. So, the only way to speed it up is perhaps using multiple listing threads.


#4

Although I remember that issue, i didn’t remember it having the same reason, that’s why i created this issue on the forum.

I have noticed that Google Drive File Stream is extremely fast the first time i log in (eg. after installing on a new pc). By fast i mean that it lists/caches my gdrive folder (with all the 8TB of files) in less than 10 minutes. I don’t think that speed can come just from multithreading the list requests.

So although multithreading should be added (as it will help every storage backend), for gdrive i’m thinking that batching may also help.