[perf] GCD `Listing all chunks` step is very very slow

Later edit: This is a known issue: Duplicacy 'check' command can be very slow · Issue #397 · gilbertchen/duplicacy · GitHub


I have a GCD storage with around 5TB of data.

Each time I run backup, check or prune, Duplicacy hangs at the Listing all chunks step for more than an hour.
CPU utilisation is 0 and network usage is 0; only RAM usage increases as more chunks are listed from the storage.

I don’t know whether all storage backends have this issue or just Google Drive, but is there a way to speed up this step? Maybe (if it isn’t already used) batching?

Here’s the first backup of a new repository for reference:

It's a long debug output
./z.exe -d -log backup -stats -threads 20
2019-01-07 12:05:30.124 INFO STORAGE_SET Storage set to gcd://backups/duplicacy
2019-01-07 12:05:30.124 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 12:05:30.125 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 12:05:33.072 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 12:05:33.073 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 12:05:33.073 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 12:05:33.073 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 12:05:33.880 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2019-01-07 12:05:34.498 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2019-01-07 12:05:34.525 INFO CONFIG_INFO Compression level: 100
2019-01-07 12:05:34.526 INFO CONFIG_INFO Average chunk size: 4194304
2019-01-07 12:05:34.526 INFO CONFIG_INFO Maximum chunk size: 16777216
2019-01-07 12:05:34.526 INFO CONFIG_INFO Minimum chunk size: 1048576
2019-01-07 12:05:34.526 INFO CONFIG_INFO Chunk seed: hue
2019-01-07 12:05:34.526 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 12:05:34.527 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 12:05:34.527 DEBUG BACKUP_PARAMETERS top: /root/duplicacy-repositories/chat.tbp.land, quick: true, tag:
2019-01-07 12:05:34.527 TRACE SNAPSHOT_DOWNLOAD_LATEST Downloading latest revision for snapshot chat.tbp.land
2019-01-07 12:05:34.527 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot chat.tbp.land
2019-01-07 12:05:35.261 INFO BACKUP_START No previous backup found
2019-01-07 12:05:35.262 INFO BACKUP_INDEXING Indexing /root/duplicacy-repositories/chat.tbp.land
2019-01-07 12:05:35.262 DEBUG LIST_ENTRIES Listing
2019-01-07 12:05:35.263 DEBUG INCOMPLETE_LOCATE Failed to locate incomplete snapshot: open /root/duplicacy-repositories/chat.tbp.land/.duplicacy/incomplete: no such file or directory
2019-01-07 12:05:35.263 INFO BACKUP_LIST Listing all chunks
2019-01-07 12:05:35.263 TRACE LIST_FILES Listing chunks/
[...............................................waiting for 1 hour..........................................]
2019-01-07 13:09:58.267 TRACE PACK_START Packing z.exe
2019-01-07 13:09:58.268 INFO BACKUP_THREADS Use 20 uploading threads
2019-01-07 13:09:58.901 INFO PACK_END Packed z.exe (26327201)
2019-01-07 13:10:00.537 DEBUG CHUNK_UPLOAD Chunk 67033008952f94ab4931bb16f5dfa49ce7e76cceb689d77ab0391ac44545ae58 has been uploaded
2019-01-07 13:10:00.537 INFO UPLOAD_PROGRESS Uploaded chunk 2 size 1929475, 942KB/s 00:00:26 7.3%
2019-01-07 13:10:01.725 DEBUG CHUNK_UPLOAD Chunk 997f6590e8072553ba3ddbc27a5425afad5627c712d426bc459d5ad3b0ce6d33 has been uploaded
2019-01-07 13:10:01.726 INFO UPLOAD_PROGRESS Uploaded chunk 3 size 12246480, 4.51MB/s 00:00:03 53.8%
2019-01-07 13:10:01.726 DEBUG CHUNK_UPLOAD Chunk 85b900d5eb6ec8fdd1f040d869b839b57172fa45c4d7ed62742204636063a200 has been uploaded
2019-01-07 13:10:01.730 INFO UPLOAD_PROGRESS Uploaded chunk 1 size 3235212, 5.53MB/s 00:00:02 66.1%
2019-01-07 13:10:01.726 DEBUG CHUNK_UPLOAD Chunk 887469c3e5c66253bdcece8c4b8c4a274ca00973f03e6c7c0c85520fd05b37b6 has been uploaded
2019-01-07 13:10:01.731 INFO UPLOAD_PROGRESS Uploaded chunk 5 size 3675831, 6.70MB/s 00:00:01 80.0%
2019-01-07 13:10:02.030 DEBUG CHUNK_UPLOAD Chunk 4582dde83a9d818e952ea17051475a8d936d2f58483560ec6e4a751dea9eb8d2 has been uploaded
2019-01-07 13:10:02.030 INFO UPLOAD_PROGRESS Uploaded chunk 6 size 2924559, 5.72MB/s 00:00:01 91.2%
2019-01-07 13:10:02.031 DEBUG CHUNK_UPLOAD Chunk acd8795ae1d7214f7b89a8f06138adff320313b08a29521b7badbc2b5477d212 has been uploaded
2019-01-07 13:10:02.031 INFO UPLOAD_PROGRESS Uploaded chunk 4 size 2315644, 6.28MB/s 00:00:01 100.0%
2019-01-07 13:10:02.093 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been saved to the snapshot cache
2019-01-07 13:10:02.094 DEBUG CHUNK_CACHE Chunk 09277b99e21165a5d240796cf6f5b7e785f482bfdb4a3f25aa51c0c5f78b3532 has been saved to the snapshot cache
2019-01-07 13:10:02.095 DEBUG CHUNK_CACHE Chunk d1ac0118ae56f8a7fc75e7e2dfd6d0df48180e36a7452db8ebfecaa7822699ad has been saved to the snapshot cache
2019-01-07 13:10:03.807 DEBUG CHUNK_UPLOAD Chunk 09277b99e21165a5d240796cf6f5b7e785f482bfdb4a3f25aa51c0c5f78b3532 has been uploaded
2019-01-07 13:10:04.192 DEBUG CHUNK_UPLOAD Chunk d1ac0118ae56f8a7fc75e7e2dfd6d0df48180e36a7452db8ebfecaa7822699ad has been uploaded
2019-01-07 13:10:04.193 DEBUG CHUNK_UPLOAD Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been uploaded
2019-01-07 13:10:04.197 DEBUG UPLOAD_FILE_CACHE Saved file snapshots/chat.tbp.land/1 to the snapshot cache
2019-01-07 13:10:05.615 DEBUG UPLOAD_FILE Uploaded file snapshots/chat.tbp.land/1
2019-01-07 13:10:05.615 INFO UPLOAD_FILE Uploaded z.exe (26327201)
2019-01-07 13:10:05.616 TRACE LIST_FILES Listing snapshots/
2019-01-07 13:10:05.617 TRACE LIST_FILES Listing snapshots/chat.tbp.land/
2019-01-07 13:10:05.618 DEBUG SNAPSHOT_CLEAN Snapshot chat.tbp.land revision 1 needs chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/d1/
2019-01-07 13:10:05.619 TRACE LIST_FILES Listing chunks/72/
2019-01-07 13:10:05.620 TRACE LIST_FILES Listing chunks/09/
2019-01-07 13:10:05.620 INFO BACKUP_END Backup for /root/duplicacy-repositories/chat.tbp.land at revision 1 completed
2019-01-07 13:10:05.620 INFO BACKUP_STATS Files: 1 total, 25,710K bytes; 1 new, 25,710K bytes
2019-01-07 13:10:05.620 INFO BACKUP_STATS File chunks: 6 total, 25,710K bytes; 6 new, 25,710K bytes, 10,295K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS Metadata chunks: 3 total, 635 bytes; 3 new, 635 bytes, 1K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS All chunks: 9 total, 25,710K bytes; 9 new, 25,710K bytes, 10,297K bytes uploaded
2019-01-07 13:10:05.621 INFO BACKUP_STATS Total running time: 01:04:31

And here’s the check after that first backup:

Again full debug output
./z.exe -d -log check -tabular
2019-01-07 13:17:45.731 INFO STORAGE_SET Storage set to gcd://backups/duplicacy
2019-01-07 13:17:45.732 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 13:17:45.733 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 13:17:48.406 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_GCD_TOKEN
2019-01-07 13:17:48.406 DEBUG PASSWORD_PREFERENCE Reading gcd_token from preferences
2019-01-07 13:17:48.407 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 13:17:48.407 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 13:17:49.400 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2019-01-07 13:17:49.684 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2019-01-07 13:17:49.709 INFO CONFIG_INFO Compression level: 100
2019-01-07 13:17:49.710 INFO CONFIG_INFO Average chunk size: 4194304
2019-01-07 13:17:49.710 INFO CONFIG_INFO Maximum chunk size: 16777216
2019-01-07 13:17:49.710 INFO CONFIG_INFO Minimum chunk size: 1048576
2019-01-07 13:17:49.710 INFO CONFIG_INFO Chunk seed: hue
2019-01-07 13:17:49.711 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2019-01-07 13:17:49.711 DEBUG PASSWORD_PREFERENCE Reading password from preferences
2019-01-07 13:17:49.711 DEBUG LIST_PARAMETERS id: chat.tbp.land, revisions: [], tag: , showStatistics: false, checkFiles: false, searchFossils: false, resurrect: false
2019-01-07 13:17:49.711 INFO SNAPSHOT_CHECK Listing all chunks
2019-01-07 13:17:49.711 TRACE LIST_FILES Listing chunks/
[...................................................]
2019-01-07 13:36:01.154 DEBUG GCD_RETRY [0] Get https://www.googleapis.com/drive/v3/files?alt=json&fields=nextPageToken%2Cfiles%28name%2C+mimeType%2C+id%2C+size%29&pageSize=1000&pageToken=hue&q=%271HdxxmkHiA87dhSUq08VTw3iUqcjK442N%27+in+parents+and+trashed+%3D+false+: unexpected EOF; retrying after 2.92 seconds (backoff: 2, attempts: 1)
[...................................................]
2019-01-07 14:02:12.045 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot chat.tbp.land
2019-01-07 14:02:12.822 DEBUG DOWNLOAD_FILE_CACHE Loaded file snapshots/chat.tbp.land/1 from the snapshot cache
2019-01-07 14:02:12.824 INFO SNAPSHOT_CHECK 1 snapshots and 1 revisions
2019-01-07 14:02:12.846 INFO SNAPSHOT_CHECK Total chunk size is 2439G in 522249 chunks
2019-01-07 14:02:12.850 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been loaded from the snapshot cache
2019-01-07 14:02:12.851 INFO SNAPSHOT_CHECK All chunks referenced by snapshot chat.tbp.land at revision 1 exist
2019-01-07 14:02:12.852 DEBUG DOWNLOAD_FETCH Fetching chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e
2019-01-07 14:02:12.852 DEBUG CHUNK_CACHE Chunk 723ba6b35f29ad2b2fc4014c1bc504d416173653fef149ccb6d5041976fcb05e has been loaded from the snapshot cache
2019-01-07 14:02:12.853 INFO SNAPSHOT_CHECK
          snap | rev |                          | files |   bytes | chunks |   bytes | uniq |   bytes | new |   bytes |
 chat.tbp.land |   1 | @ 2019-01-07 12:05 -hash |     1 | 25,710K |      9 | 10,297K |    9 | 10,297K |   9 | 10,297K |
 chat.tbp.land | all |                          |       |         |      9 | 10,297K |    9 | 10,297K |     |         |

Out of curiosity I tried to look up QPS and QPD limits for GCS and GCD. While there are a lot of specs posted for GCS, there is not much for GCD. My impression is that the REST API Google provides for GCD is intended for very lightweight, infrequent use, such as uploading a picture or syncing a document from an application, and the quotas and rates are therefore designed to support that case and prevent abuse (which using GCD as a backup backend probably is). That would also explain why there is no requests-per-second type of metric published for this service: for the intended use, that metric does not make much sense.

This would be totally reasonable, given that GCS is available with much more lax, fully published limits and with performance suited to this type of use.

I had the same crap experience with other backup tools that support OneDrive and Dropbox destinations as well (Synology’s HyperBackup being one of them). Performance either sucked or requests (including and especially List/Enumerate) would time out; it was absolutely unusable with either of those services on anything but trivially sized datasets. And yet, the same software did not have any issues with Azure- and S3-type services.

Therefore, while advertising support for those services, there should be a disclaimer somewhere that they should not be used with “large” datasets. But people will ignore that anyway and then complain about poor performance.

In my opinion, Duplicacy (and all other backup tools, for that matter) should consider dropping support for OneDrive, Google Drive, and Dropbox, or at least hiding it under Advanced/Beware/YouAreOnYourOwnNow settings, and perhaps for a few other “Drive” services as well. (Amazon Drive is the exception: it seems ridiculously fast, but is now closed to third-party integrations (wonder why :)) and thus, unfortunately, forever broken for Duplicacy.) Again, in my personal opinion, every feature should either provide an ideal user experience or not be offered at all, especially features likely to be (ab)used: everyone has Google Drive or Dropbox, so that is what people try first… I did…


This is a known issue: Duplicacy 'check' command can be very slow · Issue #397 · gilbertchen/duplicacy · GitHub

Last time I checked, it looked like GCD didn’t rate limit the listing API call by returning 403 errors (Handling API Errors  |  Drive REST API  |  Google Developers); instead, they just throttled the connections to intentionally slow down the responses. So, the only way to speed it up is perhaps using multiple listing threads.
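A minimal sketch of what multi-threaded listing could look like, in Python purely for illustration. It assumes the storage uses one level of nesting (256 `chunks/XX/` subfolders, as the `STORAGE_NESTING Chunk read levels: [1]` line in the log above suggests); `list_prefix` is a hypothetical stand-in for the real per-folder API call:

```python
from concurrent.futures import ThreadPoolExecutor

def list_prefix(prefix):
    """Hypothetical stand-in for one remote list call on chunks/<prefix>/.

    A real backend would issue a paginated files.list query scoped to
    the chunks/<prefix>/ folder and return the chunk file names."""
    return [f"{prefix}/chunk-{i}" for i in range(3)]  # stub data

def list_all_chunks(threads=8):
    # With one nesting level, chunks live under 256 subfolders
    # chunks/00/ .. chunks/ff/, each of which is an independent
    # query that can be listed in parallel.
    prefixes = [f"{i:02x}" for i in range(256)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(list_prefix, prefixes)  # preserves prefix order
    return [name for sub in results for name in sub]

chunks = list_all_chunks()
```

Each `chunks/XX/` folder has its own page token, so these requests don’t contend on shared pagination state; whether this actually beats server-side throttling would need to be measured against the real Drive API.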


Although I remember that issue, I didn’t remember it having the same cause; that’s why I created this topic on the forum.

I have noticed that Google Drive File Stream is extremely fast the first time I log in (e.g. after installing on a new PC). By fast I mean that it lists/caches my Google Drive folder (with all 8TB of files) in less than 10 minutes. I don’t think that speed can come just from multithreading the list requests.

So although multithreading should be added (it will help every storage backend), for Google Drive I’m thinking that batching may also help.
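Some rough arithmetic on why listing a single huge folder can’t simply be parallelised as-is: pagination is strictly sequential, since each request needs the `pageToken` from the previous response. The chunk count and page size below come from the logs above; the per-request latency is an assumption for illustration only:

```python
chunks = 522_249     # "Total chunk size is 2439G in 522249 chunks" (from the check log)
page_size = 1_000    # pageSize=1000 in the Drive API URL from the retry log line
rtt = 0.5            # assumed seconds per list request (illustrative only)

pages = -(-chunks // page_size)   # ceil division: number of round trips
sequential = pages * rtt          # each page must wait for the previous pageToken

# If the work could instead be split into independent queries (e.g. one
# per chunk subfolder) and run 10-wide, wall time would shrink roughly
# linearly -- ignoring any server-side throttling:
parallel_10 = sequential / 10
```

With these assumed latencies the sequential pass alone costs minutes, not the observed hour-plus, which is consistent with the previous reply: the dominant cost here appears to be intentional throttling rather than round-trip count.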

Hello,

Still testing whether Duplicacy suits our business needs. This is my second challenge, and as far as I can see it is related to this topic.

I have set up a dedicated Linux storage server (80TB btrfs filesystem) for a Duplicacy application running on a Windows server, connected to the storage over SFTP from Windows to Linux through a 2Gbps network link (both on the same Cisco 48x10Gbps switch). Both servers have enough threads, RAM and GHz for any task.

The first backups were reasonably fast (chunk listing took a few minutes, and after that, data transferred at an almost constant 400~500Mbps). But as the storage collects more chunks (about 12TB of stored data on the Windows side, or about 8TB on the chunk side, with about 2,000,000 chunks), even the tiniest backup task now takes up to 4 hours (never less than 1 hour) to ‘list all chunks’, even at night, with no users pulling data from anywhere.

Traffic between the two servers peaks at no more than 1Mbps during the chunk-listing stage (the Windows server has 10Gbps available, the Linux server 2Gbps). CPU load is no more than 0.4% on either server (I tried check tasks with -threads 4, but it is no faster than with -threads 1), so I cannot figure out why chunk listing is soooooo slow.

(So the issue might not be specific to Google Drive, but apply to any remote storage, as far as I can see.)

I have concluded that, if I can store all my data on the same source drive, instead of backing up different directories scattered across the server’s disks, it may be better to back up one entire drive and set some exclusions on it. At least chunk listing will be performed only once per backup. Am I right?

What about creating a ‘storage-server-side application’ for storage-related tasks like ‘check’, or ‘chunk listing, then transferring the result to the client in one go’? This should enhance performance and parallel processing.

This is to be expected with chunks in the millions, even over a high-speed LAN. AFAIK, most storage types, including and especially SFTP, use a single thread when doing ListAllFiles(), regardless of -threads. Google Drive was optimised to bulk-send the file list, I gather due to specific API support.

It’d be really nice if SFTP could be multithreaded, for sure.

The good news is that incremental backups (subsequent backups of the same snapshot ID) don’t do the file listing process.

It does mean the first backup of any snapshot ID, going to a storage where millions of chunks already exist, will take a significant amount of time.

(I encountered this with Vertical Backup, which ran out of RAM, so had to backup to a temporary, empty, storage that was made copy-compatible, and then copied the backup to the real storage using a system with enough RAM. Subsequent backups run fine.)

Honestly, I dunno why Duplicacy has to do it this way.

Sure, it’s normally more efficient to get a chunk list and check against it than to make another API call per chunk, before upload, to check whether it really exists. But there comes a threshold where it’s no longer efficient. Heck, I could start with an empty repository (source), make a backup, and incremental backups would have an empty chunk list to test against. So why not imagine a revision file 0, with nothing in it, and start from there?
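The tradeoff described here can be made concrete with back-of-the-envelope numbers (all the costs below are assumed, purely for illustration): a full listing pays a fixed cost proportional to the total chunks in storage, while per-chunk existence checks pay per chunk actually uploaded, so each approach wins in a different regime:

```python
def list_cost(total_chunks, page_size=1000, rtt=0.5):
    # One paginated listing pass over every chunk in storage:
    # cost grows with storage size, regardless of backup size.
    return -(-total_chunks // page_size) * rtt  # ceil(pages) * latency

def probe_cost(new_chunks, rtt=0.1):
    # One existence check per chunk to be uploaded:
    # cost grows with backup size, regardless of storage size.
    return new_chunks * rtt

# Small incremental backup against a huge storage: probing wins.
small_backup = probe_cost(new_chunks=200) < list_cost(total_chunks=2_000_000)

# Large initial backup against a small storage: listing wins.
big_backup = list_cost(total_chunks=10_000) < probe_cost(new_chunks=50_000)
```

Which side of the threshold a given backup falls on depends on the ratio of new chunks to stored chunks and on the relative latencies, which is presumably why neither strategy is universally better.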
