The "check" command is so slow that it can never finish. Is there anything that can be done to improve the speed of "check"?

I’ve noticed that check actually never finishes for me, because it’s just way too slow. Here’s the beginning of the log:

Running check command from C:\ProgramData/.duplicacy-web/repositories/localhost/all
Options: [-log check -storage BackupName -threads 10 -a -tabular]
2022-09-26 21:00:02.091 INFO STORAGE_SET Storage set to gcd://Duplicacy
2022-09-26 21:00:06.047 INFO SNAPSHOT_CHECK Listing all chunks
2022-09-27 01:32:30.916 INFO SNAPSHOT_CHECK 4 snapshots and 36806 revisions
2022-09-27 01:32:30.994 INFO SNAPSHOT_CHECK Total chunk size is 19453G in 6319161 chunks
2022-09-27 01:32:33.912 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 1 exist
2022-09-27 01:32:36.467 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 2 exist
2022-09-27 01:32:38.982 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 3 exist
2022-09-27 01:32:41.384 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 4 exist
2022-09-27 01:32:44.076 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 5 exist
2022-09-27 01:32:46.772 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 6 exist
2022-09-27 01:32:49.413 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 7 exist
2022-09-27 01:32:52.111 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 8 exist
2022-09-27 01:32:54.584 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 9 exist
2022-09-27 01:32:57.056 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 4 at revision 10 exist

So one of the first steps in the log, Listing all chunks, already takes 4.5 hours.

After that, when it actually starts with the “real” work, it completes one “All chunks referenced by snapshot 4 at revision X exist” line roughly every 3 seconds. For the check to finish, it would apparently need to do this 36806 times, because there are 36806 revisions. That’s 110,418 seconds, or 1,840 minutes, or 30.67 hours. So in total, it would need to run for over 35 hours to finish. My PC runs at most 15 hours per day; I shut it down at night (especially with how energy prices have risen lately). So in the 15 hours it does run, it’s impossible to finish a command that needs 35 hours.

Is there anything that can be done to improve the speed of check? I am running it with -threads 10 already, but I’m not sure if that really affects the speed at all.

I am using the Duplicacy WebUI, and am backing up to Google Drive. My Internet speed is 500/50 Mbit/s, my CPU is a Ryzen 9 3950X, I’m on Windows 10, and Duplicacy is installed on my very fast Samsung 980 Pro 2 TB NVMe SSD. So hardware-wise it should already be as fast as possible.

Yes, to have the check run faster (which, by its nature, is supposed to query every single file), use a low-latency backend. It’s as simple as that.

Google Drive is designed for an entirely different purpose and does not perform well when [ab]used as a replacement for S3/GCS/Azure. There is some small wiggle room to improve batching, similar to the --fast-list option rclone implements, but it’s not magic; ultimately, fetching metadata for millions of files via the Google Drive API will still take ages.

Once you are in the realm of needing a few hours just to list chunks, it’s time to stop abusing a file sharing and collaboration service (which Google Drive is; you are paying for those features, and the storage is incidental to that main value proposition) and move on to actual cloud storage (GCS, S3, Azure, etc.). Use the right tool for the job, as the saying goes. Yes, you won’t get unlimited storage for pennies, but you have to pick two between speed, cost, and reliability, and where backup storage is concerned, where you want speed and performance, you can’t expect it to be free.

On a separate note, looking at your log, I see a few ways to improve your performance somewhat:

  1. Don’t use more than 4 threads with google drive (to avoid throttling)
  2. Create your own Google project and issue yourself credentials from there. This will prevent you from inheriting the limitations of the shared project duplicacy issues tokens from.

This will be marginally better, but don’t expect miracles.
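
For example, if you run the CLI directly, the same check from your log with the thread count dropped to 4 and debug logging turned on would look roughly like the line below (storage name taken from your log; adjust it to your setup). In the Web UI you’d put the equivalent options into the check job instead.

  duplicacy -log -d check -storage BackupName -a -tabular -threads 4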

Interesting, thanks!

I’m surprised you say that Google Drive isn’t well suited for this. When I started using Duplicacy around two years ago, I was actually using “Google Cloud Storage” at first, so a “real” cloud backend, but Duplicacy was doing so many expensive file operations that it wasn’t sustainable at all. Also, I’m really not a fan of having to use one of those complex cloud storage solutions. I am just a regular person wanting to back up my PC; I don’t understand all the complex UI of something like Google Cloud Storage. I don’t want to have to know what a bucket is or what their millions of options and menus mean.

So after that, I switched to Google Drive, which is a nice flat $20 a month for unlimited storage on the enterprise plan, and way more convenient to use as a regular user. And so far I haven’t had any significant issues with using it for Duplicacy. I can back up (upload) at pretty much my full 50 Mbit/s upload speed, and backups also seem to run and finish reasonably fast. Restores unfortunately don’t reach my 500 Mbit/s download speed, they’re also only around 50 Mbit/s, but since I hopefully will never need to restore a lot of data, that should be fine too.

So since what matters works very well with Google Drive, do you really think it’s not the best option?

As far as I understand, “check” is not super important to run. I have read in another thread that if you never run “prune” (which I don’t), then in theory you also never need to run “check” as Duplicacy will only add new things on top, and so there’s nothing that could break anyways. Is that not correct? But without being able to run “check”, I can’t get pretty graphs in the dashboard, so that’s probably the main reason why I want to run it.

Also, since it’s certainly not important to run check super often, couldn’t Duplicacy just be configured to continue the “check” where it stopped after a PC restart? Then it wouldn’t matter if it takes 10 or 30 or 100 hours, as long as it finishes once a month or so, that would probably be good enough.

  1. Don’t use more than 4 threads with google drive (to avoid throttling)

Thanks, I’ll try that! Is there actually some throttling limit from Google Drive that specifically makes 4 threads the ideal number?

  1. Create your own Google project and issue yourself credentials from there. This will prevent you from inheriting the limitations of the shared project duplicacy issues tokens from.

Since it’s been quite a while since I started doing this backup, I don’t really remember how I configured it. Is there anything in the log that shows you that I am not already using my own credentials from Google?

Here is a more detailed comment on the topic, (Newbie) Backup entire machine? - #6 by saspus. While Google Drive is one of the best ones out there in terms of robustness and resilience (for example, with OneDrive you would have hit a wall much sooner), it still suffers from the same issues. I’ve been using G Suite and later Workspace for various uses, including mountable cloud storage and a backup target, and eventually gave up.

Yes, it can get expensive. But that sort of storage is infinitely scalable, so for large backups it may be the only option. I’m still looking forward to duplicacy supporting archival storage like Amazon Glacier Deep Archive; a few other tools do, and with those tools the cost is very manageable: for my 2 TB ongoing backup I’m paying about $2.80/month.

I agree, though, that using hot storage such as B2, Wasabi, or Amazon S3 Standard is massive overkill and a waste of money: it’s expensive because it has properties that are useless in backup applications.

Yes, if this is acceptable to you, there is no reason not to continue using it, with the understanding that nothing unlimited lasts forever: eventually Google may figure out how to enforce quotas and close this loophole. When and if this will happen is anybody’s guess, but while it lasts I don’t see a reason not to use it, if the drawbacks are acceptable.

I think it’s subjective, and I think it’s not the best option for all those reasons listed above and in the linked comment. But if

  • your backup manages to finish fast enough to keep up with new data being generated
  • Check does not need to be fast – it can run in the background, launched from say a cloud instance, if you don’t want to keep your PC running all the time
  • Restore is a rare event and therefore no reason to optimize it

– then it is fine.

Correct. I had a lot of prune failures with Google Drive that kept leaving stale snapshots in the datastore, which the check then complained about. It needs to be fixed in duplicacy; it has been discussed many times, so we just wait for gchen to get around to implementing it.

On the other hand, you are right: it’s unlimited storage, so there is no point in pruning (as long as the number of snapshots does not grow too fast). I personally don’t prune at all anymore. This also allows making the storage immutable with certain remotes, further hardening the backup.

IIRC this is what provided the best performance and avoided “backing off” messages in the log (you can add -d to the global flags to get more verbose logging and see what’s going on there).

If you have created a token from duplicacy.com/gcd_start then you are using the shared project. Open the token file in a text editor and see if duplicacy.com/gcd_refresh is mentioned anywhere. I’ve been using a service account with an unreleased version of duplicacy that honors a few additional flags in the token file, and described the process here: Duplicacy backup to Google Drive with Service Account | Trinkets, Odds, and Ends
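
A quick way to check, assuming your token file sits at a path like the one below (the path and filename are just an example, use wherever yours actually lives):

  findstr /i "gcd_refresh" "C:\ProgramData\.duplicacy-web\gcd-token.json"

If that prints anything, the token was issued via the shared Duplicacy project.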

But in all fairness, if it works for you as is, I would just continue using it. Schedule check in a separate job, if you want, so it does not block the ongoing backup. Duplicacy is fully concurrent and this approach should work fine.

Interesting, thanks!

Yes, if this is acceptable to you, there is no reason not to continue using it, with the understanding that nothing unlimited lasts forever: eventually Google may figure out how to enforce quotas and close this loophole. When and if this will happen is anybody’s guess, but while it lasts I don’t see a reason not to use it, if the drawbacks are acceptable.

I’m not sure why you call it a “loophole”? I am using the “Google Workspace Enterprise Standard” plan. With that plan you officially get unlimited Google Drive space for $20/user/month, so it’s not a “loophole”. I’m sure Google wouldn’t offer it if they didn’t make a profit on average from it. Though to be fair, the option to select that plan is quite hidden in their UI.

  • your backup manages to finish fast enough to keep up with new data being generated

Backup really is perfectly fast. I have “backup” set to hourly, and the “backup” job for my 2 TB C: drive, for example, needs a total of 10 minutes to run when it doesn’t find any changed data. So I can nicely run it hourly. And when it does find changed data, I see that it’s uploading at nearly 50 Mbit/s, so it’s limited by my upload speed and not by anything else.

  • Check does not need to be fast – it can run in the background, launched from say a cloud instance, if you don’t want to keep your PC running all the time

How would I do that? How can I run “check” from a “cloud instance”? Does it not need access to the local files?

IIRC this is what provided the best performance and avoided “backing off” messages in the log (you can add -d to the global flags to get more verbose logging and see what’s going on there).

Interesting, I’ll certainly add -d then. So until I see any “backing off” messages I can increase the thread count for more speed? Does the -threads option actually have any effect on “check”?

If you have created a token from duplicacy.com/gcd_start then you are using the shared project. Open the token file in a text editor and see if duplicacy.com/gcd_refresh is mentioned anywhere.

Is that token file stored in any duplicacy folder? If so, I can’t find it. Where would that be?

But in all fairness, if it works for you as is, I would just continue using it. Schedule check in a separate job, if you want, so it does not block the ongoing backup. Duplicacy is fully concurrent and this approach should work fine.

I have always been running all my backups and check in parallel, yes.

I just tried running the check with -d -threads 100, and then I do get a lot of these messages in the log:

DEBUG GCD_RETRY [0] User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=243147021227; retrying after 2.33 seconds (backoff: 2, attempts: 1)

So yes, the -threads option does seem to do something.

And it seems 243147021227 is the project ID of the shared Duplicacy project, so it seems I am not using a project token I configured myself. If I configured a project token myself, could I set the rate limit so high that I could really use 100 threads? What’s the downside then?

Also, if I create my own token with my own project, can I change my existing backup to use that somehow? It seems the token is supplied when initializing a duplicacy backup using duplicacy init; I can’t find how to change the token for an existing setup like mine.

No, they never use the word “unlimited”. They specifically say “As much storage as you need*” with an asterisk noting that it’s 5 TB of pooled storage per user, but enterprise customers have an option to request more on an as-needed basis. The loophole is that those quotas have never been enforced, and de facto, even on the $12/month Business Standard plan you can upload an unlimited amount of data. When they will close this bug, nobody knows but them.

In other words, you are not paying $20 for unlimited storage; you are paying for enterprise SaaS features, endpoint management, data vaults, etc, that you may or may not use, and along with that get some amount of storage; it’s not the main product, it’s just incidental to all the other features and is pretty much given away for free. You happen to get unlimited storage as long as google continues to neglect enforcing quotas, for whatever reasons they have for doing so.

I don’t work for Google, and don’t know the details of why they are not enforcing published quotas; regardless, this is a very weak argument: plenty of companies that offered unlimited data on various services ceased to do so. Amazon Drive comes to mind as one of the recent departures.

It’s not really hidden; it’s alongside the rest of the plans in the dashboard.

Check only checks that the chunks the snapshot files refer to exist in the storage. If you add the -chunks argument, it will also download chunks and verify their integrity. You can set up duplicacy on a free Oracle Cloud instance and have it run check periodically. I would not do that; I’m of the opinion that storage must be trusted, and it’s not the job of a backup program to worry about data integrity, but it’s a possibility.
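
Roughly, on such an instance it would be something like the lines below (the directory and snapshot ID are made up, the storage URL is the one from your log, and you’ll be prompted for the token file and storage password as needed):

  mkdir -p ~/check-runner && cd ~/check-runner
  duplicacy init dummy-check-id gcd://Duplicacy   # point at the same storage; the ID here is arbitrary
  duplicacy -log check -a -tabular -threads 4     # schedule this line with cron or a systemd timer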

Yes, but do you need more speed? If anything, you’d want to slow down the backup to avoid surges in usage. As long as daily new data gets eventually uploaded, speeding up backup buys you nothing. I was running duplicacy under CPULimit-er in fact, because I did not want it to run fast.
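
(If you’d rather not use an external limiter, I believe the CLI’s backup command also accepts a -limit-rate option, in KB/s, which achieves a similar effect. Something like the line below; the number is arbitrary.)

  duplicacy backup -storage BackupName -threads 1 -limit-rate 1024   # cap uploads at roughly 1 MB/s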

Look in C:\ProgramData/.duplicacy-web/repositories/localhost/0/.duplicacy/preferences file. It will refer to the token file.

Awesome. That’s the backoff thing I was referring to.

I don’t have an answer to that… I’ve been using it with 1 thread. But I’d expect there to be some per-user limits on the number of concurrent connections. The downside is that at some point managing a bunch of threads will be more overhead than benefit. But you can play with it and see how it works.

Yes, duplicacy does not care how it accesses files as long as it does.

Duplicacy saves a path to a token file (even though it would have been more logical to save the token content). You can just replace that file with something else. Or just delete the storage and then add a new one, this time using your new token. If you give the storage the same name you won’t need to redo the schedules; they will continue working. That will be easier than trying to pull the rug out from under it. Especially since you use the Web GUI, the actual path to the token file is encrypted in duplicacy_web.json, and the .duplicacy/preferences files are generated from that. So pretty much your only way is to delete the storage from duplicacy_web and then add it back with the same name.

No, they never use the word “unlimited”. They specifically say “As much storage as you need*” with an asterisk noting that it’s 5 TB of pooled storage per user

When I go to the UI where I see my current plan, I see it say “as much storage as you need”, and I can’t see any asterisk.

Yes, but do you need more speed? If anything, you’d want to slow down the backup to avoid surges in usage. As long as daily new data gets eventually uploaded, speeding up backup buys you nothing. I was running duplicacy under CPULimit-er in fact, because I did not want it to run fast.

I was talking about the “check” command, where it appears I do need more speed to be able to use it at all.

Look in C:\ProgramData/.duplicacy-web/repositories/localhost/0/.duplicacy/preferences file. It will refer to the token file.

That preferences file exists at that path, but in there I can’t see anything referring to any token file.

So pretty much your only way is to delete the storage from duplicacy_web and then add it back with the same name.

Interesting. So it’s safe to delete a storage in the WebUI; there’s no way it will accidentally delete any other stuff that’s needed? So when I add it back, everything is guaranteed to still work?

IIRC, technically Enterprise plans are unlimited as long as there are 5 licensed users or more per organization. Until then it is 5TB per user. But they indeed do not enforce these quotas on Enterprise plans, though I’ve heard that some users over the limit received email notifications earlier this year.

To be clear, the reason @TyTro’s check is taking ages has little to do with Google. It can take hours even with local storage - with 6 million chunks and 38k snapshots, the ListAllFiles() process can take eons.

This is too much overhead for Duplicacy to deal with in a linear manner, and it’s closer to an exponential curve. It’ll only get worse the longer the OP puts off pruning their hourly snapshots. This, to me, is the main purpose of prune - not just to reduce the total amount of data, but to make the number of snapshots more manageable for operations like prune, check, and copy.

@TyTro Even a ‘light’ prune schedule of -keep 1:14 will help you out immensely. It’ll get rid of the hourly snapshots beyond 2 weeks, keeping only 1 a day forever. (Do you really need hourly snapshots going back that far? If so, I’d recommend git to manage your project, or use different retentions for different backup IDs.) Either way, would strongly suggest you consider doing this. Not only will it reduce the number of snapshots and chunks - if you run it regularly, the runtime from then on will remain relatively fixed, and linear.
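
For reference, run from the CLI against all snapshot IDs, that retention would look something like this (the thread count is just a suggestion):

  duplicacy -log prune -a -keep 1:14 -threads 4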

Incidentally, Duplicacy could indeed support batch listing (implemented by Rclone’s --fast-list) and probably speed up the listing process 20x, which isn’t insignificant. It’s already supported by Google’s API, so no more excuses to keep knocking it.

Thanks!

@TyTro Even a ‘light’ prune schedule of -keep 1:14 will help you out immensely. It’ll get rid of the hourly snapshots beyond 2 weeks, keeping only 1 a day forever. (Do you really need hourly snapshots going back that far? If so, I’d recommend git to manage your project, or use different retentions for different backup IDs.) Either way, would strongly suggest you consider doing this. Not only will it reduce the number of snapshots and chunks - if you run it regularly, the runtime from then on will remain relatively fixed, and linear.

I don’t need hourly snapshots, I just set it to hourly thinking “more doesn’t hurt, I have a fast PC and infinite storage, so why not”. I didn’t know that Duplicacy scales badly with many revisions. Once per day or once per week would be totally fine for me.

I have changed my schedules to once per day now; that should help in the future. I am also trying to run prune as you suggested, but unfortunately I’m not having much success with that: prune seems to take forever and doesn’t even show any indication of how long it will take. I’ve created a new thread about the issue with prune:

I’ve been using a service account with an unreleased version of duplicacy that honors a few additional flags in the token file, and described the process here: Duplicacy backup to Google Drive with Service Account | Trinkets, Odds, and Ends

@saspus I am trying to follow that nice guide you wrote now that my CLI version is 3.0.1. But I have an issue at step 18.

In there you write to enable the checkbox “Enable G Suite Domain-wide Delegation”:

But that checkbox does not exist for me. Probably because I am using Google Workspace, not G Suite. Do I need that checkbox, or can I just ignore that it does not exist?

This is how it looks for me:

Edit:

I do think it works fine, even without being able to check that checkbox. Duplicacy seems to be able to successfully access the storage.

Now I just have the issue that for doing this:

delete the storage from duplicacy_web and then add it back with the same name.

I seem to need to know the encryption password, right? And it seems I don’t know that anymore. Or more exactly, the password I clearly remember it being is not recognized as the correct password when I try entering it in the “Settings” of Duplicacy Web.

Actually, I’m a bit confused, why is there an “encryption password” setting in the settings of Duplicacy Web, that seems to globally apply to all storages, and then there’s another encryption password I need to enter when adding a new storage? Are those different passwords or the same one?

There is a link about delegation, no? What’s written there? Maybe they moved the checkbox elsewhere.

Without enabling delegation, the service account won’t be able to access the user’s account. It will only see its own 35 GB account.

That was on Google Workspace, by the way; they still referred to G Suite in many places. I don’t have a Google Workspace account anymore, so I can’t look it up for you, sorry.

Ensure the data actually goes to the intended account and not to the service account itself.

This protects the web UI itself and encrypts other passwords. It is not storage encryption.

If you don’t have your storage encryption password anymore, and cannot extract it from a keychain — start over, as this backup is useless for you.

Ensure the data actually goes to the intended account and not to the service account itself.

I have tested adding a new storage, and Duplicacy is correctly adding a new folder with the name I enter on the correct google drive in the correct path. So it seems to work correctly.

This protects the web UI itself and encrypts other passwords. It is not storage encryption.
If you don’t have your storage encryption password anymore, and cannot extract it from a keychain — start over, as this backup is useless for you.

I assume that means the password I remember is the password for the storage and not the password for the web UI, then. Can I reset the password for the Web UI somehow? I don’t remember creating two different passwords at all when I set up Duplicacy.

So just to be sure I understand it correctly, there are three different passwords in total? Because in the Settings of the Web UI there’s also the “Administrative Password”, which is blank for me as I don’t use it. So there is the “Encryption Password” of the WebUI, then there’s the “Administrative Password” of the WebUI, and then there’s a per-storage “Encryption Password”?

Is there any way to test whether I still remember the password for the storage, without deleting my current storage? I’d like to keep it if it turns out I don’t actually remember the password.

I managed to remember the password, so all is working now I think. I have deleted the storage, added it back with the same name and my custom .json file, and as far as I can see, everything is working now. So now I need to try running check or prune with 50 threads and hopefully it will be faster :slight_smile:

Edit:

I’m running a check and a prune simultaneously, each with 50 threads now, and I can see in the Google Drive metrics that Duplicacy is only doing ~4 API calls per second. The CPU usage of Duplicacy is ~9%, the RAM usage is 2.4 GB (of 128 GB available), and disk usage is 0.1 MB/s. check does not seem to be any faster now. So I’m still not sure what actually slows it down?

I would really like to either get my CPU to 100%, get my RAM to 100%, get my SSD to 100%, or get my Google Drive API Limit to 100%, which is 20,000 API calls per 100 seconds. So what exactly do I need to do to achieve this, maxing out at least one of these things?

Or is that impossible because the problem is actually that the slow check code in Duplicacy is single-threaded and doesn’t benefit from good hardware or a good cloud service?

I can see in Task Manager that duplicacy_win_x64_3.0.1.exe only has a total of 38 threads. Why? I am clearly running it with -threads 50, and I’m sure Windows isn’t lying, so why is it not spawning at least 50 threads?

It won’t be faster, and I would not exceed 4 threads. If you want crazy parallelism, use Google Cloud Storage or Amazon S3. Google Drive was not designed for the use case duplicacy users force on it, and whether due to design decisions or anti-abuse limitations, latency will be quite large.

This, however, does not matter at all. Backup is a background process; if prune takes 4 weeks, there is no hurry.

It won’t be faster, and I would not exceed 4 threads. If you want crazy parallelism, use Google Cloud Storage or Amazon S3. Google Drive was not designed for the use case duplicacy users force on it, and whether due to design decisions or anti-abuse limitations, latency will be quite large.

That does not make sense to me based on what I’m seeing. Why do you think that I am slowed down in any way by Google Drive? I am far from my API limits (20,000 queries per 100 seconds). Latency should be completely irrelevant when stuff can happen in parallel. Even if asking “does a chunk exist” took 1 minute to get a reply, it could still be fast to check whether 5 million chunks exist by just doing 10,000 things in parallel. Which is what the threads option is doing, right?

So I don’t think you’re right here; this does not seem to be related to any limit from Google Drive. I’m sure this would be exactly as fast on S3, because check does not even use the network at all for the slow part of the work: 0 API calls for hours. It’s only working on local data. The limit here seems to be that the code Duplicacy is running can’t make full use of the available resources.

This, however, does not matter at all. Backup is a background process; if prune takes 4 weeks, there is no hurry.

It’s normal to reboot a PC at least once a day - so if something takes 4 weeks and can’t resume after a restart, it can never finish for an average user, which is a big problem.

Are you passing any arguments to check? If not, it’s just list request(s). They are very slow with Google Drive. Having thousands of files in one “folder” is also not a use case Google Drive is optimized for. There is a way to fetch the list in bulk (rclone uses it, see the --fast-list option; it’s marginally faster), but duplicacy does not do that. In my experience, a prune on a 2.5 TB Google Drive duplicacy backup took about 14 days. Now I’m using tools and services designed for the purpose, don’t have these issues, and recommend everyone do the same. A few bucks of savings a month comes at a huge cost of time investment and is a false economy. My advice: don’t use *-drive services as backup targets.

It’s horrific and absolutely not normal. Why would you do that so often? If you are having issues causing you to restart — address those issues first. I, and everyone I know, only reboot on major OS updates. Even portables — sleep/wake/sleep/wake… I’ve just checked, uptime on my laptop is 45 days. Windows gaming rig — 81 days (I disabled updates)

Also, check your network latency; maybe bufferbloat plays a role here, amplifying the effect.

Did you manage to bring the number of revisions down to a sensible number?