The "check" command is so slow that it can never finish. Is there anything that can be done to improve the speed of "check"?

I just tried running the check with -d -threads 100, and then I do get a lot of these messages in the log:

DEBUG GCD_RETRY [0] User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=243147021227; retrying after 2.33 seconds (backoff: 2, attempts: 1)

So yes, the -threads option does seem to do something.

And it seems 243147021227 is the project ID from the shared Duplicacy project, so apparently I am not using a project token I configured myself. If I configured a project token myself, could I set the rate limit high enough to really use 100 threads? What's the downside then?

Also, if I create my own token with my own project, can I change my existing backup to use that somehow? It seems the token is supplied when initializing a duplicacy backup using duplicacy init, but I can't find how to change the token for an existing setup like mine.

No, they never use the word "unlimited". They specifically say "As much storage as you need*", with an asterisk that it's 5TB of pooled storage per user, but enterprise customers have the option to request more on an as-needed basis. The loophole is that those quotas have never been enforced, and de facto, even on the $12/month Business Standard plan you can upload an unlimited amount of data. When they will close this bug, nobody knows but them.

In other words, you are not paying $20 for unlimited storage; you are paying for enterprise SaaS features, endpoint management, data vaults, etc., that you may or may not use, and along with that you get some amount of storage. It's not the main product; it's incidental to all the other features and is pretty much given away for free. You happen to get unlimited storage as long as Google continues to neglect enforcing quotas, for whatever reasons they have for doing so.

I don't work for Google and don't know the details of why they are not enforcing published quotas; regardless, this is a very weak argument: plenty of companies that offered unlimited data on various services have ceased to do so. Amazon Drive comes to mind as one of the recent departures.

It's not really hidden; it's listed alongside the rest of the plans in the dashboard.

Check only checks that the chunks the snapshot files refer to exist in the storage. If you add the -chunks argument, it will also download chunks and verify their integrity. You can set up duplicacy on a free Oracle Cloud instance and have it run check periodically. I would not do that myself; I'm of the opinion that storage must be trusted, and it's not the job of a backup program to worry about data integrity, but it's a possibility.
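If you do want to go that route, a cron job on such an instance could be as simple as this (the repository path, schedule, and log file are placeholders):

# Run a full chunk verification every Sunday at 03:00, from the repository directory.
0 3 * * 0 cd /path/to/repository && duplicacy check -chunks > /var/log/duplicacy-check.log 2>&1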

Yes, but do you need more speed? If anything, you'd want to slow down the backup to avoid surges in usage. As long as daily new data eventually gets uploaded, speeding up backup buys you nothing. In fact, I was running duplicacy under a CPU limiter because I did not want it to run fast.
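For the record, wrapping the CLI in cpulimit looks roughly like this (the 25% cap and the backup arguments are just examples; check your system's cpulimit man page for the exact invocation):

# Cap the duplicacy process at roughly 25% of one CPU core.
cpulimit -l 25 -- duplicacy backup -stats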

Look in the C:\ProgramData\.duplicacy-web\repositories\localhost\0\.duplicacy\preferences file. It will refer to the token file.

Awesome. That's the backoff thing I was referring to.

I don't have an answer to that… I've been using it with 1 thread. But I'd expect there to be some per-user limits on the number of concurrent connections. The downside is that at some point managing a bunch of threads will be more overhead than benefit. But you can play with it and see how it works.

Yes, duplicacy does not care how it accesses files as long as it does.

Duplicacy saves a path to a token file (even though it would have been more logical to save the token content). You can just replace that file with something else. Or delete the storage and then add a new one, this time using your new token. If you give the storage the same name you won't need to redo the schedules; they will continue working. That will be easier than trying to pull the rug out from under the existing setup. Especially since you use the Web GUI: the actual path to the token file is encrypted in duplicacy_web.json, and the .duplicacy/preferences files are generated from that. So pretty much your only option is to delete the storage from duplicacy_web and then add it back with the same name.


No, they never use the word "unlimited". They specifically say "As much storage as you need*", with an asterisk that it's 5TB of pooled storage per user

When I go to the UI where I see my current plan, it says "as much storage as you need", and I can't see any asterisk.

Yes, but do you need more speed? If anything, you'd want to slow down the backup to avoid surges in usage. As long as daily new data eventually gets uploaded, speeding up backup buys you nothing. In fact, I was running duplicacy under a CPU limiter because I did not want it to run fast.

I was talking about the “check” command, where it appears I do need more speed to be able to use it at all.

Look in the C:\ProgramData\.duplicacy-web\repositories\localhost\0\.duplicacy\preferences file. It will refer to the token file.

That preferences file exists at that path, but in there I can't see anything referring to any token file.

So pretty much your only option is to delete the storage from duplicacy_web and then add it back with the same name.

Interesting. So it's safe to delete a storage in the Web UI; there's no way it will accidentally delete any other stuff that's needed? And when I add it back, everything is guaranteed to still work?

IIRC, technically Enterprise plans are unlimited as long as there are 5 licensed users or more per organization. Until then it is 5TB per user. But they indeed do not enforce these quotas on Enterprise plans, though I’ve heard that some users over the limit received email notifications earlier this year.

To be clear, the reason @TyTro's check is taking ages has little to do with Google. It can take hours even with local storage: with 6 million chunks and 38k snapshots, the ListAllFiles() process can take eons.

This is too much overhead for Duplicacy to deal with in a linear manner, and is closer to an exponential curve. It'll only get worse the longer OP goes without pruning their hourly snapshots. This, to me, is the main purpose of prune: not just to reduce the total amount of data, but to make the number of snapshots more manageable for operations like prune, check, and copy.

@TyTro Even a ‘light’ prune schedule of -keep 1:14 will help you out immensely. It’ll get rid of the hourly snapshots beyond 2 weeks, keeping only 1 a day forever. (Do you really need hourly snapshots going back that far? If so, I’d recommend git to manage your project, or use different retentions for different backup IDs.) Either way, I'd strongly suggest you consider doing this. Not only will it reduce the number of snapshots and chunks - if you run it regularly, the runtime from then on will remain relatively fixed, and linear.
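For concreteness, the corresponding commands would look something like this (the tiered values in the second example are purely illustrative, not a recommendation for your data):

# Keep only one snapshot per day for anything older than 14 days.
duplicacy prune -keep 1:14

# A tiered policy: drop everything older than a year, keep weekly past 90 days, daily past 14 days.
duplicacy prune -keep 0:365 -keep 7:90 -keep 1:14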

Incidentally, Duplicacy could indeed support batch listing (as implemented by Rclone's --fast-list) and probably speed up the listing process 20x, which isn't insignificant. It's already supported by Google's API, so there are no more excuses to keep knocking it.
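For comparison, rclone can show the effect of batch listing on the same storage (the remote name and path are placeholders for however your rclone remote is set up):

# Size the chunks directory with the default per-directory listing, then with recursive batch listing.
rclone size gdrive:duplicacy-storage/chunks
rclone size --fast-list gdrive:duplicacy-storage/chunks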


Thanks!

@TyTro Even a ‘light’ prune schedule of -keep 1:14 will help you out immensely. It’ll get rid of the hourly snapshots beyond 2 weeks, keeping only 1 a day forever. (Do you really need hourly snapshots going back that far? If so, I’d recommend git to manage your project, or use different retentions for different backup IDs.) Either way, I'd strongly suggest you consider doing this. Not only will it reduce the number of snapshots and chunks - if you run it regularly, the runtime from then on will remain relatively fixed, and linear.

I don’t need hourly snapshots, I just set it to hourly thinking “more doesn’t hurt, I have a fast PC and infinite storage, so why not”. I didn’t know that Duplicacy scales badly with many revisions. Once per day or once per week would be totally fine for me.

I have changed my schedules to once per day now, so that should help in the future. I am also trying to run prune as you suggested, but unfortunately I'm not having much success with that; prune seems to take forever and doesn't even show me any indicator of how long it will take. I've created a new thread about the issue with prune:

I've been using a service account with an unreleased version of duplicacy that honors a few additional flags in the token file, and described the process here: Duplicacy backup to Google Drive with Service Account | Trinkets, Odds, and Ends

@saspus I am trying to follow that nice guide you wrote now that my CLI version is 3.0.1. But I have an issue at step 18.

In there you say to enable the checkbox "Enable G Suite Domain-wide Delegation":

But that checkbox does not exist for me. Probably because I am using Google Workspace, not G Suite. Do I need that checkbox, or can I just ignore that it does not exist?

This is how it looks for me:

Edit:

I do think it works fine, even without being able to check that checkbox. Duplicacy seems to be able to successfully access the storage.

Now I just have the issue that for doing this:

delete the storage from duplicacy_web and then add it back with the same name.

I seem to need to know the encryption password, right? And it seems I don't know that anymore. Or more exactly, the password I clearly remember it being is not recognized as correct when I enter it in the "Settings" of Duplicacy Web.

Actually, I'm a bit confused: why is there an "encryption password" setting in the settings of Duplicacy Web that seems to apply globally to all storages, and then another encryption password I need to enter when adding a new storage? Are those different passwords or the same one?

There is a link about delegation, no? What’s written there? Maybe they moved the checkbox elsewhere.

Without enabling delegation, the service account won't be able to access the user's account. It will only see its own 35GB account.

That was on Google Workspace, btw; they still referred to G Suite in many places. I don't have a Google Workspace account anymore, so I can't look it up for you, sorry.

Ensure the data actually goes to the intended account and not to the service account itself.

This protects the web UI itself and encrypts other passwords. It is not storage encryption.

If you don’t have your storage encryption password anymore, and cannot extract it from a keychain — start over, as this backup is useless for you.


Ensure the data actually goes to the intended account and not to the service account itself.

I have tested adding a new storage, and Duplicacy correctly adds a new folder with the name I enter, on the correct Google Drive, in the correct path. So it seems to work correctly.

This protects the web UI itself and encrypts other passwords. It is not storage encryption.
If you don’t have your storage encryption password anymore, and cannot extract it from a keychain — start over, as this backup is useless for you.

I assume that means the password I remember is the password for the storage, and not the password for the web UI, then. Can I reset the password for the Web UI somehow? I don't remember creating two different passwords when I set up Duplicacy.

So just to be sure I understand it correctly, there are three different passwords in total? Because in the Settings of the Web UI there’s also the “Administrative Password”, which is blank for me as I don’t use it. So there is the “Encryption Password” of the WebUI, then there’s the “Administrative Password” of the WebUI, and then there’s a per-storage “Encryption Password”?

Is there any way to test whether I still remember the password for the storage, without deleting my current storage? I'd like to keep it if it turns out I don't actually remember the password.

I managed to remember the password, so all is working now, I think. I have deleted the storage, added it back with the same name and my custom .json file, and as far as I can see everything is working. So now I need to try running check or prune with 50 threads and hopefully it will be faster 🙂

Edit:

I'm running a check and a prune simultaneously, each with 50 threads now, and I can see in the Google Drive metrics that Duplicacy is only doing ~4 API calls per second. The CPU usage of Duplicacy is ~9%, RAM usage is 2.4 GB (of 128 GB available), and disk usage is 0.1 MB/s. check does not seem to be any faster now. So I'm still not sure what actually slows it down?

I would really like to either get my CPU to 100%, get my RAM to 100%, get my SSD to 100%, or get my Google Drive API Limit to 100%, which is 20,000 API calls per 100 seconds. So what exactly do I need to do to achieve this, maxing out at least one of these things?

Or is that impossible because the problem is actually that the slow check code in Duplicacy is single-threaded and doesn’t benefit from good hardware or a good cloud service?

I can see in Task Manager that duplicacy_win_x64_3.0.1.exe only has a total of 38 threads. Why? I am clearly running it with -threads 50, and I'm sure Windows isn't lying, so why is it not spawning at least 50 threads?

It won't be faster, and I would not exceed 4 threads. If you want crazy parallelism, use Google Cloud Storage or Amazon S3. Google Drive was not designed for the use case duplicacy users force on it, and whether due to design decisions or anti-abuse limitations, latency will be quite large.

This, however, does not matter at all. Backup is a background process; if prune takes 4 weeks, there is no hurry.

It won't be faster, and I would not exceed 4 threads. If you want crazy parallelism, use Google Cloud Storage or Amazon S3. Google Drive was not designed for the use case duplicacy users force on it, and whether due to design decisions or anti-abuse limitations, latency will be quite large.

That does not make sense to me based on what I'm seeing. Why do you think I am being slowed down by Google Drive in any way? I am far from my API limits (20,000 queries per 100 seconds). Latency should be completely irrelevant when things can happen in parallel. Even if asking "does this chunk exist" took 1 minute to get a reply, it could still be fast to check whether 5 million chunks exist by doing 10,000 requests in parallel. Which is what the threads option is doing, right?

So I don't think you're right here; this does not seem to be related to any limit on Google Drive's side. I'm sure this would be exactly as fast on S3, because check does not even use the network at all for the slow part of the work: 0 API calls for hours. It's only working on local data. The limit here seems to be that the code Duplicacy is running cannot make full use of the available resources.

This, however, does not matter at all. Backup is a background process; if prune takes 4 weeks, there is no hurry.

It's normal to reboot a PC at least once a day, so if something takes 4 weeks and can't resume after a restart, it can never finish for an average user, which is a big problem.

Are you passing any arguments to check? If not, it's just list requests. They are very slow with Google Drive. Having thousands of files in one "folder" is also not a use case Google Drive is optimized for. There is a way to fetch the listing in bulk (rclone uses it, see the --fast-list option; it's marginally faster), but duplicacy does not do that. From my experience, prune on a 2.5TB Google Drive duplicacy backup took about 14 days. Now I'm using tools and services designed for the purpose, don't have these issues, and recommend everyone do the same. A few bucks of savings a month comes at a huge cost in time, and is a false economy. My advice: don't use *-drive services as backup targets.

It's horrific and absolutely not normal. Why would you do that so often? If you are having issues that cause you to restart, address those issues first. I, and everyone I know, only reboot on major OS updates. Even portables just sleep/wake/sleep/wake… I've just checked: uptime on my laptop is 45 days. Windows gaming rig: 81 days (I disabled updates).

Also, check your network latency; maybe bufferbloat plays a role here, amplifying the effect.

Did you manage to bring the number of revisions down to a sensible number?

Are you passing any arguments to check?

I am running it with -d and -threads 50 -id X at the moment.

I can really tell you that the slow part of check is working entirely on locally cached data; it's not doing anything with the network. So whether the storage is on S3 or Google Drive is completely irrelevant.
The first part of check does talk to the storage; that's the “Listing chunks” part. That runs roughly 5 times faster with -threads 50 compared to -threads 10, and I see on the Google Drive API monitor that it's doing a lot of queries. So that part is perfect, and especially with -threads 50 it is super quick, but it is not where check spends most of its time. Check spends most of its time in what comes after, which is what originally would have taken 35 hours for me:

INFO SNAPSHOT_CHECK All chunks referenced by snapshot X at revision Y exist
DEBUG CHUNK_CACHE Chunk X has been loaded from the snapshot cache

and that is doing 0 network requests. I see that both in Task Manager and in my Google Drive API monitor. It's all reading from the local cache, and it doesn't seem to be affected by the threads option. This just seems to be some badly optimized, mostly single-threaded code; that's the whole issue, and nothing else.

I am making progress with that. Not finished yet, but it should be in a few days. 3 of my 4 IDs have finished running a prune with 1:14. The 4th one had a lot of stuff that needed to be deleted, so for that one I had to slowly step down from 1:600 to 1:570 to 1:540, etc., and each step took around 15 hours, which I could schedule to finish before I shut down my PC each day. The slow part of prune actually benefits from more network threads, though, so I can do it much quicker now with 50 threads instead of 10, and I'll have it done in 1-2 days.
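In case anyone else needs to do the same, the step-down approach looks roughly like this as a bash sketch (the retention values and thread count are just what I happened to use):

# Walk the daily retention window down in 30-day steps so each prune run stays manageable.
for days in 600 570 540 510 480; do
    duplicacy prune -keep 1:$days -threads 50
done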

This does not sound right. Duplicacy explicitly checks for the presence of chunks on the storage. The cache is only used when a chunk is confirmed to exist on the target: instead of downloading the file again, it will use the one already cached. The only chunks that are fetched during check are those needed to unpack the snapshot file, and those will be grabbed from the cache if they exist there. I'm surprised to see that it makes no network requests, as the only job check is doing is checking for the presence of chunk files on the target storage.

There would be no reason to check for them locally.

Try enabling the profiler and look at a few stack shots during the lengthy operation, to find out where it actually spends its time:

here: Deadlock during backup - #3 by gchen
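If I remember right, the CLI has a global -profile option that starts a pprof HTTP listener (double-check against duplicacy -help; the address, port, and check arguments below are just examples):

# Start check with the profiler listening locally.
duplicacy -d -profile 127.0.0.1:8080 check -threads 50
# From another terminal while it runs, grab a goroutine stack dump or a 30-second CPU profile:
curl "http://127.0.0.1:8080/debug/pprof/goroutine?debug=1"
go tool pprof "http://127.0.0.1:8080/debug/pprof/profile?seconds=30"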

I'm surprised to see that it makes no network requests, as the only job check is doing is checking for the presence of chunk files on the target storage.

Isn't that what's happening in the first part of what check is doing, when it's “Listing chunks”? This stuff:
Listing chunks/d2; 20487 items returned

That, to me, sounds like the part that's checking whether the chunks exist on the storage. That part is doing network requests and finishes relatively quickly.

And the slow stuff that happens later is checking which chunks a specific revision references, and seeing whether they're in the list of chunks that was fetched at the beginning. And that's all happening locally.


It does make sense. So it's seeking in files that takes time? Is your cache folder on an HDD? Since CPU utilization is low, the only remaining thing is disk I/O.

From what I remember, with about 2TB on Google Drive, check was pretty fast but prune would take weeks.

Do try the profiler; it's really interesting to see what it is doing. There would be either a wait on disk I/O, or something else strange.