Wow, I have the same problem with Wasabi.
Check operations take the longest to complete.
Almost 1h to check 500 GB.
@gchen could we implement multiple threads to check s3 storage as well?
The s3 backend can list the chunks directory recursively and it is already very efficient, so there is no need to use multiple threads.
On my 235G Wasabi storage it only takes 23 seconds to list 48K chunks:
2020-06-15 21:00:08.600 TRACE LIST_FILES Listing chunks/
2020-06-15 21:00:31.444 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
...
2020-06-15 21:00:35.822 INFO SNAPSHOT_CHECK Total chunk size is 235,363M in 47917 chunks
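A recursive S3 listing like the one above boils down to paginated ListObjectsV2 calls under the chunks/ prefix, following continuation tokens until the listing is exhausted. A minimal sketch of that pagination loop, using a stubbed client in place of a real S3 connection (the FakeS3 stub and its page size are assumptions for illustration, not Duplicacy's actual backend code):

```python
def list_all_keys(client, prefix):
    """Collect every key under prefix, following continuation tokens."""
    keys, token = [], None
    while True:
        kwargs = {"Prefix": prefix, "MaxKeys": 1000}
        if token:
            kwargs["ContinuationToken"] = token
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page["Contents"])
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]

class FakeS3:
    """Stand-in client: serves n keys in pages of up to MaxKeys."""
    def __init__(self, n):
        self.keys = [f"chunks/{i:06x}" for i in range(n)]
    def list_objects_v2(self, Prefix, MaxKeys, ContinuationToken=None):
        start = int(ContinuationToken or 0)
        page = self.keys[start:start + MaxKeys]
        end = start + len(page)
        return {
            "Contents": [{"Key": k} for k in page],
            "IsTruncated": end < len(self.keys),
            "NextContinuationToken": str(end),
        }

print(len(list_all_keys(FakeS3(2500), "chunks/")))  # 2500
```

Since each call returns up to 1000 keys, even ~100K chunks is only ~100 sequential requests, which is why a flat recursive listing is fast without extra threads.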
That's strange.
For me, with about 500 GB, it takes 30 minutes to list 98,451 chunks and another 30 minutes to check all snapshots and revisions.
So 1 hour in total.
What could be wrong, @gchen?
Just ran check -d:
2020-06-17 18:46:55.515 TRACE LIST_FILES Listing chunks/
2020-06-17 18:47:39.364 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
2020-06-17 19:18:40.592 INFO SNAPSHOT_CHECK Total chunk size is 427,554M in 105613 chunks
It takes 31 minutes for 432 GB.
What am I doing wrong?
Maybe nothing. Which region are you using?
As another Wasabi data point (us-west-1 region, -log check -fossils -resurrect -a -tabular): listing all chunks takes me a little more than 4 minutes for ~370,000 chunks (~1.6 TB). The rest of the check operation takes ~8 minutes.
Hmm, the time between "Listing chunks" and "Listing all snapshot ids" in your log is 44 seconds.
How many snapshots does your storage have? Because my recent GCD check, with -d, has a line about this before the "Total chunk size" bit:
2020-06-16 15:48:30.272 INFO SNAPSHOT_CHECK Listing all chunks
2020-06-16 16:41:00.007 INFO SNAPSHOT_CHECK 13 snapshots and 1910 revisions
2020-06-16 16:41:00.011 INFO SNAPSHOT_CHECK Total chunk size is 1251G in 320761 chunks
I'm using Wasabi us-east-1.
2020-06-17 18:46:55.515 TRACE LIST_FILES Listing chunks/
2020-06-17 18:47:39.364 TRACE SNAPSHOT_LIST_IDS Listing all snapshot ids
2020-06-17 19:18:40.589 INFO SNAPSHOT_CHECK 13 snapshots and 16301 revisions
2020-06-17 19:18:40.592 INFO SNAPSHOT_CHECK Total chunk size is 427,554M in 105613 chunks
Maybe it's the number of revisions.
But that would only affect the second stage, and listing still takes 31 minutes.
Yeah, listing snapshots, not chunks. The optimisation above, for Google Drive, S3 et al., is for listing chunks, which used to be very time-consuming.
As far as I can see, that doesn't appear to be your problem here, as yours is done in ~44 seconds before going on to listing snapshots. That seems to be the time-consuming part, which is rather weird.
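For backends that cannot list recursively in one pass, the usual chunk-listing speed-up is to list the two-hex-digit sub-directories of chunks/ concurrently. A rough sketch of that fan-out pattern, assuming the common chunks/xx/ layout and using a hypothetical stand-in for the backend's per-prefix listing call (not Duplicacy's actual storage interface):

```python
from concurrent.futures import ThreadPoolExecutor

def list_chunks_parallel(list_prefix, threads=10):
    """Fan out one listing call per two-hex-digit chunk sub-directory."""
    prefixes = [f"chunks/{i:02x}/" for i in range(256)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(list_prefix, prefixes)
    return [key for sub in results for key in sub]

# Hypothetical backend: pretend each sub-directory holds three chunk files.
fake = lambda prefix: [prefix + name for name in ("aa", "bb", "cc")]
print(len(list_chunks_parallel(fake)))  # 768
```

With the listings dominated by per-request latency rather than bandwidth, running them concurrently cuts wall-clock time roughly in proportion to the thread count.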
Between these two lines, Duplicacy loaded all revisions one by one and made sure that every referenced chunk was in the list of existing chunks. Because you have such a large number of revisions (16,301 of them), 31 minutes is very reasonable.
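That second stage is essentially a set-membership check: build a set of existing chunk IDs from the listing, then walk every revision and verify that each referenced chunk is present. A minimal sketch (the data shapes and names here are illustrative assumptions, not Duplicacy's actual snapshot format):

```python
def check_revisions(existing_chunks, revisions):
    """Return the set of referenced chunks missing from storage."""
    existing = set(existing_chunks)
    missing = set()
    for revision, referenced in revisions.items():
        missing.update(c for c in referenced if c not in existing)
    return missing

chunks = {"c1", "c2", "c3"}
revs = {("test-id", 1): ["c1", "c2"], ("test-id", 2): ["c2", "c4"]}
print(check_revisions(chunks, revs))  # {'c4'}
```

Each membership test is cheap, but every revision must be downloaded and walked, so the cost scales with the revision count: 16,301 revisions takes far longer than a few hundred.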
Fair enough.
The truth is I have never pruned or deleted any snapshots.
I think I can delete lots of snapshots, so it's time to prune a bit.
Thank you all.
My problem is that the prune command is a bit limited in the web-ui.
I have posted a new topic with a web-ui UX request for custom cli commands.
Actually it may seem limited, but it's really not. You can just ignore the proposed retention options and paste the full options into the "Options" field.
And when editing a prune job, you only have the Options field to deal with anyway.
Interesting, so it would be something like:
-id test-id -keep 30:360 -keep 7:180 -keep 1:30 -threads 8
Again, does the -id parameter work, for example?
Are there any restrictions on cli options?
If so, we should really update the descriptions or the web-ui guide.
I will try, but I didn't want to guess with prune.
Thank you
Yes, something like that. The options are just passed to the CLI, so I don't see why -id wouldn't work.
Well, you can always use the -dry-run option. Assuming that one works, of course.
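For instance, a conservative first pass in the Options field could look like the line below (the repository id, retention values, and thread count are just placeholders from earlier in the thread); with -dry-run, the CLI reports what would be deleted without touching the storage:

```
-id test-id -keep 30:360 -keep 7:180 -keep 1:30 -threads 8 -dry-run
```

Once the dry-run output looks right, the same line minus -dry-run performs the actual prune.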
Exactly, that is precisely the question.
Yes, -dry-run will work as an option.
Iāll release a new CLI version next week.
I can confirm this latest version fixes my original issue. Listing all chunks now runs in 15 mins on 15.5TB of data on GCD with 10 threads. Thank you!
2020-07-06 01:00:01.698 INFO STORAGE_SET Storage set to gcd://backup
2020-07-06 01:00:04.038 INFO SNAPSHOT_CHECK Listing all chunks
2020-07-06 01:15:08.902 INFO SNAPSHOT_CHECK 6 snapshots and 114 revisions
2020-07-06 01:15:09.034 INFO SNAPSHOT_CHECK Total chunk size is 15548G in 3243443 chunks