Caching of chunk listing

I was wondering if it is possible to cache the chunk directory listing. Sometimes I want to run several consecutive jobs against the same storage, and just about every job starts by re-listing all the chunks in the (remote) storage; with millions of chunks this step can take quite a while. This is especially bad when the actual operation does almost nothing, which would be common for daily operations.

It would be great if, as an option, subsequent operations didn't need to re-list all the chunks again.
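
To make the request more concrete, here is a rough sketch of the kind of thing I mean. Everything in it (`loadListing`, `saveListing`, the cache file name) is made up for illustration; nothing like this exists today, as far as I know. The idea is simply to persist the listing one job fetched and let the next job reuse it if it is fresh enough.

```go
package main

// Hypothetical sketch only: persist the chunk listing one job just fetched so
// the next job can reuse it instead of re-listing millions of remote objects.

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type cachedListing struct {
	FetchedAt time.Time `json:"fetched_at"`
	Chunks    []string  `json:"chunks"` // chunk IDs/paths seen in storage
}

// loadListing returns a previously saved listing if it is younger than maxAge.
func loadListing(path string, maxAge time.Duration) ([]string, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, false
	}
	var c cachedListing
	if err := json.Unmarshal(data, &c); err != nil || time.Since(c.FetchedAt) > maxAge {
		return nil, false // unreadable or stale: force a real re-listing
	}
	return c.Chunks, true
}

// saveListing writes the freshly fetched listing for the next job to reuse.
func saveListing(path string, chunks []string) error {
	data, err := json.Marshal(cachedListing{FetchedAt: time.Now(), Chunks: chunks})
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0600)
}

func main() {
	const cachePath = "chunk-listing.json" // hypothetical cache location
	chunks, ok := loadListing(cachePath, 30*time.Minute)
	if !ok {
		chunks = []string{"aabbcc", "ddeeff"} // stand-in for the slow remote listing
		_ = saveListing(cachePath, chunks)
	}
	fmt.Printf("working with %d chunks\n", len(chunks))
}
```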

Thoughts?

This would break concurrency scenarios (backing up from multiple machines to the same storage). Maybe it could be done with an additional -exclusive mode, but that would add more opportunities for failure with no benefit in return: backup, let alone backup maintenance, is a low-priority background task. It does not matter how fast it is as long as it completes between invocations, which is always achievable for maintenance tasks.

Not really, at least it shouldn't. Duplicacy should already handle cases like this, since concurrent operations can happen at any point. If you list all the chunks at the beginning of an operation, there are absolutely no guarantees that the listing will stay the same for the remaining duration of the operation; it doesn't lock the storage. No -exclusive is needed because, at worst, a different operation might upload the same chunk with the same content. Prune is different, as it is the only operation that removes data from the storage instead of just adding to it. So whether you read the chunk listing at the beginning of an operation from storage or from a cache, there needs to be an understanding that it is not necessarily the most recent state of the storage.
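
As a toy model of why a stale listing is harmless for backup-style operations (this is just my mental picture with made-up names, not the real code): chunks are content-addressed, so the worst a stale listing can cause is a redundant check or upload of bytes identical to what is already there.

```go
// Assumed model, not Duplicacy's actual code: two writers that race on the
// same chunk just write identical bytes under the same content-derived ID.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkID derives the storage name from the chunk contents (content addressing).
func chunkID(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// uploadIfMissing consults a possibly stale listing first, then the storage
// itself. A stale listing costs at most one redundant existence check or one
// upload of bytes identical to what is already stored.
func uploadIfMissing(storage map[string][]byte, listing map[string]bool, data []byte) {
	id := chunkID(data)
	if listing[id] {
		return // listing says it exists; nothing to do
	}
	if _, ok := storage[id]; ok {
		return // listing was stale, but the chunk is already there
	}
	storage[id] = data // idempotent: same content always lands under the same ID
}

func main() {
	storage := map[string][]byte{}
	staleListing := map[string]bool{} // listed before another writer uploaded
	data := []byte("some chunk contents")
	storage[chunkID(data)] = data // another machine uploads concurrently
	uploadIfMissing(storage, staleListing, data)
	fmt.Println("chunks in storage:", len(storage)) // still 1, no harm done
}
```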

Having said that, I might be mistaken about the whole chunk list being re-read every time. It seems this is only done for the first revision of a particular snapshot (and I've been making quite a few of those for testing), while on subsequent backup runs the full listing is not executed, which cuts down execution time quite a bit (a full listing can take more than half an hour for a 2-3 TB repository). Check operations still need the whole listing, but that's fine, as there are probably not that many use cases for running multiple check commands against the same storage.
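
My guess at the logic behind this, sketched below, is pure speculation about the internals (`knownChunks` and the other names are invented): only a backup with no previous revision to start from needs the full listing, while an incremental one can seed its known-chunk set from the last snapshot.

```go
package main

import "fmt"

type revision struct {
	Chunks []string // chunks referenced by this revision
}

// knownChunks returns the chunk set a backup can start from. Only when no
// previous revision exists does it fall back to the expensive full listing.
func knownChunks(lastRevision *revision, listAllChunks func() []string) map[string]bool {
	known := map[string]bool{}
	if lastRevision == nil {
		for _, id := range listAllChunks() { // first revision: full listing
			known[id] = true
		}
		return known
	}
	for _, id := range lastRevision.Chunks { // incremental: no full listing
		known[id] = true
	}
	return known
}

func main() {
	listAll := func() []string { return []string{"aa", "bb", "cc"} }
	fmt.Println(len(knownChunks(nil, listAll)))                               // 3: full listing used
	fmt.Println(len(knownChunks(&revision{Chunks: []string{"aa"}}, listAll))) // 1: reused from last revision
}
```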

Correct, by checking for the presence of a specific chunk on the target, not by caching its existence. It only caches the contents of the chunk: once a chunk is verified to be present on the server and is cached locally, the download can be avoided.
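
Roughly like the sketch below, I assume; the names and layout are mine, not the real cache format. The local content cache sits in front of the download, so a chunk that has already been fetched once never needs to be downloaded again.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// fetchChunk returns chunk contents, preferring the local cache and falling
// back to the remote download only on a cache miss.
func fetchChunk(cacheDir, id string, download func(id string) ([]byte, error)) ([]byte, error) {
	cachePath := filepath.Join(cacheDir, id)
	if data, err := os.ReadFile(cachePath); err == nil {
		return data, nil // cache hit: no download needed
	} else if !errors.Is(err, os.ErrNotExist) {
		return nil, err
	}
	data, err := download(id) // cache miss: go to the remote storage
	if err != nil {
		return nil, err
	}
	// Best effort: populate the cache for the next operation.
	_ = os.WriteFile(cachePath, data, 0600)
	return data, nil
}

func main() {
	download := func(id string) ([]byte, error) { return []byte("chunk bytes for " + id), nil }
	data, err := fetchChunk(os.TempDir(), "abcd1234", download)
	if err != nil {
		panic(err)
	}
	fmt.Println("got", len(data), "bytes")
}
```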

This (checking for chunk presence on the target) is what makes prune (especially an exhaustive one) so ridiculously slow with some remotes. But there is no hurry with prune, so it's a non-issue.
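
For a sense of scale, here is a back-of-the-envelope comparison with made-up latencies and sequential requests: a per-chunk existence check for every chunk costs hours of round trips, while a paged bulk listing of the same million chunks takes minutes.

```go
// Illustration with assumed numbers only, not measurements of any real remote.
package main

import (
	"fmt"
	"time"
)

func main() {
	const chunks = 1_000_000
	perCheck := 50 * time.Millisecond    // one round trip per "does chunk X exist?" call
	listPerPage := 200 * time.Millisecond // one paged listing request...
	pageSize := 1000                      // ...returning this many entries

	checkTotal := time.Duration(chunks) * perCheck
	listTotal := time.Duration(chunks/pageSize) * listPerPage
	fmt.Printf("per-chunk checks: %v\n", checkTotal) // roughly 14 hours
	fmt.Printf("bulk listing:     %v\n", listTotal)  // a few minutes
}
```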