Check -chunks batching

Hi, happy user of duplicacy for many years now. Checking in today because I have a suggestion: I recently added unlimited cloud storage with free egress and would like to leverage that by adding a periodic check -chunks. Now, we’re talking TBs, so it’s not feasible to do the first check all at once, but checking a few GBs along with every backup (nightly backups) would add up to a significant amount in just a few months. So, would it be possible to limit the check command to, say, either a fixed number of chunks or a fixed amount of data?

  1. Why do you need to check chunks in the first place? If you don’t trust your storage provider to maintain data integrity, change providers ASAP.
  2. Duplicacy keeps a log of chunks that were checked and, if interrupted, will resume the next time it runs. So this already works today: schedule it to run daily, and it will eventually plow through the entire datastore, after which it will effectively be checking only newly uploaded chunks.
  1. It would just be to ensure that the data was transferred correctly in the first place; I don’t plan to delete the verified_chunks file periodically to recheck chunks.
  2. Yes, but how do I instruct the check command to stop? If not stopped, it would keep running until all chunks are checked, no? I assume I would have to set up some script that actively kills the check command (roughly like the sketch below), which seems like a dirty hack. Having a “maximum number of chunks” option would be the cleanest solution IMO.
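For illustration only, the workaround I mean would look roughly like this: a small wrapper that gives the check a fixed time window and then kills it, relying on the resume-from-verified_chunks behaviour described above. This is a sketch under those assumptions; `duplicacy check -chunks` is the real command, but the repository path and the two-hour window are placeholders.

```go
// check_window.go: run a bounded chunk check, then stop until the next scheduled run.
// Sketch only; assumes the interrupted check resumes from verified_chunks next time.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Give the check a fixed window (e.g. 2 hours per nightly run).
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
	defer cancel()

	cmd := exec.CommandContext(ctx, "duplicacy", "check", "-chunks")
	cmd.Dir = "/path/to/repository" // placeholder: an initialized repository
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Run()
	if ctx.Err() == context.DeadlineExceeded {
		// The process was killed at the deadline; the next run picks up
		// where this one left off (per the resume behaviour above).
		log.Println("check window elapsed; stopping until the next scheduled run")
		return
	}
	if err != nil {
		log.Fatalf("check failed: %v", err)
	}
	log.Println("all chunks verified")
}
```

Scheduling something like this nightly would chip away at the backlog, but it still kills the process from the outside rather than letting it finish a batch cleanly, which is exactly why a built-in limit feels cleaner to me.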
  1. OK, this is not a bad idea; API-related issues are not unheard of.

  2. Why do you want it to ever stop before it’s done checking? What is the benefit of not running the check, or of running it in bursts? I’m wondering: bandwidth management? Or time of day, e.g. run it only at night? Limiting by the number of chunks won’t accomplish either.

What would perhaps be useful is indeed a time-of-day schedule, to avoid using bandwidth outside an allowed window.

Including for backup: for example, keep hourly snapshots but upload them only during a specific window when bandwidth usage is allowed. I don’t think any other backup tool fully does this today.

I would like to stop it because I don’t want it to run continuously for weeks to check the multiple TBs of data present in the storage. The idea is to run it along with the nightly automated tasks I have already set up, have it check a small subset of the remaining chunks every time, and eventually it would catch up and have the entire storage checked (plus all the newer chunks uploaded since). In this context, an optional setting for the check command that stops execution after n chunks have been checked seems like the cleanest implementation to me.
For example: check -chunks -chunk_limit 5000, where every run would check only 5000 chunks and then terminate.

I understand what you mean, but the number of chunks is not a usable metric and is in fact an implementation detail. Users don’t (and should not) think in terms of numbers of chunks. They should not even know chunks are a thing.

Based on what you described, you may actually want these two things instead:

  • limit the amount of data egressed
  • limit the duration of the check, or confine it to specific time windows.

Computing the number of chunks you would need to specify to fulfill either of those goals is very hard, if not impossible.

I would suggest adding -max-egress 500GB -max-duration 4h instead.

This would actually be helpful for restore as well: some remotes have tiered pricing, where you may benefit financially from restoring more slowly in smaller batches.
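To make the intent concrete, here is a purely hypothetical sketch (not Duplicacy’s actual code; the Chunk type and verifyWithBudget function are made up for illustration) of how egress and duration budgets could bound a verification loop while still letting a later run resume where this one stopped:

```go
// Hypothetical sketch of a budget-limited verification loop.
package main

import (
	"fmt"
	"time"
)

// Chunk stands in for whatever the real verifier iterates over.
type Chunk struct {
	ID   string
	Size int64 // bytes that must be downloaded to verify this chunk
}

// verifyWithBudget checks chunks until either the egress or the duration
// budget is exhausted, then stops cleanly and reports how far it got so a
// later run can continue from the log of already-verified chunks.
func verifyWithBudget(chunks []Chunk, maxEgress int64, maxDuration time.Duration,
	verify func(Chunk) error) (checked int, egress int64, err error) {

	deadline := time.Now().Add(maxDuration)
	for _, c := range chunks {
		if egress+c.Size > maxEgress || time.Now().After(deadline) {
			break // budget spent; stop cleanly, the next run continues
		}
		if err = verify(c); err != nil {
			return checked, egress, err
		}
		checked++
		egress += c.Size
	}
	return checked, egress, nil
}

func main() {
	// Three 4 MB chunks against an 8 MB egress budget: only two get checked.
	chunks := []Chunk{{"a", 4 << 20}, {"b", 4 << 20}, {"c", 4 << 20}}
	n, used, err := verifyWithBudget(chunks, 8<<20, 4*time.Hour,
		func(c Chunk) error { return nil }) // placeholder verifier
	fmt.Println(n, used, err)
}
```

The point is that both budgets map directly to things users actually care about (bandwidth cost and wall-clock time), whereas a chunk count only maps to them indirectly through chunk sizes the user cannot see.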

It does not matter whether it’s easy or hard to implement; ease of implementation should never drive the feature set. I strongly believe that user experience is paramount and developer suffering is irrelevant. (No offense to the Duplicacy developer; I’m a developer myself.)


Well, limiting the duration and the amount of data seems reasonable, as chunk sizes can vary.

Indeed, I would really need some options like that too, as I have to move a multi-TB backup from one provider to another, with a final check to confirm the transfer was correct.

Certainly, I could hack around it and stop the check after, e.g., a certain time, but I’d prefer to have options available that put such restrictions on copy and check, and maybe even on restore.

BTW, on Windows there seems to be an issue with stopping the check process from the UI. If you do so, verified chunks are not added to the verified_chunks file, so when the check is started again on a new repo it starts from the very beginning, which is pretty bad in the case of a multi-TB repo.