Prune Default ID and other Prune Questions

jnkmail11 · 3 June 2019 23:55

I’ve read the prune command details post as well as the quick-start CLI guide and have some questions I’m hoping someone can help me with. In particular:

1.) It sounds like ‘duplicacy prune’ does not delete any snapshots, but is used to clean unused chunks and fossils. Is this correct? I’m confused because on the prune command details page it says the -id option deletes “snapshots with the specified id instead of the default one,” which runs counter to this impression. Also, what is the default id?

2.) Assuming the purpose of ‘duplicacy prune’ is to clean unused chunks and fossils as mentioned in 1.) above, will ‘duplicacy prune -keep x:y’ also do this if run enough, or do I need to eventually run just ‘duplicacy prune’ to get rid of those chunks and fossils?

3.) I’m trying to gauge how often I should prune. Originally I was planning to run it after every backup, which now seems naive based on forum comments about how long pruning takes. It seems like the prune command is run locally in that the data must be downloaded to the client first for processing there, and I’m guessing that’s the bulk of the overhead. Can anyone give me a sense of what % of the total used remote storage capacity has to be downloaded to the client for pruning?

Thanks!

TheBestPessimist · 4 June 2019 05:06

The first line in prune details says the following:

In terminology we are using both revisions and snapshots to refer to the same thing: all the files created when you run the backup command (I will refer to this as a revision from here on). This naming has always been a constant source of confusion (even for us, staff), I know.

What the -id parameter does is select a specific snapshot. A snapshot is the whole set of backups (revisions) of a particular repository, and a repository is a folder on your local computer.

So let’s say you have 2 folders that you wanna back up: C:/work and D:/misc. You will init a repository in both of these folders, and when you init you are asked to also provide a unique “snapshot” name. This is the name by which Duplicacy knows which folder you wanna back up; it doesn’t care about anything else except for this “snapshot” (also found as “snapshot id”, hence the -id option for every command).

What prune does firstly is check which of the existing revisions (for each individual snapshot) is ready to be deleted (pruned). If any revisions are found, they are only marked for deletion. The next time prune runs it will find these marked for deletion and really delete them.

Duplicacy then continues to look for other revisions which can be deleted, and only marks them, which is step 1 all over again. This is why pruning is done with a two-step fossil collection algorithm.
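The mark-now, delete-next-run behaviour can be pictured with a small toy model. This is only a sketch under loud assumptions: `ToyStorage`, its fields and the chunk names are all made up for illustration, it tracks unreferenced chunks (the "fossils" mentioned in the first post), and it ignores the extra safety conditions the real algorithm checks before deleting anything.

```python
class ToyStorage:
    """Hypothetical in-memory stand-in for a storage, not Duplicacy's code."""

    def __init__(self, chunks):
        self.chunks = set(chunks)   # chunks still in normal use
        self.fossils = set()        # chunks marked by an earlier prune run

    def prune(self, referenced):
        # First, really delete whatever the previous run marked.
        deleted = sorted(self.fossils)
        self.fossils.clear()
        # Then mark chunks that no remaining revision references.
        unreferenced = self.chunks - set(referenced)
        self.chunks -= unreferenced
        self.fossils |= unreferenced
        return deleted


storage = ToyStorage({"a", "b", "c"})
first_run = storage.prune(referenced={"a"})    # only marks "b" and "c"
second_run = storage.prune(referenced={"a"})   # now deletes "b" and "c"
print(first_run, second_run)
```

In this model the first run deletes nothing and the second run deletes what the first one marked, which is why a single prune run never frees everything at once.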

If you only have a single repository, then Duplicacy allows you to not give a snapshot id, and it provides one for you, called default (that comes as a shock, right?). This is what Duplicacy uses when it refers to “default snapshot” or “default snapshot id”.

Since you read the whole prune #how-to, you should have also read the section: Only one repository should run prune.

The default you can use, depending on how much storage you want to use for old revisions, is:

duplicacy prune -all -threads 30 -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7

and you only run prune from a single repository.
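To see what a chain of -keep options does, recall that -keep n:m means "keep one revision every n days for revisions older than m days", and n = 0 means "delete everything older than m days". The sketch below simulates that rule in plain Python. It is a simplified model, not Duplicacy's actual implementation: `simulate_prune` and `parse_keep` are hypothetical names, and the exact boundary handling (for example whether a revision exactly m days old counts as "older") is an assumption.

```python
from datetime import datetime, timedelta


def parse_keep(option):
    """Turn a string like '7:30' into (interval_days, min_age_days)."""
    n, m = option.split(":")
    return int(n), int(m)


def simulate_prune(revision_times, keep_options, now):
    """Return (kept, deleted) revision timestamps under a simplified -keep model."""
    # Check the policy with the largest age threshold first.
    policies = sorted((parse_keep(o) for o in keep_options),
                      key=lambda p: p[1], reverse=True)
    kept, deleted = [], []
    last_kept, last_policy = None, None
    for t in sorted(revision_times):            # oldest revision first
        age_days = (now - t).days
        policy = next((p for p in policies if age_days >= p[1]), None)
        if policy != last_policy:               # entering a new age band
            last_kept, last_policy = None, policy
        if policy is None:                      # younger than every threshold
            kept.append(t)
        elif policy[0] == 0:                    # n = 0: delete everything this old
            deleted.append(t)
        elif last_kept is not None and t - last_kept < timedelta(days=policy[0]):
            deleted.append(t)                   # too close to the last kept one
        else:
            kept.append(t)
            last_kept = t
    return kept, deleted


now = datetime(2019, 6, 4, 12, 0)
daily = [now - timedelta(days=d) for d in range(400)]   # 400 daily revisions
kept, deleted = simulate_prune(daily, ["0:360", "30:180", "7:30", "1:7"], now)
print(len(kept), "kept,", len(deleted), "deleted")
```

With 400 daily revisions, this toy run keeps 58 and deletes 342: everything from the last week, dailies up to 30 days, weeklies up to 180 days, roughly monthlies up to 360 days, and nothing older.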

I think pruning once a week, or once every 2 weeks, is enough, but again: this depends on how often you do new backups (new revisions). Pruning time is very dependent on 1. the number of revisions existing in all the snapshots and 2. the number of files which exist in those backups.

Again: this depends on the size (files and GB) of your backups, so this may vary greatly. IMO though, you should never ever need to download more than 5% of the total size of the backup (plus this data is cached afterwards), and in my case I don’t think Duplicacy has ever downloaded 1GB of data for my biggest repository (1.3TB).

Duplicacy doesn’t download the data chunks, it only downloads the meta-data chunks (those which store information about the backup) so it knows which, what, how and who to prune.

jnkmail11 · 4 June 2019 09:05

Thanks TheBestPessimist, that was very helpful. Indeed, I had some terminology confusion regarding snapshot vs revision and this cleared it up.

Thanks for mentioning this. Previously, I thought I correctly understood that section to mean don’t run multiple prunes concurrently if the repositories share the same storage, but now that seems wrong. Just to make sure I understand, is the repository that is running prune the one to which the current directory belongs? Thus, if I always run prune from the same directory, even for different repositories, then I should be fine?

Now the purpose of the id option makes sense. Before, I was specifying the snapshot id by changing the current directory to each repository’s before running prune. Hopefully this didn’t mess anything up – I’ve only been doing it for a few days. Would you recommend remaking my repositories if check shows no problems?

Thanks!

TheBestPessimist · 4 June 2019 09:11

Yes, that is ok. Although: just run prune once with -all. That’s the most efficient way to do it.

As long as check says everything is all right, there’s nothing extra you have to do.
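If you script this, a single prune run from one fixed repository directory could look roughly like the following. It is only a sketch: `build_prune_command` and `run_prune` are hypothetical helpers, the directory you pass in is your own choice, and the only flags used are the ones already shown in this thread (-all, -threads, -keep).

```python
import subprocess


def build_prune_command(keep_options, threads=None):
    """Build the argument list for one `duplicacy prune -all` run."""
    argv = ["duplicacy", "prune", "-all"]
    if threads is not None:
        argv += ["-threads", str(threads)]
    for option in keep_options:          # keep the order given by the caller
        argv += ["-keep", option]
    return argv


def run_prune(repository_dir, keep_options, threads=None):
    """Run prune once, always from the same repository directory."""
    command = build_prune_command(keep_options, threads)
    return subprocess.run(command, cwd=repository_dir, check=True)


command = build_prune_command(["0:360", "30:180", "7:30", "1:7"], threads=30)
print(" ".join(command))
```

Calling `run_prune` with the same directory every time keeps all pruning tied to one repository, as recommended above.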