Understanding pruning

zizheng.tai · 25 August 2019 17:29

Hi all, I have a conceptual question about pruning. I read the “Prune command details” post and searched the forum, but neither answered my question.

Say I have a repo that has three files inside, a, b, and c. I run the first backup of my repo with the following filters file:

-b
-c

This should create a snapshot/revision #1 that contains only a.

I then modify the filters file to

-c

and back up again. This should create a snapshot #2 that contains a and b, but only b is newly uploaded.

My question is what if I now prune snapshot #1? Will both a and b still be intact in snapshot #2, or will a be missing because the snapshot in which it was newly uploaded was pruned?

In a more realistic scenario, if I incrementally back up all my files by starting with a very restrictive filters file and expanding the allowed file list over revisions, by the time I finally have every file I want in a snapshot, can I safely prune all but the last revision and keep all the files added in those pruned revisions?

Or, in another scenario, if I use the -k option to keep no revisions older than say 30 days, will the files added in those pruned old revisions still be present in the latest revisions?

Droolio · 25 August 2019 22:00

This really comes down to an understanding of what a snapshot is…

Every file that was listed in that snapshot will be kept if you prune all prior revisions. Even if those files weren’t modified and incrementally backed up (uploaded) in the latter snapshot at that time, they’re still included as part of the snapshot.

As an example, say you have in your example a, b, and c with c filtered out.

If you modify no files and run a backup, you’ll create another snapshot (with a and b included). No new chunks were uploaded but you still created a snapshot. Let’s run another backup. A new snapshot is created, again no modifications were made. Now you prune all but the last snapshot. That snapshot still includes a and b (as they were when the snapshot was taken). Hopefully, this should answer your second and last questions.

zizheng.tai · 26 August 2019 00:30

I see, so as long as a file is conceptually included in a snapshot X, no matter in which other snapshot the file was actually uploaded, it will always be in X no matter what I do with other snapshots. Is that right?

Droolio · 26 August 2019 09:29

Yup! It wouldn’t make much sense for a snapshot to include only the files that were newly uploaded. The reason a snapshot can include all prior files is because of de-duplication. The chunks that represent the data and metadata have already been uploaded, so it doesn’t take any extra space to reference them again.
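A toy sketch of the a/b/c example from the first post may help here. It assumes a snapshot is a complete file listing whose entries reference chunks, and that pruning only discards chunks no surviving snapshot references. The names `storage`, `snapshots`, and `prune` are illustrative and are not Duplicacy internals.

```python
# Toy model: a snapshot is a complete file listing; each file points at chunks.
# All names here are illustrative, not Duplicacy's real data structures.

def prune(snapshots, storage, revision):
    """Delete one revision, then drop chunks no surviving revision references."""
    del snapshots[revision]
    referenced = {
        chunk
        for files in snapshots.values()
        for chunks in files.values()
        for chunk in chunks
    }
    for chunk in list(storage):
        if chunk not in referenced:
            del storage[chunk]


# Chunk store: chunk id -> data
storage = {"chunk-a": b"contents of a", "chunk-b": b"contents of b"}

snapshots = {
    1: {"a": ["chunk-a"]},                     # filters: -b, -c
    2: {"a": ["chunk-a"], "b": ["chunk-b"]},   # filters: -c (only chunk-b was uploaded)
}

prune(snapshots, storage, 1)

# Restore everything listed in snapshot #2 from the surviving chunks.
restored = {
    name: b"".join(storage[c] for c in chunks)
    for name, chunks in snapshots[2].items()
}
print(sorted(restored))  # ['a', 'b']
```

Snapshot #2 still lists a, and the chunk for a is still referenced, so pruning snapshot #1 removes nothing that #2 needs.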

Arty.R · 26 August 2019 16:55

I also have a question about this command.
If I set the options
-keep 0:360 -keep 7:90 -keep 30:180
then -keep 0:360 means “no snapshots older than 360 days”.

Will files older than 360 days also be removed from the backup?

gchen · 26 August 2019 18:34

It means backups older than 360 days will be removed. The prune command only works on backups as a whole; it doesn’t look into individual files contained in each backup.

zizheng.tai · 26 August 2019 18:56

In other words, as long as a file is included in a snapshot within the latest 360 days (i.e. not pruned), it will be kept, right?

gchen · 26 August 2019 20:56

That is correct…

Arty.R · 27 August 2019 05:48

Let me check that I understood correctly.
I have a directory where some of the files were deleted.
Does that mean the deleted files will no longer be in the backup once it is older than 360 days?
And if I want to be able to recover files deleted more than 360 days ago, should the prune option look like this?

-keep 1:360 -keep 7:90 -keep 30:180

towerbr · 27 August 2019 12:59

Just a little nomenclature correction to keep users aligned:

“revisions older than 360 days will be removed”


Arty.R · 27 August 2019 16:53

OK, let it be revisions )
But my question still stands: if I deleted a file from a folder more than 360 days ago, will it no longer be included in the backup with this prune scheme?
-keep 0:360 -keep 7:90 -keep 30:180

And is this what I need instead?
-keep 1:360 -keep 7:90 -keep 30:180

TheBestPessimist · 29 August 2019 12:07

That’s wrong: the -keep n:m options have to be ordered by m, decreasing (see Prune command details).

So the correct way to -keep would be

-keep 0:360 -keep 30:180 -keep 7:90

Warning, I could be talking crap here!

-keep 1:360 is understood as: keep a revision every 1 day, for all the revisions older than 360 days.

So you do all the previous prunes (-keep 30:180 -keep 7:90), and this final -keep will store all the revisions that were left untouched.

In the end, under these 3 -keep conditions, it means that after 360 days you would have 1 revision stored forever every 30 days.

Droolio · 29 August 2019 13:59

Yup, there’s literally no point putting that -keep 1:360 in there - it does nothing, as the prior -keep 30:180 rule is the biggest interval. You need to choose a bigger interval or 0 to keep nothing. (While you have to order them by m/age decreasingly, the n/interval only makes sense if they decrease in order too.)
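To make the interaction between the rules concrete, here is a simplified simulation of -keep n:m as it is described in this thread: one backup a day, with a prune after every backup. It is a sketch, not Duplicacy’s actual algorithm. The `age >= m` boundary, the oldest-first comparison against the previously kept revision, and the function names are all assumptions.

```python
def prune(created, today, rules):
    """One prune run. `created` holds the creation days of existing revisions;
    `rules` is a list of (n, m) pairs ordered by m, decreasing."""
    kept = []
    last_kept = None  # creation day of the last revision a rule decided to keep
    for day in sorted(created):  # oldest revision first
        age = today - day
        # First rule whose age threshold m this revision has reached, if any.
        interval = next((n for n, m in rules if age >= m), None)
        if interval is None:
            kept.append(day)      # younger than every m: left alone
        elif interval == 0:
            continue              # n = 0: nothing this old is kept
        elif last_kept is not None and day - last_kept < interval:
            continue              # too close to the previously kept revision
        else:
            kept.append(day)
            last_kept = day
    return kept


def simulate(rules, days):
    """Back up once a day, prune after every backup, return the final ages."""
    created = []
    for today in range(days):
        created.append(today)
        created = prune(created, today, rules)
    return sorted(days - 1 - d for d in created)


def gaps(ages):
    """Differences in days between consecutive revisions."""
    return [b - a for a, b in zip(ages, ages[1:])]


with_1_360 = simulate([(1, 360), (30, 180), (7, 90)], 800)
with_0_360 = simulate([(0, 360), (30, 180), (7, 90)], 800)

old = [a for a in with_1_360 if a >= 360]
print("oldest age with -keep 1:360:", max(with_1_360))
print("smallest gap among revisions older than 360 days:", min(gaps(old)))
print("oldest age with -keep 0:360:", max(with_0_360))
```

In this model, -keep 1:360 never thins anything further, because whatever reaches 360 days has already been spaced at least 30 days apart by -keep 30:180. Only -keep 0:360 actually removes the old revisions.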

Droolio · 29 August 2019 14:12

I kinda get what @Arty.R might be trying to accomplish…

CrashPlan had this option where you could tell it ‘never remove deleted files’ which I guess meant old revisions would get pruned but there’d be at least one copy. The issue is, Duplicacy can’t do this because pruning is done on a snapshot level.

Personally, I don’t think that’s a problem. CrashPlan’s feature was iffy at best and, from a design perspective, hard to implement and the extra storage requirements could get crazy.

What defines ‘a file’ exactly? What happens if that file is renamed? You’d have to properly detect file deletions at the least. CrashPlan couldn’t always do this, hence it had a (sloooow) verification scan that ran daily.

You accidentally rip a 4K Bluray onto your desktop and it gets backed up - you didn’t want that, but how do you remove it from the backup? CrashPlan had a way to do it, but it was quite fiddly compared to just being able to delete a snapshot.

Droolio · 29 August 2019 14:24

One thing you could do with Duplicacy is set up a special repository - a recycle bin of sorts (maybe the actual recycle bin?). Deleted files would go in there and be backed up. De-duplication would take care of the disk space.

So long as they continue to be referenced by at least one revision, it doesn’t matter that they get removed from the other repositories or even from the bin repo after a backup. So if you don’t prune that repository, those snapshots (files) can stay in the backup storage.

Now the slight issue is getting Duplicacy to omit that one repository when doing a prune. Sadly, you’d have to prune all other repositories one-by-one instead of using -all. That’s a lot of extra work. Maybe the parameters could be improved to allow multiple -ids, exclusions etc.? Or perhaps different retention periods applied directly to each repository - saved in the preferences file, like we discussed for the set rate-limit options.
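Until something like that exists, the one-by-one pruning can at least be scripted. The sketch below only builds and prints one prune command per repository, skipping the bin repository. The repository ids and the retention rules are hypothetical placeholders; `-id` and `-keep` are the existing prune options.

```python
import shlex

# Hypothetical repository (snapshot) ids; "recycle-bin" is the one to leave unpruned.
repo_ids = ["laptop-docs", "laptop-photos", "recycle-bin"]
never_prune = {"recycle-bin"}
keep_rules = ["0:360", "30:180", "7:90"]  # ordered by m, decreasing


def prune_commands(ids, skip, rules):
    """Build one `duplicacy prune -id ...` command per repository not in `skip`."""
    keep_args = [arg for rule in rules for arg in ("-keep", rule)]
    return [
        ["duplicacy", "prune", "-id", repo_id, *keep_args]
        for repo_id in ids
        if repo_id not in skip
    ]


commands = prune_commands(repo_ids, never_prune, keep_rules)
for cmd in commands:
    print(shlex.join(cmd))
```

The commands are printed rather than executed, so they can be reviewed first and then run from inside an initialized repository.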

Arty.R · 30 August 2019 14:06

Now I’m completely confused))

So please tell me how this command should be written so that I can recover a file that was deleted, for example, two years ago.
In addition, I need a daily copy for three months and a weekly copy for anything older than three months.

Droolio · 30 August 2019 14:42

Duplicacy currently cannot guarantee to keep deleted files forever if you prune by retention policy (-keep) - not unless you employ some trick like I mentioned in my last post.

At the end of the day, the thing you need to understand is that backups are stored in snapshots of files.

Pruning deals with removing snapshots, not individual files. Thus deleted files merely no longer appear in subsequent snapshots. And pruning all earlier snapshots will remove all history of such deleted files.

-keep 7:90 -keep 1:1

Edit: This rule will keep weekly snapshots forever, the only way to return a deleted file from years ago.
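A small sketch of why the upper age limit matters for deleted files. It assumes a file can be restored only from a surviving snapshot taken while the file still existed. The day numbers and the helper name are made up for illustration.

```python
def snapshots_containing(existed_from, deleted_on, snapshot_days):
    """Days of the surviving snapshots that were taken while the file existed."""
    return [d for d in snapshot_days if existed_from <= d < deleted_on]


today = 800
weekly = list(range(0, today + 1, 7))

# Weekly snapshots kept forever (no n = 0 rule).
weekly_forever = weekly
# Same snapshots, but everything 360 days or older removed (a -keep 0:360 style cap).
capped = [d for d in weekly if today - d < 360]

# A file that existed from day 0 and was deleted on day 100, about two years ago.
print(snapshots_containing(0, 100, weekly_forever)[:3])  # [0, 7, 14]
print(snapshots_containing(0, 100, capped))              # []
```

With the cap in place, every snapshot that ever listed the file is gone, so there is nothing left to restore it from.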

Arty.R · 30 August 2019 14:49

Do I understand correctly that with this option all snapshots will be kept, but those older than 90 days will be thinned to one per week?

-keep 7:90 -keep 1:1

akvarius · 31 August 2019 00:21

I think @Droolio has it right, and -keep 7:90 means keep one revision (of snapshot) per 7 days, so you will have one snapshot (revision) a week for snapshots older than 90 days. Younger snapshots are saved except if you add more -keep x:y options with smaller y.

I think -keep 1:1 is only necessary if you make multiple backups per day and only want to keep one per day. If you make one backup per day you can skip that option, since it will not make your younger snapshots more safe. (They are already safe for 90 days)

The important take-away from this is what @Droolio says, and which is a bit unfamiliar compared to CrashPlan:

This needs to be said strongly so users don’t get surprised:
When you start using prune, you will lose some deleted files and original versions of changed files.
This means that -keep 7:90 might be too aggressive for your need.
If 90 days is enough time to discover a need and restore a file, then you are OK!
If you would rather give yourself a one-year safety margin, then use -keep 7:365 (for instance).
If you can’t afford to lose any originals then you should probably not prune.
(All of this is subject to user cases of course, including storage limits too)
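The point about losing original versions of changed files can be illustrated with a tiny sketch. It assumes daily snapshots that are later thinned to one per week. The edit days are invented, and `version_at` is a hypothetical helper.

```python
import bisect

# Day on which each version (v1..v4) of one file was saved; hypothetical values.
edit_days = [0, 100, 102, 110]


def version_at(day):
    """1-based number of the version that was current on `day`."""
    return bisect.bisect_right(edit_days, day)


def restorable_versions(snapshot_days):
    """Versions captured by at least one of the given snapshots."""
    return sorted({version_at(d) for d in snapshot_days})


daily = range(0, 120)       # before thinning: one snapshot per day
weekly = range(0, 120, 7)   # after thinning: one snapshot per 7 days

print(restorable_versions(daily))   # [1, 2, 3, 4]
print(restorable_versions(weekly))  # [1, 3, 4]
```

Version 2 only existed on days 100 and 101, between two weekly snapshots, so it becomes unrecoverable once the daily revisions from that period are thinned out.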

Please don’t get me wrong, I love it, and the snapshot-based design is what makes it very elegant, clean, tidy, and fast!

Arty.R · 31 August 2019 12:20

Is it necessary to have
-keep 1:1
in the keep options?

Doesn’t this alone
-keep 7:90
already mean the following?
Keep 1 snapshot every 1 day(s) if not older than 90 day(s)
Keep 1 snapshot every 7 day(s) if older than 90 day(s)