Do different prune -keep periods interfere with each other?

Continuing the discussion from Confused with prune retention policy:

Let’s take this part of the complex discussion into this separate topic:

Indeed, it would be good if @gchen could clarify.

I think we need to clarify what “independent” means here. If it means what you suggest (that running multiple keep options in one command is the same as running the individual commands one after another), then the different time periods do not respect each other's boundaries (as @towerbr suspects), and so they may well interfere, precisely because they run independently.

However, such interference won’t happen in practice as long as your keep options always specify a lower snapshot frequency for older snapshots. For example, -keep 7:30 -keep 1:14 should be fine, but not -keep 1:30 -keep 7:14.

If duplicacy doesn’t already throw a warning in the latter case, I think it should do so.

I think I got the logic (my best guess): the options should not directly interfere with each other. Since -keep options must be specified starting from the largest age, an earlier option has already thinned out the backups in its wider range before the later, smaller-age options come into play. Those later options therefore have nothing left to do in the earlier ranges and only take effect within their own ranges.

You’re absolutely right.

As a quick test, I ran -keep 1:30 -keep 7:14 -dry-run and it only deleted 15 snapshots of my year-long supply of backups. With just -keep 7:14, it deletes 267 revisions. -keep 1:30 alone deletes just 5 revisions (it seems some of my backups were run ad hoc throughout the day), but all 5 feature in the combined prune of 15 deletions.

So if you do run them individually, make sure the interval:age makes sense (-keep 1:30 -keep 7:14 definitely doesn’t make sense :slight_smile: ). While they need to be sorted in decreasing order of age, the same rule should probably apply to the interval, with 0:age being an exception.

I’m quietly confident, however, that if you have a well-ordered bunch of -keep options, you can run them individually in any order and the end result will be the same as if they were run combined.


Good point. I suggested a warning, but there isn’t really any reason not to enforce this.

No. See my answer in the other thread:

Yes, when a new retention period starts, the first snapshot is always kept and used as the base. Then subsequent snapshots are checked one by one. If a snapshot is too close, then it will be deleted. Otherwise it will be kept and become the new base.


So just to confirm my understanding (of the two topics).

If I have had a backup running daily for the last 20 days, then I will have these daily revisions:

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 (today)

If I run (as a single combined execution):

duplicacy prune -keep 0:18 -keep 5:10 -keep 2:5

I will have for each “keep”:

-keep 0:18

      03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 

-keep 5:10

               06             11 12 13 14 15 16 17 18 19 20 

-keep 2:5

               06                12    14    16 17 18 19 20 

Or will the last one kill the “06” revision too?

			   
-keep 2:5

                                 12    14    16 17 18 19 20 

Why would it do that?

Because “06” is older than 5 days and has a “step” of 2 days.

In other words, will the prune command continue to check all “2-in-2” snapshots older than 5 days, even those already marked for deletion, or will it stop at the next “step” without valid snapshots (the step from 12 to 10) because it considers “ok, 10 is already marked for deletion, let’s stop here”?

EDIT:

I see now, you’re right, I made a basic mistake with even and odd. Allow me to reformulate and put it in a more direct way:

If I have this set of revisions:

             05                    12    14    16 17 18 19 20

And execute

duplicacy prune -keep 2:5

Will it kill the “05” revision? Or will it stop at the “09” revision, with “ok, 09 is already marked for deletion, let’s stop here”?

I took a quick look at the code in duplicacy_snapshotmanager.go and this is still not clear to me.

I haven’t looked at the code, but my reasoning was that 05 will not be deleted because the rule is to keep one snapshot for every two days (not to delete snapshots from every other day).

If (for whatever reason) you would also have snapshot 04 or 06, one of them would be deleted (and I believe it would be the older one of the two that would get deleted), but not in your scenario.

I wouldn’t be able to predict, however, what would happen if you had all three (04, 05, and 06) left. I’d say the optimal behaviour would be that only 05 would be deleted and 04 and 06 would be kept. But I’m not sure duplicacy is smart enough to guarantee that. Chances are it ends up deleting 04 and 06 while keeping 05… (not a big deal though, since it’s not a realistic scenario anyway).

The rule is to keep one snapshot every 2 days for snapshots older than 5 days, and 04, 05, 06, etc are older than 5 days.

100% agree!

My doubt remains: will Duplicacy scan all revisions back to the 1st and, when going through 05 (or 04, or 06, whatever falls within the time frame of the previous -keep), evaluate: “this one is not in the ‘every 2 days’ rule, let’s delete it”? Even if it has already been scanned by the previous -keep?

The main point here is: if you don’t set the time frames of successive -keep options well in the same prune command, you might think “okay, I’m keeping one backup every month” (for example), but that “one” backup from a month of last year can actually be erased by a -keep parameter that was meant for the recent weeks.

I see what you mean, but “every two days” doesn’t imply that you blindly delete every other day. Semantically, “keep a snapshot every two days” means that there is supposed to be a snapshot every two days, so you can only delete a snapshot if there is one on the day next to it (and I suppose that is exactly what duplicacy checks before deletion). Another way of saying this is that after pruning, the remaining snapshots should be two days apart.

So I see no way duplicacy would delete 05 in your scenario. If it did, it would be a serious bug. But, of course, it’s better to double-check with @gchen.

OK, I understand what you are describing. So if for any reason we have revisions like this:

          04 05 06                  12    14    16 17 18 19 20

and use the option -keep 2:5, it will delete 05, as it will take 06 as the reference and apply the “2-day step” rule, keeping 04 and 06.

Actually, I think Duplicacy starts from the first revision in its list (04 in your example) and works forwards, not backwards…

For each revision, it determines which retention period it’s in (-keep 2:5) and applies the rule.

04, being the first snapshot, should never get deleted unless a 0:x rule was applied beforehand. Then it evaluates 05 and sees that it’s within the 2 day interval and deletes it. At this stage 04 is still the ‘last kept’ snapshot, so when evaluating 06, that is also kept (2 days apart from 04) and [06] becomes the last kept snapshot for evaluating the time distance between further snapshots.

Without any other -keep, it should continue on to delete 17 and 19 in your example. Oops no, 17 and 19 are outside the retention age (:5) - they’ll certainly be kept!


Ok. It’s clearer now how it works. The reference revision is dynamic, not a specific initial reference.

This is a point on which I still have doubt, when seeing things like this in the code:

for id, snapshots := range allSnapshots {
	lastRevisions[id] = snapshots[len(snapshots)-1].Revision
}

If I understood the loop above correctly, the “last” revision would be number 1, since the array is being traversed backwards ("-1").

Isn’t that bit of code just making a copy of the last revision number for each repository? snapshots[len(snapshots)-1].Revision is indexing the last element of the snapshot array, the -1 is because arrays start at index 0 (so if you have 5 elements - 0 to 4 - to get the [4]'th and last entry, you find the length of 5 and subtract 1). I think in this case, it’s just to make sure a fossil collection doesn’t reference an old snapshot that doesn’t exist.

The reason I say it starts iterating from the oldest snapshot (revision 1 if it exists or otherwise 04 in your example) is because if you rename the snapshot file out of the way (4.bak), the affected snapshots differ by one. I tried this with a -dry-run and compared the logs and it does seem to be the case.

Also, the relevant code has a debug “SNAPSHOT_DELETE” line within the main loop. If you run Duplicacy with -d -log, the revisions go in ascending order.

Edit: in fact a -v is enough to see the SNAPSHOT_DELETE line.


Indeed … so I think this solves the doubt … Thanks!

In case anyone wants to test this further, I wrote a little PowerShell script to create a dummy set of daily snapshots over X number of days.

All it does is set the system clock back <arg> days, create a backup, step forward 1 day, and repeat.

Just init a blank test repository, cd to the repo dir and run <path>\duplicacy_prune-test.ps1 20 to create 20 days’ worth of backups. Then you can try out different prune commands.

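# Number of days of history to simulate, passed as the first argument
$days = $args[0]
$day = New-TimeSpan -Days 1

# Move the system clock back by $days days
Set-Date -Adjust (New-TimeSpan -Days -$days)

# Create one backup per simulated day, stepping the clock forward after each
For ($i=0; $i -lt $days; $i++) {
	duplicacy backup
	Set-Date -Adjust $day
}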