Confused with prune retention policy

Just to be clear, this is how I want the backups to be taken for the definition 2 posts above. (Forget the hourly for now.)

Whether one wants a monthly snapshot or one every 30 days depends on the use case, so I don’t really care which way it works, but to be honest it would be better if both could be specified.

What I’m wondering (maybe I’m still confused) is this: if I specify the prune option as 2 posts above, will it keep the backup from ‘July 31st’ in the image? It seems that -keep 30:70 may mean ‘keep backups 30 days apart for backups older than 70 days’, which might exclude that day, even though I want to keep it as part of the monthly backups.

I think the restic policy would achieve this (but since its restore operation is uselessly slow on small files, I don’t want to use it, though they seem to be trying to fix that).

The thing is that it’s only possible to answer that question under certain conditions, the most obvious one being that you actually made a backup on 31 July. Also, if you made a backup on that day, I think it would have to survive more short term pruning. I’m not sure how exactly to predict that.

So it seems to me that part of the confusion might be due to your trying to translate a system of relative times (and moving boundaries) to absolute dates…

But let’s look at the command you are referring to and figure out what it does. (I’m also doing that as an exercise for myself, as I can’t say I have fully grasped this yet in practice):

So what will duplicacy do here? First, it checks for any snapshots older than 365 days and deletes them. That part is easy.

Then it checks for snapshots older than 70 days (which in this case means: older than 70 days but younger than 365 days). Of those, I suppose it keeps the youngest, then it looks at the second youngest one: if it is less than 30 days away from the youngest one, it is deleted. And so on until it finds one that is 30 or more days older than the youngest one identified above, and it keeps that one. It will do the same for another 30-day period, and so on until it has cycled through all snapshots older than 70 days.

Assuming that when this command was started, there was one snapshot for every calendar day, and today is 19 August, the above concerns snapshots from before about 9 June. So in contrast to your image, I’d say that it will keep the 9 June snapshot but not the one from 31 May. It would keep 9 May rather than 30 April etc.
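
To put rough dates on those boundaries, here is a small Python sketch of the arithmetic. It only subtracts days from an assumed ‘today’ of 19 August 2018 (the year is an assumption taken from the dry-run dates posted later in this thread). It says nothing about how duplicacy treats the boundary day itself or the time of day.

```python
from datetime import date, timedelta

today = date(2018, 8, 19)  # assumed "today"; the year is a guess from the dry-run dates

# -keep n:m applies to snapshots older than m days, so the boundary is today minus m days
cutoffs = {m: today - timedelta(days=m) for m in (365, 70, 14, 1)}

for m, boundary in cutoffs.items():
    print(f"older than {m:>3} days -> roughly before {boundary.isoformat()}")
```

By plain date subtraction the 70-day boundary lands on 10 June, so ‘about 9 June’ is within a day of it.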

But this applies only if we still have snapshots for these days. If you previously ran the command on 31 July, the 9 May snapshot would already have been deleted, so it would end up keeping 30 April because that’s the one that’s left. And so on. In other words, it matters on what date you prune for the very first time.

Okay, but let’s continue the command. Next up is -keep 7:14. Now we’re looking at the snapshots older than 14 days (but younger than 70). We keep the youngest (which is the one from 5 August). The next one to keep is the one from 7 days earlier, so from 29 July. And so on. So pretty close to what we see in your calendar. (And the difference may be my fault. I probably shouldn’t keep the one from the 5th but rather the 4th. Doesn’t matter.)

Last step: Snapshots older than 1 day (but younger than 14 days). We keep one per day, which is exactly what your calendar shows.

Nice “debug” @Christoph! :wink:

I think this is one of the main causes of confusion here in this topic.

There is one thing about the prune command I have doubts about. As I understand it, the command will continue to search for revisions, even within the “boundary” of the previous, larger “-keep”. So “it may” find a revision there that fits the criteria of the second “-keep” and erase it. I don’t know if it “respects” the previous keep “boundary”…

Thanks for the detailed response.

Yes, I was assuming the ‘hourly’ backup has been continuously happening for the entire duration covering the retention policy.

Yes, this is true. I was spoiled into thinking of retention policies in terms of relative dates: first by utilities like rsnapshot, then by cloud providers like Linode, which give you policies such as ‘weekly backup every Tuesday and daily backups for the last few days’ for their server instances (more advanced than restic, since you can define the day of the week instead of being pegged to the last one of the week, which probably means Saturday), and recently by restic, which takes a similar approach to defining the policy.

While I understand the flexibility given by having only days as the unit to specify (with less complication for implementation), when it comes to dates, it may not work too well for people who would expect such relative date policies.

I can certainly write my own custom script to do the retention however I like (whereas I can’t do anything about restic’s slow restore unless I patch its code myself), but I’m wondering what the author’s take is on the future of duplicacy’s retention policy. Personally, I feel it has room for improvement even if it stays with the absolute day policy only. For example, it isn’t immediately obvious from the doc in duplicacy help prune that the ordering of -keep is important, or how each -keep interacts with the others.
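
As a rough idea of what the selection part of such a custom script might look like for calendar-aligned monthlies, here is a minimal Python sketch. It only picks which snapshot dates to keep (the newest one in each calendar month). `last_per_month` is a made-up helper, the dates are invented, and nothing here talks to duplicacy.

```python
from datetime import date, timedelta
from itertools import groupby

def last_per_month(snapshot_days):
    """Return the newest snapshot date in each calendar month (hypothetical policy)."""
    days = sorted(snapshot_days)
    # groupby needs sorted input so that each month forms one consecutive run
    return [max(group) for _, group in groupby(days, key=lambda d: (d.year, d.month))]

# Invented example: one snapshot per day from 1 May to 19 August 2018
start, today = date(2018, 5, 1), date(2018, 8, 19)
days = [start + timedelta(days=i) for i in range((today - start).days + 1)]

monthlies = last_per_month(days)
for d in monthlies:
    print(d)
```

With daily snapshots this keeps 31 May, 30 June and 31 July, plus the newest snapshot of the current month.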

If I’m right, I think this is what will be kept with duplicacy by issuing this,

-keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1

which is not too different from the relative date policy in terms of amount of backups but I think it is pretty hard to predict what gets left by just writing the option in the command line. (I wasn’t sure if ‘older than’ meant inclusive or not, so, some parts could be a day off.)

I think the current absolute date method probably cannot have overlapping backups between each -keep which may become a problem in some cases.

I assume in this case that while Daily keeps moving day after day following the last 14 days, the Weekly and Monthly will stay there until August 12th is cleared of Daily (on September 2nd) and be kept as a Weekly and triggers June 17th to be pruned.

For science, I just ran your retention line on one of my as-yet-unpruned repositories with the -dry-run option. (It has 6am daily backups for nearly a year.)

Snapshots that weren’t deleted include August 6th through 19th as expected. Plus the following…

2018-07-30
2018-07-23
2018-07-16
2018-07-09
2018-07-02
2018-06-25
2018-06-18
2018-06-11
2018-05-22
2018-04-22
2018-03-22
2018-02-20
2018-01-21
2017-12-22
2017-11-22
2017-10-23

I can’t make head nor tail of it either. :slight_smile:

Although actually, the weekly dates make sense to me. What I don’t quite understand is why it keeps the monthly snapshots starting on May 22nd.

Is it because it’s roughly 3x 30 days from today, rather than from 70 days ago? If so, the 70 days seems to be just a qualifier for snapshot candidates to delete, and not an epoch by which differences in dates are calculated. That would also mean it doesn’t keep the ‘oldest’ or ‘youngest’ within the retention period, but the periodic snapshots relative to today’s date, or more importantly the date when prune was first run.

What I also know is that -keep options are independent from each other, and they don’t interfere with each other’s range. You should be able to run them individually one after the other.

Personally… I understand the prune syntax well enough to appreciate its flexibility, and it doesn’t bother me too much that you can’t align it perfectly to a calendar.

Does it really matter whether ‘monthlies’ fall on the 1st, the 30th/31st or in the middle of a month? Or whether ‘weeklies’ fall on a Monday, Sunday or Wednesday? Unless my workflow means that certain data is supposed to exist in a monthly or weekly on a specific weekday/day of the month (unlikely), I don’t think it matters a huge deal imo. :slight_smile:

Thanks for the test.

Perhaps the reason Monthly starts from May 22nd could be that the retention is being calculated from the furthest time instead of the most recent and that your backup may have started around 22nd of about a year ago?

Do you mind doing a -dry-run with the following to see if the Monthly would overlap with the Weekly? It should make 2018-06-21 appear as a Monthly within the 7:14 period, but I feel it will only shorten the Weekly range to ‘older than 14 but younger than 30 days’, so I think the options do influence each other indirectly.
(30:70 -> 30:30)

-keep 0:365 -keep 30:30 -keep 7:14 -keep 1:1

Also, something small: how come a Monthly snapshot is kept on 2018-03-22 but the next one, 2018-04-22, is 31 days apart?
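
The spacing is easy to check with a few lines of Python. This sketch just takes the monthly dates from the dry-run list above and computes the number of calendar days between neighbours. It does not explain why one gap is longer.

```python
from datetime import date

# Monthly dates copied from the dry-run output above, newest first
monthly = ["2018-05-22", "2018-04-22", "2018-03-22", "2018-02-20",
           "2018-01-21", "2017-12-22", "2017-11-22", "2017-10-23"]

dates = [date.fromisoformat(s) for s in monthly]
# Days between each snapshot and the next older one
gaps = [(newer - older).days for newer, older in zip(dates, dates[1:])]

for newer, older, gap in zip(dates, dates[1:], gaps):
    print(f"{older} -> {newer}: {gap} days")
```

So every gap is 30 days except the one between 2018-03-22 and 2018-04-22.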

Frankly, I feel the same as long as the amount of backups taken would end up being similar but the complication here is the predictability of what kind of backups you’d end up having. For new users who try to build up a retention policy, it’s not so easy to express it using the current parameters, as this thread kind of illustrates…

Somewhat on-topic here: I remember reading somewhere that duplicacy prune takes into account the hour when prune is run. So if you take hourly/minutely backups, whether you run prune in the morning or in the evening will matter (regarding what is deleted). I remember @gchen and some other smart people on GitHub (maybe fracai?) saying that should be fixed.

More than the monthly ones anyway, but if I say “keep a snapshot every 7 days for snapshots older than 14 days”, I would expect it to start by first keeping the youngest snapshot that is older than 14 days and then go back another 7 days and keep the next one. But your results show that it omits this first step (since it doesn’t keep one for 5 August).

You’re right. My revision 1 was run on 2017-09-23 and, if I rename the snapshot file to 1.bak and re-run the prune, all the revisions shift by one day for just the -keep 30:70 option. May 23rd is kept instead of the 22nd.

What might be useful right now is to add a bit more debugging to the code to include the snapshot timestamp and which -keep n:m option is responsible for each deletion…

Just guessing without reading the code but perhaps it calculates it to start the Weekly on August 6th (by being inclusive from 19th) but since it has a Daily on it, Weekly could be simply overlapping on that day.

I should consult the code or the author, but I find it odd: if the Weekly (7:14) derives its pattern from the most recent days and lands on Monday (the oldest snapshot, 2017-09-23, is a Saturday, yet the Weekly isn’t pegged to that day), why is the Monthly (30:70) pegged to the oldest day instead?

Now that I think about it, it’s probably calculating everything from the oldest day. After 30:70 stops at May 22nd, as the test shows, it would enter 7:14 and jump 7 days from there; counting inclusively, the next candidate would be May 28th, which is a Monday, and it then keeps picking Mondays. But since 30:70 has ‘eaten’ everything up to the 70th day (which is around June 10th), 7:14 would pick up from June 11th and run until 14 days ago (again inclusive, so August 6th), and the rest are 1:1.

Yes, when a new retention period starts, the first snapshot is always kept and used as the base. Then subsequent snapshots are checked one by one. If a snapshot is too close, then it will be deleted. Otherwise it will be kept and become the new base.
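
That rule can be sketched in Python roughly as follows. This is an illustration, not duplicacy’s actual code: it assumes snapshots are walked from oldest to youngest (which is what the revision 1 rename test above suggests), it compares whole calendar days only, it treats ‘older than m days’ as an age of m days or more, and `simulate_prune` is a made-up name.

```python
from datetime import date, timedelta

def simulate_prune(snapshot_days, today, policies):
    """Return the snapshot dates that survive, under the assumptions stated above.

    policies: (interval_days, min_age_days) pairs sorted by min_age descending,
    e.g. [(0, 365), (30, 70), (7, 14), (1, 1)] for the -keep options in this thread.
    """
    kept = []
    current = None  # policy of the retention period we are currently in
    base = None     # last snapshot kept within that period
    for day in sorted(snapshot_days):  # oldest first (assumption)
        age = (today - day).days
        # First policy in the list whose age threshold this snapshot has reached
        policy = next((p for p in policies if age >= p[1]), None)
        if policy != current:          # a new retention period starts
            current, base = policy, None
        if policy is None:             # younger than every threshold: untouched
            kept.append(day)
            continue
        interval = policy[0]
        if interval == 0:              # interval 0 means keep nothing in this range
            continue
        if base is None or (day - base).days >= interval:
            kept.append(day)           # first in the period, or far enough from the base
            base = day                 # it becomes the new base
        # otherwise the snapshot is too close to the base and is dropped
    return kept

today = date(2018, 8, 19)
first = date(2017, 9, 23)
days = [first + timedelta(days=i) for i in range((today - first).days + 1)]
kept = simulate_prune(days, today, [(0, 365), (30, 70), (7, 14), (1, 1)])
for d in kept:
    print(d)
```

With one snapshot per day since 2017-09-23 and a ‘today’ of 2018-08-19, this sketch keeps the Mondays from 2018-06-11 to 2018-07-30 and every day from 2018-08-06 on, like the dry-run above. Its 30-day dates need not match the dry-run exactly, since it ignores the time of day.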

One issue with the current implementation is that the actual times of day at which snapshots are taken may vary a bit (as may the time at which the prune is run). As a result, a difference of a few seconds may cause a completely different set of snapshots to be deleted (for example this issue: Impact of Prune on storage Copy operations). I think the solution is to consider only the days when comparing the time difference between two snapshots.
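
Here is a small Python sketch of why a few seconds can matter, and of what comparing days only would change. The timestamps are invented, and this only illustrates the idea; it is not the actual or planned implementation.

```python
from datetime import datetime

INTERVAL_DAYS = 7

base = datetime(2018, 7, 23, 6, 0, 5)       # invented: base snapshot at 06:00:05
candidate = datetime(2018, 7, 30, 6, 0, 1)  # invented: a week later, 4 seconds earlier in the day

# Full timestamps: the gap is 4 seconds short of 7 days, so the candidate is "too close"
exact_ok = (candidate - base).total_seconds() >= INTERVAL_DAYS * 86400

# Calendar days only: the two dates are exactly 7 days apart, so the candidate qualifies
days_ok = (candidate.date() - base.date()).days >= INTERVAL_DAYS

print("kept when comparing timestamps:", exact_ok)
print("kept when comparing days only: ", days_ok)
```

In this invented case the candidate would be dropped under the timestamp comparison but kept under the day-only comparison.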

By “first”, do you mean youngest or oldest?

And could you explain

This is a good point, but my understanding is that it’s the youngest.

I’m confused by it as well, mostly about what happens with a customer of mine.

They have a subfolder off the root folder; we’ll call it jobs.
They work on a job, and when it’s done it’s moved to another subfolder called jobs complete, a different subfolder off the root.
In B2 I think (I don’t know for sure) the files are backed up twice.
How do I go about pruning files that have been moved?
I want to be careful not to prune something that was deleted or changed by accident under the root in the other folders, if that even makes sense.

So what is it that you don’t understand?

Pruning is not about removing individual files. It’s about removing revisions from snapshots.

I just submitted an update today for duplicacy to add -keep-max to the prune command. I don’t know if/when gilbertchen will review and/or whether or not he’ll merge the code into a future duplicacy version, but you can read about it here if you’d like:

Duplicacy prune -keep-max option