Confused with prune retention policy

prune

#1

I’m having difficulty understanding how the -keep option in prune works.

What I’d like to achieve is to take a backup every hour, continuously, and prune them so that the following are kept:

  • Hourly backups for the last 24 hours (Examples all talk about daily pruning but is this possible?)
  • Daily backups for the last 7 days
  • Weekly backups for the last 5 weeks
  • Monthly backups for the last 12 months

I’m probably wrong, but I have it set up like this (except for the hourly part).

duplicacy prune -keep 1:1 -keep 7:7 -keep 30:30 -keep 0:365

I suppose the documentation or the option itself could be a bit friendlier, as the description seems pretty confusing to me.

Just for your information, restic has a human-friendly format in which I could specify the above as:

restic forget --keep-hourly 24 --keep-daily 7 --keep-weekly 5 --keep-monthly 12


#2

This is what you want:

duplicacy prune -keep 0:365 -keep 30:35 -keep 7:7 -keep 1:1  
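
For what it’s worth, the pattern behind these numbers can be written down as a small helper. This is only a sketch in Python: `build_keep_options` and its parameters are hypothetical names, and the mapping (weeklies start once snapshots leave the daily window, 30-day snapshots start once they leave the weekly window, both counted from today) is inferred from the commands in this thread, not taken from duplicacy itself.

```python
def build_keep_options(daily_days, weekly_weeks, max_days):
    """Build a -keep option string from a 'dailies / weeklies / 30-day' wish list.

    Assumed convention (inferred from this thread, not from duplicacy's code):
    each threshold m is counted in days back from today.
    """
    policies = [
        (0, max_days),           # n = 0: delete everything older than max_days
        (30, weekly_weeks * 7),  # one per 30 days once older than the weekly window
        (7, daily_days),         # one per 7 days once older than the daily window
        (1, 1),                  # one per day once older than 1 day
    ]
    # The options are emitted with the largest m first.
    return " ".join(f"-keep {n}:{m}" for n, m in policies)


print("duplicacy prune " + build_keep_options(7, 5, 365))
```

Calling `build_keep_options(7, 5, 365)` reproduces the options in the command above.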

#3

Thank you for the quick response.

In -keep 30:35, what does the 35 mean? Does it have to be 35 so that exactly 5 weeks have passed?
And does the order of the parameters matter, given that you reversed them from the way I had written them?

Also, I take it that there is no pruning at hourly precision?
If not, is there a plan to add it? I couldn’t find any issues discussing it, either here or on GitHub.

Thanks again.


#4

You should start by searching for your problem.
Most likely you will find the #how-to guides (there’s plenty).
In our particular case there is this one: Prune command details


#5

I missed the sorting part of -keep in the doc but it is still confusing after reading the manual several times.

The way I currently understand it, with this setting,

duplicacy prune -keep 0:365 -keep 30:35 -keep 7:7 -keep 1:1  

for the weekly part, I get to keep one snapshot every 7 days for backups older than 7 days. But will it prune them after keeping 5 of them, as specified in the first post? Maybe the 35 means that -keep 7:7 will not interfere with the preceding -keep 30:35, and the weeklies get pruned once they reach 35 days?
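
To check my own reading, the options can be parsed and spelled out in words. The Python below is only a sketch: `parse_keep_options`, `is_sorted_by_age` and `describe` are made-up helper names, the wording "keep one snapshot every n days for snapshots older than m days" is how I read the guide, and the rule that the m values must decrease from left to right is my understanding of the sorting part, not something I verified in the code.

```python
import re


def parse_keep_options(command):
    """Extract (n, m) pairs from every '-keep n:m' in a command line string."""
    return [(int(n), int(m)) for n, m in re.findall(r"-keep\s+(\d+):(\d+)", command)]


def is_sorted_by_age(policies):
    """True if the m values strictly decrease from left to right (assumed rule)."""
    ages = [m for _, m in policies]
    return all(a > b for a, b in zip(ages, ages[1:]))


def describe(policies):
    """Spell out each (n, m) pair in words, following my reading of the guide."""
    lines = []
    for n, m in policies:
        if n == 0:
            lines.append(f"delete all snapshots older than {m} days")
        else:
            lines.append(f"keep one snapshot every {n} days for snapshots older than {m} days")
    return lines


policies = parse_keep_options("duplicacy prune -keep 0:365 -keep 30:35 -keep 7:7 -keep 1:1")
print(is_sorted_by_age(policies))
for line in describe(policies):
    print(line)
```

Under that assumed rule, the command from my first post (`-keep 1:1 -keep 7:7 -keep 30:30 -keep 0:365`) fails the check because its m values increase.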


#6

Did you take a look at the Prune command details? In particular this part:

Doing duplicacy prune -keep 30:35 -keep 7:7 is the same as doing duplicacy prune -keep 7:7 and then doing duplicacy prune -keep 30:35


#7

Yes, I’ve read that part many times. Maybe I’m just too dumb to make sense of it, but I have a similar question. If I’m taking hourly backups continuously and want these to be kept,

  • Hourly backups for the last 24 hours
  • Daily backups for the last 14 days
  • Weekly backups for the last 10 weeks
  • Monthly backups for the last 12 months

is this how it should be written?

duplicacy prune -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1

I was wondering whether this will keep a weekly backup at the 7th day and a monthly backup at the 35th day.

Also, with the current -keep method, I don’t think I can have exactly one snapshot kept per calendar month, as I can only specify an absolute number of days. (With snapshots 30 days apart, I could end up with one from the 1st and the next from the 31st.)
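
To illustrate the drift I mean, here is a tiny Python sketch that just steps through the calendar in 30-day increments. The function name `every_n_days` and the start date are arbitrary choices for the example; it says nothing about which snapshots duplicacy would actually pick.

```python
from datetime import date, timedelta


def every_n_days(start, n, count):
    """Return `count` dates spaced exactly n days apart, beginning at `start`."""
    return [start + timedelta(days=n * i) for i in range(count)]


for day in every_n_days(date(2018, 1, 1), 30, 5):
    print(day.isoformat())
```

Starting from 1 January 2018, the first three dates are 1 January, 31 January and 2 March: two in January and none in February.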

Is it possible to implement the intuitive restic option for the retention policy?


#8

I’m no expert on this, but it seems to me that the argument makes more sense the other way around: a one-per-month policy could be satisfied by keeping a snapshot from the 31st and one from the 1st of the next month, whereas the 30-day policy makes sure that they’re 30 days apart, no?

Looking at Removing backup snapshots — restic 0.9.2 documentation, I don’t see much difference to duplicacy (rather, it looks like the duplicacy prune options have been inspired by the restic forget options). I notice that it says " --keep-daily n for the last n days" instead of just “keep one for every m for the last n days” as in duplicacy. Is that what you mean? I agree that this would be better. Edit: no wait, isn’t the duplicacy way of doing it more flexible because it’s more generic? In that case, maybe introducing the restic options as aliases would help?

I notice one important detail in the restic documentation, though: they specify that it is always the last snapshot that is kept, for example:

--keep-weekly n for the last n weeks which have one or more snapshots, only keep the last one for that week.

I’d assume that this is the case with duplicacy too (@gchen?) and if that is the case, we should make it explicit in the documentation.

This is also something that you can do in restic but not in duplicacy. In duplicacy you can just keep 24 in the last day and, if you made hourly backups, these will be the ones kept.


#9

Just to be clear, this is how I want the backups to be taken for the definition 2 posts above. (Forget the hourly for now.)

Whether one wants to take a monthly or 30 days apart snapshot depends on use cases, so I don’t really care which way it works but to be honest, it would be better if both cases can be specified.

What I’m wondering is this (maybe I’m still confused): if I specify the prune options as 2 posts above, will it keep the backup on ‘July 31st’ in the image? It seems that -keep 30:70 may mean ‘keep backups 30 days apart for backups older than 70 days’, which might exclude that day, even though I want to keep it as part of the monthly backups.

I think the restic policy would achieve this (but since it’s unusably slow when restoring small files, I don’t want to use it, though they seem to be trying to fix that).


#10

The thing is that it’s only possible to answer that question under certain conditions, the most obvious one being that you actually made a backup on 31 July. Also, if you made a backup on that day, I think it would have to survive more short term pruning. I’m not sure how exactly to predict that.

So it seems to me that part of the confusion might be due to your trying to translate a system of relative times (and moving boundaries) to absolute dates…

But let’s look at the command you are referring to and figure out what it does. (I’m also doing that as an exercise for myself, as I can’t say I have fully grasped this yet in practice):

So what will duplicacy do here? First, it checks for any snapshots older than 365 days and deletes them. That part is easy.

Then it checks for snapshots older than 70 days (which in this case means: older than 70 days but younger than 365 days). Of those, I suppose it keeps the youngest, then it looks at the second youngest one: if it is less than 30 days away from the youngest one, it is deleted. And so on until it finds one that is 30 or more days older than the youngest one identified above, and it keeps that one. It will do the same for another 30-day period, and so on until it has cycled through all snapshots older than 70 days.

Assuming that when this command was started, there was one snapshot for every calendar day, and today is 19 August, the above concerns snapshots from before about 9 June. So in contrast to your image, I’d say that it will keep the 9 June snapshot but not the one from 31 May. It would keep 9 May rather than 30 April etc.

But this applies only if we still have snapshots for these days. If you previously ran the command on 31 July, the 9 May snapshot would already have been deleted, so it would end up keeping 30 April because that’s the one that’s left. And so on. In other words, it matters on what date you prune for the very first time.

Okay, but let’s continue the command. Next up is -keep 7:14. Now we’re looking at the snapshots older than 14 days (but younger than 70). We keep the youngest (which is the one from 5 August). The next one to keep is the one from 7 days earlier, so from 29 July. And so on. So pretty close to what we see in your calendar. (And the difference may be my fault. I probably shouldn’t keep the one from the 5th but rather the 4th. Doesn’t matter.)

Last step: Snapshots older than 1 day (but younger than 14 days). We keep one per day, which is exactly what your calendar shows.
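
The boundary dates I used above can be double-checked with a few lines of Python. This is just date arithmetic, not duplicacy code: `cutoff_dates` is a made-up helper, the year 2018 is my assumption, and whether a snapshot taken exactly on a boundary day counts as "older than m days" is something I haven’t verified.

```python
from datetime import date, timedelta


def cutoff_dates(today, thresholds):
    """Map each 'older than m days' threshold to the date m days before today."""
    return {m: today - timedelta(days=m) for m in thresholds}


# Thresholds taken from -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1
for m, day in cutoff_dates(date(2018, 8, 19), [365, 70, 14, 1]).items():
    print(f"{m:>3} days before today: {day.isoformat()}")
```

So 70 days back from 19 August is 10 June and 14 days back is 5 August, which is where my "about 9 June" and "5 August" come from.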


#11

Nice “debug” @Christoph! :wink:

I think this is one of the main causes of confusion here in this topic.

Here is one point about the prune command that I have doubts about. I understand that the command will continue to search for revisions, even within the “boundary” of the previous, larger “-keep”. And “it may” be that it finds a revision that fits the criteria of the second “-keep” and erases it. I don’t know if it “respects” the previous keep “boundary”…


Do different prune -keep periods interfere with each other
#12

Thanks for the detailed response.

Yes, I was assuming the ‘hourly’ backup has been continuously happening for the entire duration covering the retention policy.

Yes, this is true. I was spoiled into thinking of retention policies in terms of relative dates by utilities like rsnapshot and by cloud providers like Linode, which offer policies such as ‘weekly backup every Tuesday and daily backups for the last few days’ for their server instances. (That is more advanced than restic, as you can define the day of the week instead of being pegged to the last one of the week, which probably means Saturday.) More recently, restic also took a similar approach to defining the policy.

While I understand the flexibility given by having only days as the unit to specify (with less complication for implementation), when it comes to dates, it may not work too well for people who would expect such relative date policies.

I can certainly write my own custom script to do the retention however I like (while I can’t do anything about restic’s slow restore unless I patch its code myself), but I’m wondering what the author’s take is on the future of duplicacy’s retention policy. Personally, I feel there is room for improvement even if it stays with an absolute-day policy only. For example, when checking the doc with duplicacy help prune, it isn’t immediately obvious that the ordering of -keep is important, or how each -keep interacts with the others.


#13

If I’m right, I think this is what will be kept with duplicacy by issuing this,

-keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1

which is not too different from the relative date policy in terms of amount of backups but I think it is pretty hard to predict what gets left by just writing the option in the command line. (I wasn’t sure if ‘older than’ meant inclusive or not, so, some parts could be a day off.)

I think the current absolute date method probably cannot have overlapping backups between each -keep which may become a problem in some cases.

I assume in this case that while Daily keeps moving day after day following the last 14 days, the Weekly and Monthly will stay there until August 12th is cleared of Daily (on September 2nd) and be kept as a Weekly and triggers June 17th to be pruned.


#14

For science, I just ran your retention line on one of my as-yet-unpruned repositories with the -dry-run option. (It has 6am daily backups for nearly a year.)

Snapshots that weren’t deleted include August 6th through 19th as expected. Plus the following…

2018-07-30
2018-07-23
2018-07-16
2018-07-09
2018-07-02
2018-06-25
2018-06-18
2018-06-11
2018-05-22
2018-04-22
2018-03-22
2018-02-20
2018-01-21
2017-12-22
2017-11-22
2017-10-23

I can’t make head nor tail of it either. :slight_smile:

Actually, although the weekly dates make sense to me, I don’t quite know why it keeps the monthly snapshots starting on May 22nd.

Is it because it’s roughly 3x 30 days from today, rather than from 70 days ago? If so, the 70 days seems to be just a qualifier for snapshot candidates to delete, and not an epoch by which differences in dates are calculated. That would also mean it doesn’t keep the ‘oldest’ or ‘youngest’ within the retention period, but the periodic snapshots relative to today’s date, or more importantly the date when prune was first run.

What I also know is that -keep options are independent from each other, and they don’t interfere with each other’s range. You should be able to run them individually one after the other.

Personally… I can understand the prune syntax enough to appreciate its flexibility, and it doesn’t bother me too much that you can’t align it perfectly to a calendar.

Does it really matter whether ‘monthlies’ fall on a 1st or 30th/31st or in the middle of a month? Or whether ‘weeklies’ fall on a Monday or Sunday or Wednesday? Unless my workflow means that certain data is supposed to exist in a monthly or weekly on a specific weekday/day of the month (unlikely), I don’t think it matters a huge deal imo. :slight_smile:


#15

Thanks for the test.

Perhaps the reason Monthly starts from May 22nd could be that the retention is being calculated from the furthest time instead of the most recent and that your backup may have started around 22nd of about a year ago?

Do you mind running a --dry-run with the options below to see whether a Monthly would overlap with the Weekly range? It should make 2018-06-21 appear as a Monthly within the 7:14 period, but I suspect it will only shorten the Weekly range to snapshots older than 14 but younger than 30 days, so I think the options do influence each other indirectly.
(30:70 -> 30:30)

-keep 0:365 -keep 30:30 -keep 7:14 -keep 1:1

Also, a small thing: how come a Monthly snapshot is kept on 2018-03-22 but the next one, 2018-04-22, is 31 days apart?
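
For reference, the gaps between the monthly snapshots in the dry-run list can be computed like this. It is plain date arithmetic in Python on the dates posted above; `gaps_in_days` is just a name I picked.

```python
from datetime import date

# Monthly snapshots reported as kept in the dry run in post #14.
MONTHLY_KEPT = [
    "2018-05-22", "2018-04-22", "2018-03-22", "2018-02-20",
    "2018-01-21", "2017-12-22", "2017-11-22", "2017-10-23",
]


def gaps_in_days(iso_dates):
    """Return the day counts between consecutive dates, oldest first."""
    days = sorted(date.fromisoformat(d) for d in iso_dates)
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]


print(gaps_in_days(MONTHLY_KEPT))  # [30, 30, 30, 30, 30, 31, 30]
```

Every gap is 30 days except the one from 2018-03-22 to 2018-04-22.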

Frankly, I feel the same, as long as the number of backups kept ends up being similar. The complication here is the predictability of which backups you’d end up having. For new users trying to build up a retention policy, it’s not so easy to express it using the current parameters, as this thread kind of illustrates…


#16

Somewhat on-topic here: I remember reading somewhere that duplicacy prune takes into account the hour when prune is run. So if you take hourly/minutely backups, it will matter whether you run prune in the morning or in the evening (regarding what is deleted). I remember @gchen and some other smart people on GitHub (maybe fracai?) saying that this should be fixed.


#17

More than the monthly ones anyway, but if I say “keep a snapshot every 7 days for snapshots older than 14 days”, I would expect it to start by first keeping the youngest snapshot that is older than 14 days and then go back another 7 days and keep the next one. But your results show that it omits this first step (since it doesn’t keep one for 5 August).


#18

You’re right. My revision 1 was run on 2017-09-23 and, if I rename the snapshot file to 1.bak and re-run the prune, all the revisions shift by one for just the -keep 30:70 option. May 23rd is kept instead of 22nd.

What might be useful right now is to add a bit more debugging to the code to include the snapshot timestamp and which -keep n:m option is responsible for each deletion…
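
In the meantime, here is a rough model that is consistent with these dry-run results: walk the snapshots from oldest to newest, pick the -keep range each one falls into by age, always keep the first (oldest) snapshot in a range, and then keep the next one that is at least n days after the last kept one. The Python below is a sketch under that assumption. It is not taken from the duplicacy source, `simulate_prune` is a made-up name, and it works on whole days, whereas the real command apparently looks at timestamps (see post #16), so results can be off by a day.

```python
from datetime import date, timedelta


def simulate_prune(snapshot_dates, policies, today):
    """Return the dates kept by a day-granular, oldest-first model of prune.

    policies: (n, m) pairs sorted by m descending, as on the command line.
    """
    kept = []
    current = None    # index of the -keep range the previous snapshot fell into
    last_kept = None  # last snapshot kept inside the current range
    for snap in sorted(snapshot_dates):  # oldest first
        age = (today - snap).days
        # First range (largest m) whose age threshold this snapshot has reached.
        idx = next((i for i, (_, m) in enumerate(policies) if age >= m), None)
        if idx is None:  # younger than every threshold: never pruned
            kept.append(snap)
            continue
        if idx != current:  # entering a new range resets the spacing
            current, last_kept = idx, None
        n = policies[idx][0]
        if n == 0:  # n = 0 means: delete everything this old
            continue
        if last_kept is None or (snap - last_kept).days >= n:
            kept.append(snap)
            last_kept = snap
    return kept


# One snapshot per day from 2017-09-23 to 2018-08-19, pruned on 2018-08-19.
today = date(2018, 8, 19)
first = date(2017, 9, 23)
snapshots = [first + timedelta(days=i) for i in range((today - first).days + 1)]
for day in simulate_prune(snapshots, [(0, 365), (30, 70), (7, 14), (1, 1)], today):
    print(day.isoformat())
```

With these inputs the model keeps the weeklies 2018-06-11 through 2018-07-30 and every day from August 6th, as in the dry run. Its 30-day snapshots land on 2018-05-21 rather than 2018-05-22, and removing the first snapshot shifts them all one day later, which matches the direction of the shift I saw.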


#19

Just guessing without reading the code, but perhaps it calculates the Weekly to start on August 6th (by being inclusive from the 19th), and since there is already a Daily on that day, the Weekly could simply be overlapping with it.


#20

I should consult the code or the author, but I find it odd: the Weekly (7:14) seems to determine its pattern from the most recent days, landing on Mondays (the oldest snapshot, 2017-09-23, is a Saturday, but the weeklies are not pegged to that day), while the Monthly (30:70) is pegged from the oldest day instead.