Did you take a look at the Prune command details? In particular this part:
Doing duplicacy prune -keep 30:35 -keep 7:7
is the same as doing duplicacy prune -keep 7:7
and then doing duplicacy prune -keep 30:35
Yes, I’ve read that part many times; maybe I’m just too dumb to make sense of it. I have a similar question: if I’m taking hourly backups continuously and want these to be kept, is this how it should be written?
duplicacy prune -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1
I was wondering whether this will keep a weekly backup at the 7th day and monthly backup at the 35th day.
Also, with the current -keep method, I don’t think I can keep exactly one monthly snapshot per calendar month, as I can only specify an absolute number of days. (As in, snapshots 30 days apart could end up with one from the 1st and the next from the 31st.)
Is it possible to implement the intuitive restic option for the retention policy?
I’m no expert on this, but it seems to me that the argument makes more sense the other way around: a one-per-month policy could be satisfied by keeping a snapshot from the 31st and one from the 1st of the next month, whereas the 30-day policy makes sure that they’re 30 days apart, no?
Looking at Removing backup snapshots — restic 0.9.2 documentation, I don’t see much difference to duplicacy (rather, it looks like the duplicacy prune options have been inspired by the restic forget options). I notice that it says "--keep-daily n for the last n days" instead of just “keep one for every m for the last n days” as in duplicacy. Is that what you mean? I agree that this would be better. Edit: no wait, isn’t the duplicacy way of doing it more flexible because it’s more generic? In that case, maybe introducing the restic options as aliases would help?
I notice one important detail in the restic documentation, though: they specify that it is always the last snapshot that is kept, for example:
--keep-weekly n for the last n weeks which have one or more snapshots, only keep the last one for that week.
I’d assume that this is the case with duplicacy too (@gchen?) and if that is the case, we should make it explicit in the documentation.
This is also something that you can do in restic but not in duplicacy. In duplicacy you can just keep 24 in the last day, and if you made hourly backups, these will be the ones kept.
Just to be clear, this is how I want the backups to be taken for the definition 2 posts above. (Forget the hourly for now.)
Whether one wants to take a monthly or 30 days apart snapshot depends on use cases, so I don’t really care which way it works but to be honest, it would be better if both cases can be specified.
Maybe I’m still confused, but what I’m wondering is this: if I specify the prune option as 2 posts above, will it keep the backup on ‘July 31st’ in the image? It seems that -keep 30:70 may mean ‘keep backups 30 days apart for backups older than 70 days’, which might exclude that day, and that is a day I want to keep as part of the monthly backups.
I think the restic policy would achieve this (but since it’s uselessly slow on restore operations with small files, I don’t want to use it, though they seem to be trying to fix it).
The thing is that it’s only possible to answer that question under certain conditions, the most obvious one being that you actually made a backup on 31 July. Also, if you made a backup on that day, I think it would have to survive more short term pruning. I’m not sure how exactly to predict that.
So it seems to me that part of the confusion might be due to your trying to translate a system of relative times (and moving boundaries) to absolute dates…
But let’s look at the command you are referring to and figure out what it does. (I’m also doing that as an exercise for myself, as I can’t say I have fully grasped this yet in practice):
So what will duplicacy do here? First, it checks for any snapshots older than 365 days and deletes them. That part is easy.
Then it checks for snapshots older than 70 days (which in this case means: older than 70 days but younger than 365 days.). Of those, I suppose it keeps the youngest, then it looks at the second youngest one: if it is less than 30 days away from the youngest one, it is deleted. And so on until it finds one that is 30 or more days older than the youngest one identified above and it keeps that one. It will do the same for another 30 day period, and so on until it cycled through all snapshots older than 70 days.
Assuming that when this command was started, there was one snapshot for every calendar day, and today is 19 August, the above concerns snapshots from before about 9 June. So in contrast to your image, I’d say that it will keep the 9 June snapshot but not the one from 31 May. It would keep 9 May rather than 30 April etc.
But this applies only if we still have snapshots for these days. If you previously ran the command on 31 July, the 9 May snapshot would already have been deleted, so it would end up keeping 30 April because that’s the one that’s left. And so on. In other words, it matters on what date you prune for the very first time.
Okay, but let’s continue the command. Next up is -keep 7:14. Now we’re looking at the snapshots older than 14 days (but younger than 70). We keep the youngest (which is the one from 5 August). The next one to keep is the one from 7 days earlier, so from 29 July. And so on. So pretty close to what we see in your calendar. (And the difference may be my fault. I probably shouldn’t keep the one from the 5th but rather the 4th. Doesn’t matter.)
Last step: Snapshots older than 1 day (but younger than 14 days). We keep one per day, which is exactly what your calendar shows.
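The age ranges used in this walkthrough can be sketched in a few lines of Python. This is only an illustration of the reading given above, not duplicacy's actual code: the helper name interval_for_age is made up, and treating "older than m days" as an inclusive boundary is an assumption (the thread itself is unsure about that).

```python
def interval_for_age(age_days, rules):
    """Return the n of the "-keep n:m" rule that applies to a snapshot
    of the given age, or None if the snapshot is too young for any rule.

    rules is a list of (n, m) pairs; n == 0 means "delete everything".
    """
    # Check the rule with the largest m first, so each snapshot is
    # claimed by the oldest age range it falls into.
    for n, m in sorted(rules, key=lambda rule: rule[1], reverse=True):
        if age_days >= m:  # assumption: the boundary is inclusive
            return n
    return None


# -keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1
rules = [(0, 365), (30, 70), (7, 14), (1, 1)]

for age in (400, 100, 20, 5, 0):
    print(age, interval_for_age(age, rules))
```

Under these assumptions a 400-day-old snapshot falls under 0:365 (deleted), a 100-day-old one under 30:70, a 20-day-old one under 7:14, a 5-day-old one under 1:1, and one from today is not touched by any rule.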
Nice “debug” @Christoph!
I think this is one of the main causes of confusion here in this topic.
Here is one point I have doubts about regarding the prune command. I understand that the command will continue to search for revisions, even within the “boundary” of the previous larger “-keep”. And “it may” be that it finds a revision that fits the criteria of the second “-keep”, and erases it. I don’t know if it “respects” the previous keep “boundary” …
Thanks for the detailed response.
Yes, I was assuming the ‘hourly’ backup has been continuously happening for the entire duration covering the retention policy.
Yes, this is true. I was spoiled into thinking of retention policies in terms of relative dates by utilities like rsnapshot. Some cloud providers like Linode give you policies such as ‘weekly backup every Tuesday and daily backups for the last few days’ for their server instances (which is more advanced than restic, as you can define which day of the week instead of being pegged to the last one of the week, which probably means Saturday), and more recently restic has taken a similar approach to defining the policy.
While I understand the flexibility given by having only days as the unit to specify (with less complication for implementation), when it comes to dates, it may not work too well for people who would expect such relative date policies.
I can certainly write my own custom script to do the retention however I like (while I can’t do anything about restic’s slow restore unless I patch its code myself), but I’m wondering what the author’s take is on the future of duplicacy’s retention policy. Personally, I feel it has room for improvement even if it stays with an absolute-day policy only: for example, it isn’t immediately obvious from the output of duplicacy help prune that the ordering of -keep is important, or how each -keep interacts with the others.
If I’m right, I think this is what will be kept with duplicacy by issuing this,
-keep 0:365 -keep 30:70 -keep 7:14 -keep 1:1
which is not too different from the relative date policy in terms of amount of backups but I think it is pretty hard to predict what gets left by just writing the option in the command line. (I wasn’t sure if ‘older than’ meant inclusive or not, so, some parts could be a day off.)
I think the current absolute date method probably cannot have overlapping backups between each -keep, which may become a problem in some cases.
I assume in this case that while Daily keeps moving day after day following the last 14 days, the Weekly and Monthly will stay there until August 12th is cleared of Daily (on September 2nd), at which point it is kept as a Weekly and triggers June 17th to be pruned.
For science, I just ran your retention line on one of my as-yet-unpruned repositories with the -dry-run option. (It has 6am daily backups for nearly a year.)
Snapshots that weren’t deleted include August 6th through 19th as expected. Plus the following…
2018-07-30
2018-07-23
2018-07-16
2018-07-09
2018-07-02
2018-06-25
2018-06-18
2018-06-11
2018-05-22
2018-04-22
2018-03-22
2018-02-20
2018-01-21
2017-12-22
2017-11-22
2017-10-23
I can’t make head nor tail of it either.
Although actually, the weekly dates make sense to me, I don’t quite know why it keeps the monthly snapshots starting on May 22nd.
Is it because it’s roughly 3x 30 days from today, rather than from 70 days ago? If so, the 70 days seems to be just a qualifier for snapshot candidates to delete, and not an epoch by which differences in dates are calculated. That would also mean it doesn’t keep the ‘oldest’ or ‘youngest’ within the retention period, but the periodic snapshots relative to today’s date, or more importantly the date when prune was first run.
What I also know is that -keep options are independent from each other, and they don’t interfere with each other’s range. You should be able to run them individually one after the other.
Personally… I can understand the prune syntax enough to appreciate its flexibility, and it doesn’t bother me too much that you can’t align it perfectly to a calendar.
Does it really matter ‘monthlies’ fall on a 1st or 30th/31st or in the middle of a month? Or that ‘weeklies’ fall on a Monday or Sunday or Wednesday. Unless my workflow means that certain data is supposed to exist in a monthly or weekly on a specific weekday/day of the month (unlikely), I don’t think it matters a huge deal imo.
Thanks for the test.
Perhaps the reason Monthly starts from May 22nd could be that the retention is being calculated from the furthest time instead of the most recent, and that your backups may have started around the 22nd about a year ago?
Do you mind running a --dry-run with the following to see if Monthly would overlap with Weekly? If it does, 2018-06-21 should appear as a Monthly within the 7:14 period, but I suspect it will only shorten the Weekly range to snapshots older than 14 but younger than 30 days, so I think the options would be influencing each other indirectly.
(30:70 -> 30:30)
-keep 0:365 -keep 30:30 -keep 7:14 -keep 1:1
Also, something small, but how come the Monthly snapshot gets kept on 2018-03-22 but the next one is 31 days apart, on 2018-04-22?
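For reference, the spacing between the monthly snapshots in the dry-run listing can be checked with a short Python snippet. The dates are copied from the list above; the variable names are only for illustration, and the snippet shows the gaps without explaining why one of them is 31 days.

```python
from datetime import date

# Monthly snapshots kept by the dry run, oldest first (from the listing above).
monthly = [
    date(2017, 10, 23), date(2017, 11, 22), date(2017, 12, 22),
    date(2018, 1, 21), date(2018, 2, 20), date(2018, 3, 22),
    date(2018, 4, 22), date(2018, 5, 22),
]

# Number of calendar days between each pair of consecutive snapshots.
gaps = [(later - earlier).days for earlier, later in zip(monthly, monthly[1:])]
print(gaps)  # [30, 30, 30, 30, 30, 31, 30]
```

Every gap is exactly 30 days except the one between 2018-03-22 and 2018-04-22, which is consistent with the listing being built by stepping forward from the oldest snapshot rather than backward from today.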
Frankly, I feel the same as long as the amount of backups taken would end up being similar but the complication here is the predictability of what kind of backups you’d end up having. For new users who try to build up a retention policy, it’s not so easy to express it using the current parameters, as this thread kind of illustrates…
Somewhat on topic here: I remember reading somewhere that duplicacy prune takes into account the hour at which prune is run. So if you take hourly or minutely backups, whether you run prune in the morning or in the evening will matter (regarding what is deleted). I remember @gchen and some other smart people on GitHub (maybe fracai?) saying that this should be fixed.
More than the monthly ones anyway, but if I say “keep a snapshot every 7 days for snapshots older than 14 days”, I would expect it to start by first keeping the youngest snapshot that is older than 14 days and then go back another 7 days and keep the next one. But your results show that it omits this first step (since it doesn’t keep one for 5 August).
You’re right. My revision 1 was run on 2017-09-23 and, if I rename the snapshot file to 1.bak and re-run the prune, all the revisions shift by one for just the -keep 30:70 option. May 23rd is kept instead of the 22nd.
What might be useful right now is to add a bit more debugging to the code to include the snapshot timestamp and which -keep n:m option is responsible for each deletion…
Just guessing without reading the code, but perhaps it calculates the Weekly to start on August 6th (by being inclusive from the 19th), and since that day already has a Daily, the Weekly could simply be overlapping on that day.
I should consult the code or the author, but I find it odd that the Weekly (7:14) seems to determine its pattern from the most recent days and lands on Mondays (the oldest snapshot, 2017-09-23, is a Saturday, but the pattern is not pegged to that day), while the Monthly (30:70) is pegged from the oldest day instead.
Now that I think about it, it’s probably calculating everything from the oldest day: after 30:70 stops at May 22nd, as the test shows, it would enter 7:14 and jump 7 days from there, and if that is inclusive the next candidate would be May 28th, which is a Monday, and it then keeps picking up Mondays. But since 30:70 has ‘eaten’ everything up until the 70th day (which is around June 10th), 7:14 would pick up from June 11th until 14 days ago (again, being inclusive, August 6th), and the rest are 1:1.
Yes, when a new retention period starts, the first snapshot is always kept and used as the base. Then subsequent snapshots are checked one by one. If a snapshot is too close, then it will be deleted. Otherwise it will be kept and become the new base.
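A minimal Python sketch of the base-and-compare rule described here, for illustration only. It is not duplicacy's implementation: the function name thin is made up, and the snapshots are simply processed in whatever order the caller passes them, since the description does not say whether "first" means oldest or youngest.

```python
from datetime import datetime, timedelta


def thin(snapshots, interval_days):
    """Keep the first snapshot as the base; drop each following snapshot
    that is closer than interval_days to the base; otherwise keep it and
    make it the new base."""
    min_gap = timedelta(days=interval_days)
    kept = []
    base = None
    for timestamp in snapshots:
        if base is None or abs(timestamp - base) >= min_gap:
            kept.append(timestamp)
            base = timestamp
    return kept


# Hypothetical data: 30 daily snapshots at 06:00, oldest first.
start = datetime(2018, 1, 1, 6, 0)
daily = [start + timedelta(days=i) for i in range(30)]

for timestamp in thin(daily, 7):
    print(timestamp.date())
```

With this made-up input and a 7-day interval, the snapshots from 1, 8, 15, 22 and 29 January are kept. Passing the same list in reverse order would keep a different set (30, 23, 16, 9 and 2 January), which is why the choice of starting point matters so much for predicting the result.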
One issue with the current implementation is that the actual times of day at which snapshots are taken may vary a bit (as does the time at which the prune is run). As a result, a difference of a few seconds may cause a completely different set of snapshots to be deleted (for example this issue: Impact of Prune on storage Copy operations). I think the solution is to consider only the days when comparing the time difference between two snapshots.
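To illustrate the effect, here is a small Python sketch comparing an exact-timestamp check with a day-only check. Both function names and the timestamps are hypothetical, and the day-only variant is just one possible reading of the proposed fix, not what duplicacy actually does.

```python
from datetime import datetime, timedelta


def far_enough_exact(base, candidate, interval_days):
    """Compare full timestamps: a few seconds can decide the outcome."""
    return candidate - base >= timedelta(days=interval_days)


def far_enough_by_day(base, candidate, interval_days):
    """Compare calendar dates only, ignoring the time of day."""
    return (candidate.date() - base.date()).days >= interval_days


# Hypothetical snapshots: the later one started 5 seconds earlier in the day.
base = datetime(2018, 3, 1, 6, 0, 5)
candidate = datetime(2018, 3, 8, 6, 0, 0)

print(far_enough_exact(base, candidate, 7))   # False: 6 days 23:59:55 apart
print(far_enough_by_day(base, candidate, 7))  # True: 7 calendar days apart
```

With the exact comparison, the second snapshot counts as too close and would be deleted, so the next day's snapshot becomes the one that is kept and every later choice shifts with it. Comparing by day avoids that drift.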
by “first”, do you mean youngest or oldest?
And could you explain
This is a good point, but my understanding is that it is the youngest.