Best settings for pruning

Hi there, one of my hard drives is nearing the 2 TB mark and is nearly full. I am looking at setting up pruning on the drive so that there is some space available to back up to. There is an option to delete snapshots older than 1800 days, which I guess means any snapshot made over 1800 days ago will be deleted.

But what about keeping x number of previous versions of a file? If I made a backup over 1800 days ago, my file is still on my computer, and I made a backup a week ago, then I guess the file still stays in the latest backup?

The bit I’m a little confused on is where it says keep 1 snapshot every 7 days if older than 30 days.
So if one backup was made 60 days ago and another was made 2 days after that, which snapshot gets deleted?

For the next option of 1 snapshot every 1 day if older than 7 days, what would happen if I put 0 in? Would I have to leave it blank for prune to ignore it? I don’t really know what the best options for retention are.

Pruning won’t free much space unless your data is very volatile.

I would just increase storage space on your server instead.

This simply adds a -keep 0:1800 argument to the prune command.
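On the CLI side, that UI option maps to something like the following (command shape per the duplicacy prune documentation; run it from an initialized repository):

```shell
# Delete all snapshots older than 1800 days
# (-keep 0:1800 means "keep 0 snapshots older than 1800 days"):
duplicacy prune -keep 0:1800
```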


I don’t remember, but that’s the point: it does not matter. If it mattered, that would mean the backup frequency is too low or the pruning aggressiveness is too high.

Edit: I think the best way to think about it is to follow what duplicacy would do: it considers each -keep N:M argument in the order provided and applies the rules as follows:

  • Is the revision at least M days old?
    • No? Do nothing; consider the next -keep argument.
    • Yes? Proceed.
  • Is there an already-kept revision within N days of this one?
    • No? Do nothing (the revision is kept).
    • Yes? Delete the revision.

This is also why the list must be ordered: -keep 7:30 -keep 1:7 is not the same as -keep 1:7 -keep 7:30. The latter collapses into -keep 1:7, because the first clause matches all revisions that the second would match, and therefore the second never has any effect.
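The decision procedure above can be sketched in Python (a simplified simulation for intuition, not Duplicacy’s actual code; snapshot ages are in days, oldest first):

```python
from typing import List, Tuple

def prune(ages: List[int], keep: List[Tuple[int, int]]) -> List[int]:
    """Simulate -keep N:M rules on a list of snapshot ages (oldest first).

    Each (N, M) pair means "keep 1 snapshot every N days if older than
    M days"; N == 0 means "keep 0 snapshots older than M days".
    Rules are checked in the order given, as duplicacy does.
    """
    kept: List[int] = []
    for age in ages:
        # First rule whose age threshold M this snapshot exceeds.
        rule = next(((n, m) for n, m in keep if age >= m), None)
        if rule is None:
            kept.append(age)   # no rule applies: always keep
            continue
        n, _ = rule
        if n == 0:
            continue           # -keep 0:M deletes everything older than M
        # Delete if an already-kept snapshot lies within N days of this one.
        if kept and kept[-1] - age < n:
            continue
        kept.append(age)
    return kept

daily = list(range(60, 0, -1))   # one snapshot per day for 60 days

# Correct order: weekly after 30 days, then daily after 7 days.
print(prune(daily, [(7, 30), (1, 7)]))   # 60, 53, 46, 39, 32, then daily

# Reversed order collapses to -keep 1:7: the first rule already matches
# every revision the second would, so nothing extra is deleted.
print(prune(daily, [(1, 7), (7, 30)]))

# N = 0 deletes everything older than M days.
print(prune(daily, [(0, 7)]))            # only the last 6 days survive
```

Running the two orderings side by side makes the collapse visible: the correct order thins 60 daily snapshots down to 34, while the reversed order keeps all 60.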

It will delete all snapshots older than 7 days: “keep 0 snapshots older than 7 days”

Ignore the UI; just read the documentation on the prune “keep” option. The UI just assembles the series of keep parameters. Also note, they must be sorted.

This is subjective; there is no universal best. For me, no pruning is best. Why would I ever want to delete anything? Storage is almost free.

Another good default is what Time Machine on a Mac defaults to: keep hourly backups for the past 24 hours, daily backups for the past month, and weekly backups for everything older than a month.
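Assuming Duplicacy’s -keep syntax, a rough translation of that Time Machine schedule might look like this (a sketch, not an official mapping; “hourly for 24 hours” simply means not pruning anything newer than a day):

```shell
# Weekly after 30 days, daily after 1 day; snapshots under a day old
# are left at whatever frequency the backup schedule produced them.
duplicacy prune -keep 7:30 -keep 1:1
```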

My advice to you would be to run a check on your storage and look at the stats tables at the end. For each snapshot, there’s a bunch of figures including:

files (bytes)
chunks (bytes)
unique (bytes)
new (bytes)

Unique bytes will tell you how much data will be saved by pruning that snapshot alone.

However, you may save even more if you remove multiple snapshots (where, for example, they share chunks between them and no others), so take it as a bare minimum and as a rough indication of disk space savings.
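If you’re using the CLI, the stats table described above can be produced like this (flag name assumed from the duplicacy check documentation):

```shell
# Print per-revision statistics in tabular form, including the
# unique bytes attributable to each snapshot:
duplicacy check -tabular
```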

For those of us with a finite amount of space and a closed wallet (almost everyone), pruning makes complete sense - not least because having too many snapshots can be a big overhead when conducting checks or prunes in the future. And you may not want to keep daily snapshots going back 5 years - you can control it by keeping, say, weekly backups after 1 year, and monthly after 3.

Whatever retention policy you use, removing snapshots will obviously save space, and setting a limit (0:1800) will ensure your backups don’t grow ad infinitum. You might find that keeping weekly/monthly backups past a certain point is enough to avoid having to add a 0:X limit (yet). Or vice versa.

I have never seen a reasonable prune policy save more than 20% of space, based on my observations on a few computers by a few different users. The likely reason is that most users have a lot of static data and little turnover; most of the overhead comes from deduplication inefficiency, which is pretty low as it is.

Not having to prune on the other hand opens other opportunities for hardening the backup further by disallowing rewrites and modifications. I’d argue this is way more valuable than saving 20% on storage.

Furthermore, storage that is worth using (low retention cost, high durability: archival tiers that Duplicacy does not support today but hopefully eventually will) comes with long minimum storage charges, which can be as high as 180 days. Pruning data earlier than that does not save money; on the contrary, it forces the payment immediately. And if you keep those diffs for 180 days anyway, you might as well keep them indefinitely.

Related: because duplicacy does not yet support archival storage, users are forced to use very expensive hot storage, overpaying at least 4x. The 20% that not pruning will waste is therefore not worth discussing. If cost is a concern, support for archival storage should be implemented.


Not everybody wants to store all their eggs in one basket (cloud-only), and most will wisely choose a straightforward, proper, 3-2-1 strategy that includes 2 or more local copies situated on finite disk space. 20%, or however much it is for each person, isn’t a trivial amount, especially when multiplied by the number of backup copies.

As I pointed out, all anyone has to do is run a check and see for themselves a good approximation of how much of their data is stored in any snapshot. In my experience, it always has to be pruned at some point, or the wallet has to be regularly taken out just to keep expanding storage. That’s not an option for most people.

I manage dozens of systems with varying backup technologies - some with de-duplication (e.g. Veeam Agent, or Win Server Essentials client backup, the latter of which actually has very high de-duplication down to the sector level) - and when I look at the numbers, they all have to be pruned to some degree, or they waste enormous amounts of disk space. This doesn’t have anything to do with de-duplication efficiency - it’s simply data growth, and the fact that these tools, and Duplicacy, use point-in-time snapshots.

If you do F.A. with your data, it won’t grow, and neither will your backups. If you actually do stuff, it’ll grow regardless - proportional to how much you use your data - as will the backups. So sure, if you want to turn your backup plan into an infinite archive, knock yourself out. Most people won’t, and they can check for themselves and decide on their own requirements. Backup methodologies have always had retention options (daily, weekly, monthly) for this very reason.

As for archival storage, IMHO that’s mostly folly that won’t save on cost for anyone seriously using it for backup purposes. If it were an order of magnitude cheaper instead of merely “4x”, it might make sense. Perhaps for shipping only older snapshots to long-term archive, it might make sense. But for backups, where conducting test restores should be part of the backup plan, it’s just not feasible with archival tier storage, where the cost of restoring is an order of magnitude steeper. Unless you want to equate general cloud reliability with the ability of the client to reliably implement its backup logic… backups should be tested.

I’m all for Duplicacy supporting separated chunk metadata (since it can be useful to segment that on local storage too), but not for wasting time and effort beyond that just so a handful of people can save not very much at all. I would much rather development efforts be put into reliability: snapshot fossils, server-side checksum verification on all backends that support it (which would actually help with archival tier storage), better detection of and recovery from missing/truncated chunks, and yes, separated chunk metadata.