Idempotent storage initialization

I have a use case where I have several VMs (with new VMs coming and going over time), an SFTP-accessible server, and a number of disks to be used in a backup rotation.

I have a script that, from a central place, sshes into all the VMs and runs duplicacy to trigger a backup over SFTP. The challenge is that I want to avoid having to pre-initialize storage for every new VM against each disk currently in the rotation before I can start a backup. Instead, initialization should be part of the backup script itself, run on every backup, so that a never-before-used disk can simply be slotted in.
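For context, the central script is roughly shaped like this (the VM names and the repository path inside each VM are placeholders, not my real layout):

    # Rough shape of the central backup script; VM names and the
    # repository path inside each VM are made up for illustration.
    for vm in vm-alpha vm-beta vm-gamma; do
        ssh "$vm" 'cd /srv/data && duplicacy backup'
    done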

I tried the obvious approach of blindly running “duplicacy init” at the start of the backup process, in the hope that it would initialize the backend if the backend didn’t already know about the snapshot id in question. However, the init command immediately returns “The repository […] has already been initialized” based solely on the presence of the local .duplicacy/preferences file, rather than the state of the backend. The storage backend isn’t even consulted.

I can script my way around this, but I’m proposing a -force option for the init command, or something similar, that would actually check with the backend to see whether initialization is truly needed (in which case it would probably also be prudent to invalidate the local cache – at least snapshots and fossils).

I could submit a PR if this sounds sane.

Thanks!

This seems like a very unusual use case, if I understand correctly what you are trying to do.

If you want to back up to multiple destinations — and your multiple drives are separate independent targets — create multiple destinations and do not try to reuse the target name.

What you are doing now only creates confusion: revision history will be scattered across drives with no simple way to reconcile it. Moreover, pulling the rug out from under Duplicacy by swapping the underlying storage media while keeping the same name will bite back, since the local cache will no longer be in sync with the storage state, among other things.

I think what you mean to do is back up to one drive and then occasionally replicate that backup to a remote drive. Replication (see duplicacy copy) is supported natively.
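Roughly, and assuming the primary storage is already configured as “default” in the repository (the second storage name, snapshot id and SFTP path below are just illustrative):

    # Register the second drive as a copy-compatible storage, then replicate.
    duplicacy add -copy default offsite my-vm-data sftp://backup@nas.local/rotation-disk
    duplicacy copy -from default -to offsite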

(Leaving aside the discussion of bit rot and the viability of backups on a single drive for now; let’s assume that when you say “drive” you actually mean “redundant array with a checksumming filesystem and data integrity guarantees”.)

In other words, while you can make disk rotation work by using unique names for each disk, a more prudent and logical approach would be to back up to a single destination and replicate the backup as needed.

Thanks @saspus for the thoughtful response.

Part of the goal here is to avoid doing N up-front configurations (where N is the number of VMs I have to back up) each time I bring a new disk into rotation (or take an old disk out of rotation).

In my first post I suggested that the cache should be invalidated if the storage backend needs to be initialized, although you’re right: it really needs to happen any time the target drive changes. I have this working by running rm -rf .duplicacy/preferences .duplicacy/cache/default/{fossils,snapshots} before duplicacy init and the subsequent backup (a rough sketch follows below). Each disk does end up with a different view of the snapshots, yes, but that’s fine.
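Concretely, on each VM the per-backup sequence looks something like this (the snapshot id and SFTP URL are placeholders for whatever disk is currently mounted on the server):

    # Wipe the local preferences and the cached snapshot/fossil state, then
    # re-initialize against whichever disk is behind the SFTP path right now.
    rm -rf .duplicacy/preferences .duplicacy/cache/default/{fossils,snapshots}
    duplicacy init my-vm-data sftp://backup@nas.local/current-disk
    duplicacy backup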

When I say “drive” I really do mean a single disk: in addition to various local and cloud backups of critical data, my current backup strategy also includes a much bigger dump (including not-so-critical data) onto a set of large local external disks, which I rotate through as I bring one off site and back on site over time. This has been my approach for 15-or-so years with no durability issues observed so far (bit rot would be apparent in my bi-annual restore tests, and would be exacerbated by the fact that I’m using dm-crypt underneath). As disks age or show signs of potential failure, I just take them out of the rotation.

But your reply has made me realize I’m probably misusing Duplicacy here. I do rather like Duplicacy’s snapshotting capabilities and its tolerance for large files being moved around the filesystem, thanks to its chunk-based approach, but I think perhaps I’ll stick to trusty rsync after all.

Thanks for the feedback!

Are you backing up virtual machines themselves – i.e. virtual disks – or data from the virtual machines? Have you looked at https://www.verticalbackup.com?

And why not just snapshot the VM server storage and replicate the snapshots offsite (depending on your filesystem, there may be built-in facilities for that)? That would be differential and extremely efficient, with no need to copy the entire dataset every time, or even to scan and analyze the data on each backup (as you would with rsync – which is really hard to turn into a proper backup tool; it’s a sync solution).
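For example, if the VM storage happened to live on ZFS (dataset and host names below are hypothetical), incremental replication is built in:

    # Take a new snapshot and send only the delta since the previous one
    # to a pool on another machine.
    zfs snapshot tank/vms@2024-06-15
    zfs send -i tank/vms@2024-06-08 tank/vms@2024-06-15 | ssh backup-host zfs receive backup/vms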

This has been my approach for 15-or-so years with no durability issues observed so far (bit rot would be apparent in my bi-annual restore tests, and would be exacerbated by the fact that I’m using dm-crypt underneath)

You are either using relatively small disks (under 4TB), and/or you have been extremely lucky. Data rot frequency and probability increase with data density. If you keep doing this, you will eventually get your data corrupted. The point being that your backup consistency should be guaranteed, not hinge on “it’s probably going to be OK”.

I don’t know your system design, but rotating copies across disks involves manual labor, and while I’m sure you have good reasons for doing it, it is neither scalable nor sustainable. Fifteen years ago this may have been the only reasonable solution, but today there is fast internet and cheap cloud storage – maybe it is worth rethinking the approach to backup here. Again, I have no clue about your requirements, so I can’t really suggest anything specific.

This is for a home backup solution. Were this a business, obviously I’d just use cloud storage and be done with it. (And I do, at work.)

I currently hold about 1.5TB in Wasabi via Duplicacy, but I have about 4TB of additional non-critical-but-I’d-rather-not-download-it-again software which I back up to local disks. My current disks for the past ~5 years have been 6TB, with no durability issues, save one disk lost to mechanical failure. Sure, ECC does its thing, but that’s a reality with any disk. I’m about to switch to 10TB disks, so I’ll let you know in a few years how that goes. :slight_smile:

I’m backing up the data from within the VMs, rather than the full VM disks. I’m using KVM, so Vertical Backup wouldn’t work, but in any case snapshots with KVM are easy enough. I just prefer to back up the data rather than the underlying virtual disk, which makes restoring select files much easier.

I did like your duplicacy copy suggestion from earlier though, and I think I’ll pivot to that. I’m already doing a local backup across all VMs to S3 storage on my network (Minio), so for backups to a physical disk, it’s much cleaner to copy each backup from S3 to the mounted backup disk than it is to ssh into each VM and point duplicacy at sftp. The latter worked, but copy is much nicer, so thanks for that tip (rough sketch of the new flow below).
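For the record, the per-disk copy step will look something like this, run from a repository where the Minio storage is already configured as “default” (the storage name, snapshot id and mount point are placeholders):

    # Add the mounted rotation disk as a copy-compatible local storage,
    # then replicate the existing backups onto it.
    duplicacy add -copy default rotation-disk my-vm-data /mnt/backup-disk
    duplicacy copy -from default -to rotation-disk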

Cheers.
