Is there an efficient way to backup ZFS datasets of VM disks?

I have a Linux (Proxmox) host on ZFS running some QEMU/KVM VMs and am wondering whether I can use Duplicacy for off-site backups.

Here are my thoughts and concerns so far. Since I’m really unsure about all of it, it would be great if someone could comment a little:

  • I want to run the backups from the host and not inside the VMs, since ZFS snapshots/clones give quite consistent data states. Am I right in thinking that backing up a Linux guest with a running DBMS of some sort (be it MySQL, MongoDB or whatever) “from the inside” with Duplicacy would be just the same kind of gamble on the repair capabilities of the DBMS as with nearly any other backup solution out there?
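The host-side workflow above would boil down to something like the following sketch; the dataset and snapshot names are examples, so adjust them to your pool layout:

```shell
# Crash-consistent backup state via a ZFS snapshot (names are examples).
DATASET="rpool/data/vm-100-disk-1"
SNAP="$DATASET@duplicacy-$(date +%Y%m%d)"

# Only act if zfs and the dataset actually exist on this machine.
if command -v zfs >/dev/null 2>&1 && zfs list "$DATASET" >/dev/null 2>&1; then
    zfs snapshot "$SNAP"    # atomic, point-in-time state of the zvol
    # ... run the backup against the snapshot here ...
    zfs destroy "$SNAP"     # drop the snapshot once the backup is done
else
    echo "would run: zfs snapshot $SNAP"
fi
```

Note this gives a crash-consistent image, not an application-consistent one, which is exactly the DBMS gamble described above.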

  • Ideally I’d like to do something like zfs send rpool/data/vm-100-disk-1@my-backup-snapshot | duplicacy-some-command-to-process-and-backup-data, but I fear Duplicacy cannot do this, as it is designed to work only with files. Is this right, and if not, what would the command look like?

  • If piping zfs send to Duplicacy is not an option (otherwise skip this point), I’d really like to avoid dumping a snapshot to a temporary file and backing that file up, because each would be between 40 and 400 GB, which would mean massive write overhead. All snapshots/clones are also always available via /dev/zvol/rpool/data/vm-100-disk-1@my-backup-snapshot as a “normal” block device — though obviously containing whatever file systems the guest OS uses, which cannot easily be mounted on the host. Is there a good way to tell Duplicacy to read them block by block, or a workaround to expose them as a file that would look like the output of zfs send rpool/data/vm-100-disk-1@my-backup-snapshot > my-temp-file-to-backup, but without the write overhead?
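One thing that might be worth trying: Duplicacy follows first-level symlinks in the repository root, so you could symlink the snapshot’s block device into the repository. I don’t know whether Duplicacy will read a block special file at all, so treat this as an experiment; all paths below are examples:

```shell
# Workaround sketch: expose the snapshot's zvol device to Duplicacy via a
# first-level symlink in the repository root. Whether Duplicacy reads a
# block special file is untested -- this only sets up the layout.
REPO="${TMPDIR:-/tmp}/vm-100-repo"
DEV="/dev/zvol/rpool/data/vm-100-disk-1@my-backup-snapshot"

mkdir -p "$REPO"
ln -sf "$DEV" "$REPO/vm-100-disk-1.img"
# cd "$REPO" && duplicacy backup
```

If Duplicacy refuses the device node, the fallback would be the temporary-file dump, with all the write overhead described above.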

  • If there is such a way, or even if I really go for the ugly temporary file, what would be the best chunk size setting in this case? I read that fixing it to 1M is recommended for other VM technologies, but with ZFS I would always have the full disk content (I’m not planning to implement an incremental ZFS snapshot strategy) — wouldn’t I strip Duplicacy of some of its deduplication capabilities by fixing the chunk size? I.e. if I have another VM with the same OS but with files/stuff at slightly different positions, will I get any deduplication effect?

  • Is there anything else to consider — any reason this could work well, or any reason it shouldn’t be done at all?

Thanks a lot for any input on this topic!

I don’t have experience with ZFS but I’ll try my best to answer your questions.

That is correct. Duplicacy can’t guarantee consistent reads on the DBMS files.

Duplicacy can’t read from stdin.

A fixed chunk size would be sensitive to byte insertions/deletions, so I think it makes sense in this case to use the default variable chunk size setting. However, I don’t know what the optimal chunk size would be; you’ll need to figure that out by trial. And yes, the same file at different disk positions may defeat deduplication, unless you use a chunk size close to the disk sector size (which I would not recommend, as performance with such a small chunk size would be really bad).
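For experimenting, note that chunk sizes are set when the storage is initialized. A sketch using Duplicacy’s init flags with its documented defaults (4M average, 1M/16M min/max); the repository id and storage URL here are made up:

```shell
# Chunk sizes are fixed at init time. The -c/-min/-max flags are real
# Duplicacy CLI options; "vm-backups" and the B2 bucket are placeholders.
CMD="duplicacy init -c 4M -min 1M -max 16M vm-backups b2://my-backup-bucket"

if command -v duplicacy >/dev/null 2>&1; then
    $CMD
else
    echo "would run: $CMD"
fi
```

Pinning min = avg = max (e.g. all to 1M) would give the fixed-size chunking mentioned above; changing chunk sizes later requires a new storage.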