Duplicacy and btrfs snapshots

Diagon · 18 August 2022 20:10

As I consider how to apply duplicacy, having recently finished the documentation, I started thinking I might like to use something along the lines of (using pseudo-code):

btrfs send | duplicacy backup

Is there any way to pipe to duplicacy and have it do its magic on the large file that btrfs send produces?

The one way I’m thinking could work (?) is to create a directory with a fifo, initialize that as a repository, and then btrfs send to the fifo just prior to running duplicacy backup. Any other thoughts?
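For readers unfamiliar with the fifo mechanics, here is a minimal sketch of the idea. The functions produce_stream and consume_stream are hypothetical stand-ins for btrfs send and for the backup tool. Whether duplicacy will actually read from a fifo rather than skip it as a non-regular file is not confirmed anywhere in this thread, so treat this only as an illustration of the plumbing.

```shell
#!/bin/sh
set -eu

# Stand-in for `btrfs send <snapshot>` (hypothetical; the real command needs root).
produce_stream() { printf 'pretend btrfs send stream\n'; }

# Stand-in for the backup tool: any program that opens the path and reads to EOF.
consume_stream() { cat "$1"; }

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkfifo "$workdir/stream.fifo"

# Opening a fifo for writing blocks until a reader opens it,
# so the writer has to be started in the background first.
produce_stream > "$workdir/stream.fifo" &
writer_pid=$!

# The reader sees the data as if it were a file, but nothing is stored on disk.
received=$(consume_stream "$workdir/stream.fifo")
wait "$writer_pid"

echo "$received"
```

Note that a fifo can only be read once per send, so any tool that scans the file first and reads it again later would not work with this setup.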

saspus · 18 August 2022 20:19

I’m wondering what the use case/benefit would be versus mounting the snapshot locally and backing up files from it (which also lets you apply exclusion patterns and whatnot)?

Diagon · 18 August 2022 20:29

A standard way to back up a btrfs system is to snapshot and then send to another btrfs system or to a file. That kind of backup makes recovery extremely easy: btrfs receive and you’ve got your system back in the state it was in when you snapshotted it. I’d like to do basically that, but if possible use duplicacy to do a remote transfer to a server that I don’t control, which necessitates encryption. The deduplication and compression are a (major) plus.

Edit: I don’t know if you’ve ever looked at btrbk, but it has a backup solution that is basically:

btrfs send | zstd | gpg | ssh

That’s what I’m trying to do, but better.

saspus · 18 August 2022 21:00

I understand, but I assume the reason to use duplicacy as opposed to pack/encrypt the whole snapshot is deduplication efficiency: with the latter approach there is none (unless you send differential changes, which creates a whole lot more maintenance), while passing it through a deduplicating backup solution gives you some.

I don’t have proof, but it feels like backing up a mounted filesystem would result in better deduplication efficiency: duplicacy sorts all files and then applies a rolling hash, while with the image stream you get what you get: information about file boundaries and order is lost. On the other hand, this may result in a more stable dataset. It’s definitely worth testing!

Separately, if you are compressing the whole stream, I’m wondering if archiving solutions would be a better fit? E.g. add the whole stream to an xz archive, and let it handle compression and deduplication? This would end up with lower RAM requirements as a result.

Diagon · 18 August 2022 21:35

It might be the opposite. I noticed in the docs that the suggestion is for duplicacy to be set to fixed size chunking for large files:

The recommended configuration is to set all three chunk size parameters in the init or add command to the same value:

    `duplicacy init -c 1M -min 1M -max 1M repository_id storage_url`

Duplicacy will then switch to the fixed size chunking algorithm which is faster and leads to higher deduplication ratios.
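To make the fixed-size idea concrete, here is a rough sketch that splits a file into equal chunks, hashes each one, and counts how many chunks of a new version are not already present in the old one. It is not duplicacy's implementation. The helpers chunk_hashes and new_chunks and the tiny 1 KiB chunk size are made up for illustration, and the script assumes GNU coreutils (split, sha256sum, comm).

```shell
#!/bin/sh
set -eu

# Print one SHA-256 per fixed-size chunk. Args: FILE CHUNK_BYTES.
chunk_hashes() (
  dir=$(mktemp -d)
  split -b "$2" -a 4 "$1" "$dir/chunk."
  sha256sum "$dir"/chunk.* | cut -d' ' -f1
  rm -rf "$dir"
)

# Count chunks of NEW whose hash is absent from OLD, i.e. chunks a
# deduplicating store would still have to upload. Args: OLD NEW CHUNK_BYTES.
new_chunks() (
  tmp=$(mktemp -d)
  chunk_hashes "$1" "$3" | sort -u > "$tmp/old"
  chunk_hashes "$2" "$3" | sort -u > "$tmp/new"
  comm -13 "$tmp/old" "$tmp/new" | wc -l | tr -d ' '
  rm -rf "$tmp"
)

work=$(mktemp -d)

# v1: four 1 KiB blocks filled with A, B, C and D.
for c in A B C D; do head -c 1024 /dev/zero | tr '\0' "$c"; done > "$work/v1.img"

# v2: the second block overwritten in place; offsets of the others are unchanged.
cp "$work/v1.img" "$work/v2.img"
head -c 1024 /dev/zero | tr '\0' X |
  dd of="$work/v2.img" bs=1024 seek=1 conv=notrunc 2>/dev/null

# v3: one byte inserted at the front, which shifts every later byte.
{ printf 'Z'; cat "$work/v1.img"; } > "$work/v3.img"

in_place=$(new_chunks "$work/v1.img" "$work/v2.img" 1024)
shifted=$(new_chunks "$work/v1.img" "$work/v3.img" 1024)
echo "in-place overwrite: $in_place new chunk(s)"
echo "one-byte insert:    $shifted new chunk(s)"
rm -rf "$work"
```

In this toy case the in-place overwrite leaves one new chunk, while the one-byte insert changes every chunk. That is the trade-off with fixed-size chunking: it suits data that is modified in place at stable offsets, and how well a btrfs send stream fits that pattern would need testing.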

Recovery with such a system would be much easier, because of procfs, sysfs, devfs, tmpfs etc…

saspus · 18 August 2022 21:39

Good point. This (fixed chunking) essentially turns duplicacy into Vertical Backup which is specifically optimized for disk image backups. I guess handling data streams should be a good feature request there as well, unless already supported.

Diagon · 18 August 2022 21:53

Ok, I’ll definitely check into that.

Do we have an IRC or Matrix/Discord or something? Might be worthwhile for me to check in there, if so.

gadget · 18 August 2022 23:35

I suspect that network and storage savings from deduplication will be a mixed bag because the stream format used by btrfs send doesn’t generate anything remotely close to a dd if=/dev/sda. It’s more like encoding write operations (e.g., the protocol includes a “create a symlink” command), plus there’s a minimum of two subvolumes as input.

Sounds like Linux, if not some variant of Un*x, so just my $0.02…

I’ve always been torn between backing up a whole filesystem versus backing up only non-system files.

There was a time when I favored system imaging until /proc and other special system directories became much more dynamic and ethereal (nowadays, how often is there a need to manually use mknod?). NetworkManager, nftables, systemd and other newer system components for Linux regenerate everything from config files – e.g., the laptop I’m typing this post from currently has 12,224 subdirectories containing 276,110 files under /proc, none of which are of any value to me in a backup.

Given the extreme ease of reinstalling a typical Linux system from scratch (most distros have nary a Rumpelstiltskin-like agreement to click thru during the process), it’s often faster to reinstall a base system → add any extra packages → then restore system config/user data (especially if part or all of the process is scripted) than it is to re-image the same machine from a snapshot → then restore any files that had changed since the snapshot was taken.

I’ve also found that imaging often results in preserving a system that accumulates so many small changes over time, that the person who ends up needing to deal with restoring it during disaster recovery isn’t always the same person who also knows how the system was originally built (it’s like the electric wiring in a house after 50 years). By which point no one wants to risk touching it unless absolutely necessary for fear of something breaking.

sevimo · 19 August 2022 11:24

If you have access to another volume, I’d just save the image to a file and then run duplicacy on that. This also helps with common restore cases, as the most recent version(s) are local and do not need to be downloaded from remote.
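A minimal sketch of that save-to-file workflow for the btrfs case might look like the following. All paths and the snapshot name are hypothetical placeholders, the repository directory is assumed to have been set up with duplicacy init beforehand, and the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholders; adjust to your own layout.
SUBVOL=/
SNAP_DIR=/.snapshots
REPO_DIR=/mnt/scratch/btrfs-images   # assumed already initialised with `duplicacy init`
STREAM_FILE=root.btrfs
DRY_RUN=${DRY_RUN:-1}                # set DRY_RUN=0 to really run (needs root)

# Either print a command or execute it, depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

backup_snapshot() {
  name=$1
  # Read-only snapshot, which btrfs send requires.
  run btrfs subvolume snapshot -r "$SUBVOL" "$SNAP_DIR/$name"
  # Write the send stream to a regular file on the other volume.
  run btrfs send -f "$REPO_DIR/$STREAM_FILE" "$SNAP_DIR/$name"
  # duplicacy operates on the repository in the current directory.
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ cd $REPO_DIR && duplicacy backup"
  else
    (cd "$REPO_DIR" && duplicacy backup)
  fi
}

plan=$(backup_snapshot "root-2022-08-19")
echo "$plan"
```

Reusing one stream file name keeps only the latest image on the local volume, while older revisions live in the duplicacy storage. The cost is that the scratch volume needs enough free space for a full send stream.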

I am running something like that, except my images are from Veeam and not btrfs.

Diagon · 25 August 2022 20:29

Ended up posting a GitHub request, here.