Tuning for volume snapshot data

TL;DR - I am looking for recommendations/feedback on how to best set up duplicacy to back up a bunch of disk snapshots from several machines.

So I am using Veeam agents to store volume-level snapshots of the system drives from several machines (say, about 10 or so, perhaps more) on a NAS. If you’re not familiar with Veeam, the solution takes periodic snapshots of the whole disk and stores each as a single file, with incremental files generated on other days that cover only the changes (and so are usually much smaller).

But let’s ignore increments for now and pretend that we only need to handle full backups. What I want to do is dump these snapshots into cloud storage(s) via duplicacy. The primary goal is to minimize storage requirements on the NAS; minimizing cloud storage/transfer time is a secondary goal.

Veeam does compress images and has several levels of compression. Compression can be completely disabled, it can use dedupe-friendly compression (I believe RLE), or the ‘optimal’ setting, which I believe is LZ4 (there are stronger compression levels, but these usually make little sense due to CPU usage). The differences in local storage requirements are considerable: something like 430GB / 324GB / 240GB for a 500GB system drive.

The question is how effective duplicacy’s dedupe would be on such a source. In theory, there should be a lot of overlap in data between different machines, as a lot of them would be regular Windows laptops with the bulk of the data being the same OS (e.g. Win 11). But given that our source is a disk image in a single file, would we see meaningful dedupe across machines? And how would RLE/LZ4 compression on such files impact it?

I am planning to do a separate repository and probably a separate storage for these snapshots, as this would be very different data from the regular file-based stuff, likely with a very different pruning policy.

If you don’t have experience with compressed images, we can start with regular uncompressed images (like the ones from dd): would you see meaningful dedupe for such images across different machines with the same OS?

Just out of curiosity: why do you back up the entire disk of the laptops and not just user files?

This allows for easy baremetal restore.

This allows for easy baremetal restore.

This. Also, application settings are almost impossible to effectively back up using file-based approaches, especially on Windows. I’d still do a separate backup of core user files, but that would be a separate process.

Why? Files are files. Application data is just another type of user file.

Reinstalling the OS and applications is a quick process (and in the case of Windows, also healthy). Not wasting resources on saving high-turnover system files in every backup, on the other hand, is worth more than saving a few minutes during a restore that happens once, if at all.

Just look – even Veeam transfers 2-4GB of data on each backup, even when nothing changes from the user perspective.

I would strongly reconsider full system backup. It’s a horrible idea in the vast majority of cases.

If you have a highly specialized system – then just create an image once, and never touch it again. Ongoing system backup is not useful.


I don’t have experience with Veeam agents. It looks like RLE is designed for exactly this purpose. However, I would suggest not using Veeam compression at all, because Duplicacy applies LZ4 compression by default, and since chunks are deduplicated before they are compressed, there is no conflict between compression and deduplication.

If the source is a disk image file then Duplicacy’s deduplication should work really well. You can even turn on fixed size chunking when initializing the storage. But, I don’t know if a Veeam snapshot/dump is the same thing as a disk image. For example, if the snapshot/dump rearranges disk sectors in a completely random order then there won’t be any deduplication.
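
If you do go that route, here is a minimal sketch of initializing a dedicated storage with fixed-size chunking; the chunk size, snapshot ID, paths, and storage URL below are placeholders, not a recommendation:

```sh
# Minimal sketch -- snapshot ID, chunk size, paths, and storage URL are placeholders.
cd /mnt/nas/veeam-snapshots                 # hypothetical folder holding the Veeam files
duplicacy init -c 4M -min 4M -max 4M veeam-images b2://my-veeam-bucket
# Setting -min and -max equal to -c turns off variable-size chunking,
# which tends to suit large image files that are modified in place.
duplicacy backup -stats                     # -stats reports total vs. new chunks
```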

Have you tried to back up/restore the Windows registry recently? If you’re willing to re-install and re-configure all applications, good for you. I have machines that have been running for more than a decade; who knows what is set up where. With volume imaging I can restore anything in less than an hour, without even thinking about it. There is no way I’m reinstalling from scratch; in many cases I don’t even know what needs to be set up again until it is needed.

To each his own.

Yeah, they claim RLE is dedupe-friendly, though they’re talking about dedupe appliances, and I don’t know the specifics of how those work. Using Veeam with no compression is not very appealing, as local storage then balloons - we’re talking several additional TB that I would need to allocate for that locally. I’d rather have less dedupe in the cloud (with more aggressive pruning there), as the cloud would be the last resort.

Unless I can point Veeam directly to the cloud (via an rclone share?) and bypass local storage, but from what I understand this is a bad idea for a variety of reasons.

But, I don’t know if a Veeam snapshot/dump is the same thing as a disk image. For example, if the snapshot/dump rearranges disk sectors in a completely random order then there won’t be any deduplication

Yeah, that’s the part I am not sure about. The format is proprietary, so not a whole lot is known about how exactly they store it. The fact that one can schedule maintenance of such files for defrag/compression leads me to believe that they are likely not rearranged much until such a maintenance task is run (if ever). Also, since they incorporate increments into the full snapshot, it stands to reason that they won’t want to rearrange the whole large file just to incorporate a small increment. But this is circumstantial evidence.

Is there an easy way of checking dedupe stats on a bunch of files without running an actual backup? I can try to run it on a bunch of snapshots, though it still won’t give me the longer-term trajectory.

P.S. Actually, now that I think about it, do you expect much deduplication between snapshots from different machines even if compression is completely off? The overall data would be similar, but the sector distribution would likely be different, and I am not sure if duplicacy would be able to dedupe such snapshots effectively. The disk sector size is likely to be much smaller than duplicacy’s chunk size (but perhaps with low fragmentation this doesn’t matter much?)
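
One way I could test this without touching the cloud, assuming some scratch space on the NAS: back up one full snapshot per machine into a throwaway local storage and let duplicacy report the overlap itself. It does involve running real backups, but only against local disk. All paths and snapshot IDs below are made up:

```sh
# Throwaway local storage used only for measuring deduplication -- paths/IDs are made up.
mkdir -p /mnt/nas/dedupe-test

# Machine A's full snapshot, with fixed-size chunks as suggested above.
cd /mnt/nas/veeam/machine-a
duplicacy init -c 4M -min 4M -max 4M machine-a /mnt/nas/dedupe-test
duplicacy backup -stats

# Machine B's full snapshot, pointing at the same storage (chunk parameters
# come from the existing storage config, so the flags are only needed once).
cd /mnt/nas/veeam/machine-b
duplicacy init machine-b /mnt/nas/dedupe-test
duplicacy backup -stats    # new chunks vs. total chunks = cross-machine overlap

# Per-revision breakdown of unique vs. shared chunks across both snapshot IDs.
duplicacy check -a -tabular
```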

That’s what my comment was about:

Set it up, image it once, save the image. What do you need ongoing system backup for?

These images are not static: applications get added/removed and change state without users touching particular files. E.g. an application might store a bunch of data in sqlite files somewhere in the user profile, program data, or elsewhere. One can track such files to a degree on a one-to-one basis, which takes a lot of work, and restoring them is even less obvious (you need to know what to overwrite and where, and hope you didn’t miss some other part of the state, like the registry).

Again, I’d rather not argue about what my requirements are. If a purely file-based approach works for you (and it does for many people) - great. But there are reasons why volume-based backup solutions exist and are quite common in the enterprise world.

On Windows, if you manage a thousand (or even a hundred) machines, a full disk image is the way to go.

Fresh installs are just for new machines (every 3-5 years).
File-level backup works great on Linux.