Pruning strategy for high rate-of-change data / mixing pruning settings

So I have a couple of datasets that are large blobs or have a high rate of change; examples include vzdump files from Proxmox VM backups and surveillance video from cameras. For all my other data I’m finding dedupe very effective at storing extra versions without much additional storage, but I’m wondering about these datasets. My general prune arguments are as follows, with hourly backups:

-keep 30:365 -keep 7:30 -keep 1:7

I suppose my question has a few parts…

  1. Does anyone know how well dedupe with Duplicacy works on vzdump images? The base VMs are not changing much per day, so if the dedupe works well, this may be a non-issue.
  2. Surveillance video files seem like they wouldn’t offer much opportunity for dedupe. Is anyone else backing up this kind of data?
  3. Assuming either of these doesn’t dedupe efficiently, I guess I will need more aggressive pruning… Should I create a separate storage location for these datasets so that their prune settings don’t conflict with my general ones above? Or should I use the -id option to prune just the video and vzdump IDs more aggressively, then do a second prune pass with -all using my general settings above (roughly as sketched below)?
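
If I went with the second option, I imagine the two passes would look something like this (the snapshot IDs vzdump-vms and cameras are just placeholders for my real IDs, and -keep n:m means keep one snapshot every n days for snapshots older than m days, with n = 0 dropping them entirely):

duplicacy prune -id vzdump-vms -keep 0:30 -keep 1:7
duplicacy prune -id cameras -keep 0:30 -keep 1:7
duplicacy prune -all -keep 30:365 -keep 7:30 -keep 1:7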

Thanks in advance for the community’s advice!

The following topics might be of interest here:

And here:

I’ve been getting good deduplication results on VirtualBox VM backups with a fixed-size chunk setup.
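
For reference, a fixed-size chunk storage is created by setting the average, minimum and maximum chunk sizes to the same value when the storage is initialized or added; the snapshot ID, storage name, paths and the 1M size below are just example values:

duplicacy init -c 1M -min 1M -max 1M my-vms /path/to/storage
duplicacy add -c 1M -min 1M -max 1M fixed_vms my-vms /path/to/fixed-storage

(the first form for a new repository, the second for adding an extra storage to an existing one).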

Yes, probably no deduplication. I back up my movie library and don’t see any reuse of chunks.

I prefer the first option, with a separate storage. It will give you more granular backup control, and future configuration adjustments (e.g. prune settings) will be easier to make.

Was that part of your testing where you compared fixed and variable chunk sizes? If so, what was the difference?

More specifically, what I’m asking is whether it makes sense for me to change my current VM backup from variable to fixed chunk size or whether I should just leave it alone…

Since I ran those tests, and after Gilbert’s various comments that this is the configuration used by Vertical Backup, I have always used the fixed-size chunk configuration for large files that undergo minor modifications: in my case, basically MySQL databases, VeraCrypt volumes, the Evernote database and VM files.

As I no longer do this type of backup with variable chunks, I can’t show you the difference, but I can give you a concrete example.

A folder with two virtual machines and a few associated files (logs, etc.; I admit that I lazily didn’t filter the backup):

[Screenshot of the VMs folder; the two purple items are the .vdi files]

The backup is performed manually and irregularly (note the dates; these are not critical files), but the machines are used a few times a week:

| rev |                          | files |   bytes | chunks |   bytes |  uniq |     bytes |   new |    bytes |
|   1 | @ 2018-12-18 13:00 -hash |   397 | 13,946M |  14047 |  7,834M |  2463 |    1,189M | 14047 |   7,834M |
|   2 | @ 2019-01-07 08:44       |   389 | 13,946M |  13849 |  7,680M |  1532 |  709,010K |  2281 |   1,039M |
|   3 | @ 2019-02-05 13:39       |   389 | 13,949M |  14049 |  7,803M |  2391 |    1,176M |  3917 |   1,908M |
|   4 | @ 2019-03-12 10:23       |   389 | 13,951M |  13978 |  7,729M |  2079 | 1018,387K |  3310 |   1,611M |
|   5 | @ 2019-04-09 14:16       |   393 | 13,958M |  14070 |  7,767M |  2207 |    1,044M |  3233 |   1,549M |
|   6 | @ 2019-05-27 16:34       |   389 | 13,958M |  14126 |  7,892M |   412 |  183,321K |  4061 |   2,074M |
|   7 | @ 2019-05-28 17:10       |   389 | 13,958M |  14126 |  7,899M |   348 |  136,913K |   415 | 191,581K |
|   8 | @ 2019-07-01 22:14       |   389 | 13,970M |  14177 |  7,871M |  1438 |  686,846K |  4831 |   2,341M |
|   9 | @ 2019-07-10 10:30       |   389 | 13,970M |  14059 |  7,826M |  1521 |  759,471K |  1521 | 759,471K |
| all |                          |       |         |  37616 | 19,290M | 37616 |   19,290M |       |          |

Note the “new” chunks and related “bytes”.

The two largest backups (revisions 7 and 9) occurred when the OSes were updated.


I have never been good at interpreting Duplicacy stats (it confuses me that Duplicacy speaks of chunks being “uploaded” when the chunk was already present in the storage), so let’s see whether I get this one right: what we’re seeing is that out of a repository that is almost 8 GB in size, a substantial proportion (1–2 GB) is being re-uploaded every time, right? That’s a lot, I think, but what can we expect with fixed-size chunks? I would hope well below 1 GB, perhaps even less than half a GB? But I have no idea.

Maybe I should just try it out by excluding my VMs from my regular backup and putting them in a separate fixed-chunk backup. But I’m not sure how to get the relevant stats before and after… I’m also not sure whether I can use the same storage for both fixed and variable backups.
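
(I’m guessing the comparison stats would come from running something like

duplicacy check -tabular -storage storage-name

against each storage, with storage-name being whatever I call them, but correct me if that’s not the right command.)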

Yes, some text in the logs is confusing, but I think I’ve gotten used to it, for better or worse. Do you remember this topic?

Nope, the repository takes up 13 GB on local disk: the 13,946M in the first “bytes” column above.

Yes. But note that the backups are done several weeks apart, while the machines are used several times a week.

Let me describe the usage in a bit more detail: the bigger VM is an Ubuntu machine that I use for internet banking. The smaller one is a WinXP VM that I use to access an old e-catalog, which I couldn’t get to work at all on newer versions of Windows.

In either case, I just don’t want to lose the machine configuration, so I don’t need frequent backups.

An idea: leave your VMs in the current backup as they are (for now, while you evaluate whether or not to adopt another solution), and create two new storages: one with variable chunks, to which you back up only the VMs (using filters), and one with fixed chunks, to which you also back up only the VMs. Follow along for a while and see if there is any difference.
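
A rough sketch of the commands, just to illustrate; the storage names, snapshot ID, paths, the 1M chunk size and the filter patterns are all placeholders to adapt to your setup (and double-check the patterns against the filters documentation):

duplicacy add var_vms vm-test /path/to/variable-storage
duplicacy add -c 1M -min 1M -max 1M fixed_vms vm-test /path/to/fixed-storage

together with a .duplicacy/filters file that keeps only the VM folder, conceptually something like:

+VMs/
+VMs/*
-*

and then duplicacy backup -storage var_vms and duplicacy backup -storage fixed_vms for each run.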

If you use the same current storage you will not be able to compare easily due to the other chunks.

If the results are not good, just delete the new storages and continue the way you were already doing.

If the results are good, the new storage is already initialized and in use.


To expand on this idea, you could easily seed both new storages with snapshots from the original storage.

You could even script it by iterating through all the revisions and restoring the VMs, starting from the oldest revision and working up to the latest. After restoring each revision, one by one, back up to the new storage(s). With this method, you could directly compare the efficiency of fixed and variable-size chunks.
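
Something along these lines, as a rough, untested sketch; the revision count (1 to 9 here), pattern and storage names are placeholders, the "+VMs/*" pattern restricts the restore to the VM folder, and this is probably best run in a scratch copy of the repository since -overwrite replaces the working files:

for rev in $(seq 1 9); do
    # restore just the VMs as they were in revision $rev
    duplicacy restore -r $rev -overwrite -- "+VMs/*"
    # re-chunk that state into each of the new storages
    duplicacy backup -storage var_vms
    duplicacy backup -storage fixed_vms
done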

Although I am willing to bet fixed size will win for both efficient storage space and computational overhead. Either way, you could see how much is saved.

Incidentally, I have a separate storage for VMs and have already set it to fixed-size chunks; backups are done manually.

One thing I do to improve efficiency for some of the larger VMs is keep at least one VM-level snapshot containing the base state. When I use the VM, any changes get stored in the delta .vmdk’s, not the main, big images. When Duplicacy runs a backup, it skips the main .vmdk’s, since even their modified dates are unchanged. Also, before running the backup, I tend to run a quick disk cleanup, which defrags and shrinks those .vmdk’s further.


Well, yes and no:

TBH, I don’t want to spend too much time trying to understand how to read and interpret cryptic stats. Maybe I’ll eventually get it (I’ve bookmarked the topic for future reference), but I’m not going to sit down and study it.

Yes, I guess this would be reasonably easy to do. I could take the opportunity to start using the web UI for this (still hoping for a simple way to migrate from the CLI to the web UI), but I’m just back at work after the summer vacation and I’m not sure I’ll find the time. When I do find time for fiddling with my home server, my priority at the moment is to install Docker and see whether it lets me get rid of my VMs altogether…
