Duplicacy deduplication efficiency

mathome1 · 23 April 2021 05:28

I have some important files that I am backing up via 2 separate mechanisms to the same file storage. In my particular case, this is a backup of Microsoft Teams files via 2 different mechanisms:

Using OneDrive on a PC to download the files, then Duplicacy to back them up, and
Using a QNAP NAS and HBS to sync the files from Teams, then Duplicacy to back them up.

In both cases, the synced files on the PC and NAS come to about 27GB. I only set up the QNAP option today, and because the two sets are pretty much exactly the same files, I would not have expected the size on the backup storage to change much; I would expect deduplication to do its thing, leaving only a small incremental increase. However, I was a bit surprised to see some significant transfers, and it appears over 2.5GB of additional space was needed. For reference, the backup is about 48,000 files and 2,620 directories, but again these are largely the same in both backups.

Is this to be expected, and why is it not closer to zero size increase?

Thanks heaps.

Cheers
Matthew

domvoyer · 26 April 2021 17:25

I wonder if it’s because you did not use fixed chunks when you did your init for your repository. It may have used chunks of varying sizes which effectively would not be deduplicated. Can you confirm how you init your repo?

mathome1 · 26 April 2021 23:46

From memory, I created the repository with a pretty generic setup with no special customisations. But it was done some years ago (Oct 2018), so would have been using whatever was the current duplicacy version and defaults at that time.

How would I confirm if it was using varying chunk sizes, and what configuration option would that be?

Droolio · 27 April 2021 03:43

Hmm why do you say variable sized chunks wouldn’t be deduplicated? Duplicacy can deduplicate both fixed and variable chunks. It’s maybe true that fixed is more efficient with certain types of data - i.e. large, randomly-accessed, files - though the opposite is also possible.

@mathome1 Are you sure the 2.5GB doesn’t include other data? Run a check -tabular so you can evaluate the differences in revisions. It’d be interesting to see if the data grows with subsequent revisions, and whether many of the 48K files were touched or not.

mathome1 · 27 April 2021 04:49

I might have misread the table, which is below. I noticed the summary at the bottom was about 2,345M, or 2.3GB. But the unique bytes listed per revision are 5,160K and 2,833K, which total only about 8M. So I am not sure how to read this, as it does not seem to add up. I assume there is a good explanation and I just need to be educated. But have I used 2GB or so, or closer to 8M, which is nothing?

A further thing to note: these backups tend to happen after another backup to the same destination, which largely has the same files. That backup runs before the one below. While the odd file might change in between, on the whole the backup below should mostly consist of files that have been backed up before.

                    snap | rev |                          | files |   bytes | chunks |   bytes | uniq |  bytes |  new |   bytes |
 per-qnap1-HBS-Perceptor |   1 | @ 2021-04-23 14:56 -hash | 47820 | 27,253M |   5368 | 21,980M |    4 | 5,160K | 5368 | 21,980M |
 per-qnap1-HBS-Perceptor |   2 | @ 2021-04-24 12:30       | 47796 | 27,253M |   5369 | 21,980M |    0 |      0 |    5 |  5,160K |
 per-qnap1-HBS-Perceptor |   3 | @ 2021-04-25 12:30       | 47796 | 27,253M |   5369 | 21,980M |    0 |      0 |    0 |       0 |
 per-qnap1-HBS-Perceptor |   4 | @ 2021-04-26 12:30       | 47803 | 27,254M |   5370 | 21,982M |    3 | 2,833K |    4 |  4,803K |
 per-qnap1-HBS-Perceptor |   5 | @ 2021-04-27 12:25       | 47820 | 27,258M |   5373 | 21,997M |    0 |      0 |   10 | 20,570K |
 per-qnap1-HBS-Perceptor |   6 | @ 2021-04-27 13:15       | 47820 | 27,258M |   5373 | 21,997M |    0 |      0 |    0 |       0 |
 per-qnap1-HBS-Perceptor | all |                          |       |         |   5387 | 22,010M |  605 | 2,345M |      |         |

towerbr · 27 April 2021 12:04

Here you have some good explanations:

domvoyer · 27 April 2021 12:13

For certain use cases, I read that doing the init with -c 1M -min 1M -max 1M could allow for better deduplication, which could potentially have helped the OP there. I suggested it because, if the repo was initialised with the defaults, it should use varying chunk sizes, which theoretically could result in two backups dividing the files into chunks of different sizes that would not get deduplicated the same way. Hope this makes sense… in any case, I was trying to help.

Droolio · 27 April 2021 12:24

That’s a pretty odd discrepancy.

I assume the mismatch is that those 4+3 ‘unique’ chunks are the actual data and the remaining 605-(4+3) are the metadata. 48K is a fair amount of files but I never imagined it would tote up to 2GB, especially compressed.

What do your backup logs (if you still have them) say about how much data was file chunks or metadata chunks? (These log lines are at the very end.)

Droolio · 27 April 2021 12:47

That’s certainly the case if you have large files that get modified in place, like a VM disk image, where nothing is shifted up or down in the byte pattern…

However, a repository of 48K files would very likely suffer greatly if trying to use fixed chunking, as a small file deletion, insertion, or any file size change in fact, would shift the bytes to the extent that subsequent chunks would never align with the fixed chunk window and so not get deduplicated.

Variable sized chunks work better here, and in most use cases, as there’s a rolling hash to find chunk boundaries (for when data is shifted by insertions etc.). It’s a bit more computationally expensive, but should result in better deduplication on normal data.
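
To make the alignment point concrete, here is a rough Python sketch. It is a toy, not Duplicacy's actual chunker: the polynomial rolling hash, the window, the mask and the chunk sizes are all made-up values chosen for illustration. It chunks a random byte stream, inserts a single byte at the front, and reports what fraction of the new chunks already exist under each scheme.

```python
import hashlib
import random

BASE, MOD = 257, (1 << 31) - 1  # parameters of the toy polynomial rolling hash


def fixed_chunks(data, size=256):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F, min_size=32, max_size=512):
    """Cut where a rolling hash of the last `window` bytes matches a bit pattern."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        length = i - start + 1
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(old_chunks, new_chunks):
    """Fraction of the new chunks whose hash already exists in the old set."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"!" + original  # one inserted byte shifts everything after it

fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(shifted))
variable_shared = shared_fraction(variable_chunks(original), variable_chunks(shifted))
print(f"fixed chunks reused:    {fixed_shared:.0%}")
print(f"variable chunks reused: {variable_shared:.0%}")
```

Because the boundary test only looks at the last few bytes, the variable chunker falls back into step shortly after the inserted byte, while every fixed-size chunk after the insertion point holds different bytes than before.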

domvoyer · 27 April 2021 13:12

That’s some great information! Thanks for sharing!

gchen · 27 April 2021 18:07

Duplicacy uses a pack-and-split method to divide files into chunks, so when you add new copies of files that have already been backed up before, a small number of new chunks may still be generated because new copies are now packed differently.
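
As a rough illustration of why packing matters, the Python sketch below models pack-and-split in the simplest way I can think of: files are concatenated into one stream, and the stream is cut wherever a checksum of the previous few bytes hits a pattern. This is an assumed toy model, not Duplicacy's real algorithm; the `split` helper, the crc32-based boundary rule, the file sizes and the indices of the missing files are all hypothetical. The second source contains no new data at all, only the same files minus five, yet a few chunks still come out new because the chunks spanning the gaps now hold different neighbouring bytes.

```python
import random
import zlib


def split(stream, window=8, divisor=64):
    """Cut after every position where a checksum of the previous `window` bytes hits 0."""
    chunks, start = [], 0
    for end in range(window, len(stream) + 1):
        if zlib.crc32(stream[end - window:end]) % divisor == 0:
            chunks.append(stream[start:end])
            start = end
    if start < len(stream):
        chunks.append(stream[start:])
    return chunks


rng = random.Random(1)
files = [bytes(rng.getrandbits(8) for _ in range(rng.randint(100, 500)))
         for _ in range(200)]

# First source: every file, packed end to end into one stream, then split.
pc_stream = b"".join(files)
# Second source: the same files except five that are missing (hypothetical indices).
missing = {10, 50, 90, 130, 170}
nas_stream = b"".join(f for i, f in enumerate(files) if i not in missing)

stored = set(split(pc_stream))
new_chunks = [c for c in split(nas_stream) if c not in stored]
new_bytes = sum(len(c) for c in new_chunks)
overhead = new_bytes / len(nas_stream)
print(f"{len(new_chunks)} new chunks, {new_bytes} bytes, "
      f"{overhead:.2%} of the second backup")
```

In this model the overhead stays small but is not zero, and it grows with the number of places where the two file lists differ.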

Duplicacy wasn’t meant to achieve 100% deduplication efficiency. Rather, lock-free deduplication and database-less chunk indexing are the 2 main design goals. I would argue that an overhead of 2.3 GB on a 27GB storage, or less than 10%, is pretty much at an acceptable level.
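
For anyone wanting to check the percentage, a quick Python calculation using the figures from the check -tabular output earlier in the thread (all in MB; the variable names are just labels chosen here):

```python
# Figures copied from the table posted earlier in the thread, in MB.
source_bytes_mb = 27_253      # "bytes" next to "files" for revision 1
stored_chunk_bytes_mb = 22_010  # chunk bytes on the "all" row
unique_bytes_mb = 2_345       # unique bytes on the "all" row

ratio_to_source = unique_bytes_mb / source_bytes_mb
ratio_to_stored = unique_bytes_mb / stored_chunk_bytes_mb
print(f"unique vs. source size:   {ratio_to_source:.1%}")
print(f"unique vs. stored chunks: {ratio_to_stored:.1%}")
```

The first ratio is the one compared against the 27GB source; the second measures the same unique bytes against the chunk bytes actually stored for this snapshot ID.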