Duplicacy taking up more space on B2

Please describe what you are doing to trigger the bug:
Regular nightly backup and prune to B2 storage. I only have 1 snapshot at any given time. Local data takes up 503GB, but the chunks show 834GB used.

The commands I run nightly via cron are:

/usr/local/bin/duplicacy backup -stats -limit-rate 3750 -threads 24
/usr/local/bin/duplicacy -v -log prune -keep 0:1 -threads 24
/usr/local/bin/duplicacy check -a -stats

Please describe what you expect to happen (but doesn’t):
I expect the total chunk size to roughly match the local data size after several prunes.

Please describe what actually happens (the wrong behaviour):
There’s a ~300GB delta: Duplicacy/B2 is using more storage than what’s on disk.

This is what duplicacy shows:

root@pbs:/mnt/datastore# /usr/local/bin/duplicacy check -a -stats
Storage set to b2://CorsairBackup
download URL is: https://f000.backblazeb2.com
Listing all chunks
1 snapshots and 1 revisions
Total chunk size is 833,925M in 150724 chunks
All chunks referenced by snapshot datastore-backup at revision 32 exist
Snapshot datastore-backup at revision 32: 158666 files (500,088M bytes), 833,925M total chunk bytes, 833,925M unique chunk bytes
Snapshot datastore-backup all revisions: 833,925M total chunk bytes, 833,925M unique chunk bytes

This is what du shows:

root@pbs:/mnt/datastore# du -h -d0
503G .

If you only have one snapshot then it’s neither a backup nor an efficient use of duplicacy; it defeats the whole point of using it. You’d be better off with rclone copy.

There will always be some inefficiency with chunking and deduplication, but unless you have multiple snapshots and/or multiple sources to dedup across, you will only see the inefficiency and none of the benefits.

The issue is exacerbated by having a lot of transient data churn. You should exclude that temporary and derivative data from the backup, regardless of which tool you use.

Side note: 24 threads sounds excessive.

Edit: Lastly, to clean up the datastore, run prune with the -exclusive and -exhaustive flags while making sure that no other duplicacy instance is accessing the datastore. This will force-remove the unused chunks.


I think I know what’s going on here, and if you’re only going to keep 1 revision around, I know how you can fix it.

When duplicacy makes a revision, it packs your files into chunks. When it makes a new revision later, anything that hasn’t changed doesn’t need to be chunked again; the new revision simply references the chunks that already exist. Any changed files generate new chunks to back up. If you then delete the original revision, you’ll likely be left with all the chunks from the original revision plus the chunks that cover the changes, and you won’t be able to remove any of those original chunks until every revision that refers to them has been pruned. The solution is to use the -hash flag to re-hash all files. Then, when you prune everything but the latest revision, those original chunks won’t be referenced anymore.
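The chunk lifecycle described above can be sketched with a toy model (my own illustration, not duplicacy’s actual data structures; the revision contents and the `prunable` helper are hypothetical):

```python
# Toy model of chunk lifetimes across revisions. A chunk can only be
# deleted once no remaining revision references it.

revisions = {
    1: {"c1", "c2", "c3", "c4"},   # initial backup
    # Incremental backup: unchanged files keep referencing old chunks,
    # changed files add new ones.
    2: {"c1", "c2", "c3", "c5"},
    # Backup made with -hash: every file is re-read and re-chunked, so
    # only chunks cut from the current content are referenced.
    3: {"c1", "c2", "c5", "c6"},
}

def prunable(revisions, keep):
    """Chunks no longer referenced after keeping only `keep` revisions."""
    all_chunks = set().union(*revisions.values())
    kept = set().union(*(revisions[r] for r in keep))
    return all_chunks - kept

# Keep only the latest revision, as with `prune -keep 0:1`.
print(sorted(prunable(revisions, keep={3})))   # → ['c3', 'c4']
```

The point is the last line: chunks c3 and c4 only become removable once revisions 1 and 2 are gone; as long as any kept revision still points at an old chunk, it stays in storage.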

If you’re only ever going to keep around a single revision, always use -hash.
You could also try to tune the chunk sizes to better match how your data is changing.

This is a tradeoff with how duplicacy manages deduplication.


Thanks for the input, @arno! I will give -hash a try.

To address @saspus’s point about rclone: I initially did look into rclone, but unlike duplicacy, it doesn’t preserve empty folders in the directory structure. My use case is off-site B2 backup for my Proxmox Backup Server datastore, which already keeps several backup revisions. So I don’t need duplicacy to keep additional revisions for me; I mostly just need a way to transfer the data and maintain the file structure. If anyone knows of a better solution for this, I’m all ears.

I did a bit more research on my use case, and it seems like I should just start over and use large fixed-size chunks (128M?), since Proxmox Backup Server already deduplicates and chunks things for me. I realize this is not the intended use case of duplicacy, but it works, and rclone doesn’t fit my needs because of the empty-directory limitation.

To be honest, I consider the fact that the chunk size is configurable an indication that this “unintended” use is legitimate: adjusting the chunk size is a useful way to better accommodate the source data.
I won’t speak to the appropriate chunk size, though 128M sounds very large. Then again, the default is 1-16MB with a 4MB target, so maybe 128M isn’t that large after all?

This could be tedious, but maybe try a few different settings and compare the performance?

I created a new backup to B2 with a 128M fixed chunk size, and with my very limited Xfinity upload speed it took the last 3 days to finish uploading ~500GB. I think I’ll stick with this solution until there’s a good reason to change it.

I got the 128M fixed chunk size idea from Chunk size for mostly large files?. Basically, it seems that if I set a large enough fixed chunk size, local files will map roughly 1:1 to chunks. This prevents holes in the chunks, which I think is what was wasting space. And since my data set from Proxmox Backup Server is already chunked and deduplicated, this seems like the right approach.
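Some back-of-envelope arithmetic on chunk counts, using the figures quoted earlier in this thread (my own rough sketch, ignoring per-file overhead and rounding):

```python
# Approximate chunk counts for ~500 GB of source data (sizes in MB).
data_mb = 500 * 1024                 # ≈ 512,000 MB

avg_4m = data_mb // 4                # default ~4 MB average chunk size
fixed_128m = data_mb // 128          # 128 MB fixed chunks

print(avg_4m)       # → 128000 (the check output above reported 150,724 chunks)
print(fixed_128m)   # → 4000
```

So a 128M fixed chunk size cuts the chunk count by roughly two orders of magnitude, which also means far fewer B2 API calls per backup and check.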

Ah, sounds good. Thanks for the pointer to the other thread. Interesting reading.