Is there a benefit to breaking up a large repository?

I’m currently trialling Duplicacy, having also trialled CloudBerry, Restic and Duplicati. I’ve been pretty impressed so far.

I have around 3TB of data I’m looking to back up to B2. All of this data is stored in a single folder and so could be set up as a single repository.

This said, the parent folder does have a small number of fairly equally sized sub-folders that could instead be set up individually as separate repositories.

I’m wondering if there is any benefit (or drawback) to doing this. I’m not sure of the inner workings of Duplicacy, but one thought I had was that this could mean smaller separate local databases/caches rather than one large one?

Any thoughts would be appreciated.

Welcome to the forum!

Duplicacy works by clumping all files together and then shredding that sausage into pieces, with smart optimizations to facilitate efficient deduplication. You can review the design document here: duplicacy/DESIGN.md at master · gilbertchen/duplicacy · GitHub
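If it helps to picture it: the whole backup is treated as one packed stream, and chunk boundaries are chosen by a rolling hash over the content, so the same data always splits into the same chunks, which is what makes deduplication work. Here is a toy Python sketch of that idea (not Duplicacy’s actual algorithm; the hash, window size and chunk-size targets below are made up purely for illustration):

    # Toy sketch of content-defined chunking, NOT Duplicacy's actual code.
    import hashlib, os

    WINDOW = 48            # bytes hashed per window (illustrative)
    BOUNDARY_MASK = 0xFFF  # ~one boundary every 4 KiB on average (illustrative)

    def chunk_stream(data: bytes):
        """Cut a byte stream into variable-size chunks at content-defined boundaries."""
        chunks, start = [], 0
        for i in range(WINDOW, len(data)):
            # A real implementation slides an O(1) rolling hash; re-hashing the
            # whole window here just keeps the example short.
            h = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
            if h & BOUNDARY_MASK == 0:
                chunks.append(data[start:i])
                start = i
        chunks.append(data[start:])  # final chunk
        return chunks

    stream = os.urandom(200_000)  # stand-in for "all files packed into one stream"
    for c in chunk_stream(stream):
        print(hashlib.sha256(c).hexdigest()[:16], len(c))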

Reducing the size of the datastore makes sense if your storage backend is a high-latency one (such as Google Drive or Dropbox): fewer chunks in the datastore minimizes the impact of slow list requests on folders with a massive number of files. (Alternatively, you can configure the depth of the directory structure where chunks are stored; this helps to a degree, see the sketch below.)
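Roughly speaking, a chunk’s hash determines where it lives under the chunks/ folder, and the nesting depth controls how many files end up in each directory. A toy Python illustration of that idea (not Duplicacy’s actual code; the exact layout and defaults may differ between versions):

    # Hypothetical sketch: deeper nesting means fewer files per directory,
    # so a slow backend has less to return per list request.
    def chunk_path(chunk_hash: str, levels: int = 1) -> str:
        parts = [chunk_hash[i * 2:(i + 1) * 2] for i in range(levels)]
        return "chunks/" + "/".join(parts) + "/" + chunk_hash[levels * 2:]

    print(chunk_path("1a2b3c4d5e6f", levels=1))  # chunks/1a/2b3c4d5e6f
    print(chunk_path("1a2b3c4d5e6f", levels=2))  # chunks/1a/2b/3c4d5e6f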

Your storage, B2, does not have this issue; it’s designed for this sort of use, so splitting the repository for that reason is not useful.

Another reason might be if you want to implement different backup retention policies and schedules on different subsets of your data. You can create distinct repositories in a folder and symlink stuff there, or use different filters (a rough filters example is sketched below). Duplicacy follows first-level symlinks, which makes this possible.
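For example, a filters file for one of the repositories might look roughly like this (the folder names are made up, and you should double-check the exact matching rules in the filters guide before relying on it):

    +photos/
    +photos/*
    +documents/
    +documents/*
    -*

As far as I remember, patterns are evaluated top to bottom and the first match decides, so the trailing -* excludes everything that was not explicitly included above it.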

The drawback is that you have to do more work setting it all up.

TL;DR: no, I would not split the repository, just to keep everything as simple as possible.


@saspus - Very useful info, thank you for spending the time to come back to me.

I don’t really dare to contradict you, but I was planning to write a post in which I would conclude that I should have split up my repository into multiple smaller ones, because everything just seems to take so long (see here for example: that check job has been running for over 40 hours now, and it’s checking only a single repository snapshot, not all of them)…

WebDAV is the worst possible backend for bulk storage. Its latency is ridiculous, which is not really unexpected given its design goals. This falls under the umbrella of the second paragraph above.

But I would still recommend moving to a better-suited backend first. With so many alternatives available, why stick with pCloud? All the more so when your dataset is large.


Both options (a single repo vs. separate repos) have advantages and disadvantages (like everything, after all).

I personally prefer the most fine-grained configuration (multiple repositories and storages); it’s easier to adopt different prune and scheduling configurations.

OK, good to know. So maybe I should add another depth layer to my directory structure? Are you referring to this?

I suspect I can’t change the directory structure on my existing storage… (Although technically, it should be possible for Duplicacy to move the existing chunks to their respective subdirectories and rename them. But implementing that is probably not worth the effort.)

Cost. I bought 4 TB of lifetime storage from pCloud and I’m still hoping to eventually get it to work. Backups via the CLI have been running smoothly for quite a while now; it’s the other commands that are causing issues.

At B2 I’d be paying about $20 per month for that (4 TB at roughly $0.005/GB/month), and to be honest I don’t want to pay that much for my backup storage. Then I might as well just use Backblaze Personal Backup…
