Merging repositories

Hi,

I am in the process of migrating my data from single HDDs to a zfs pool.

Right now I have 3 HDDs, each of which is set up as a single repository. If I merge all the HDDs into one zfs pool, can I merge the repositories as well? Or is there some other way to avoid re-uploading all the data?

Welcome to the forum!

Duplicacy operates at the file level, so it doesn't matter how you set up your file system or how many hard drives you pool together.

Yes, you can. That's what Duplicacy is all about.


I think I don’t understand this quite right.
My setup:

3 hdds, each with a directory structure like this:

.
├── data
│   ├── HDD1_DATA1
│   └── HDD1_DATA2
└── .duplicacy
    ├── cache
    ├── logs
    ├── preferences
    └── scripts

with storage set to:

gcd://Backup/nas/data01
gcd://Backup/nas/data02

etc.

So what do I need to do to end up with a single Duplicacy repo like this:

├── HDD1_DATA1
├── HDD2_DATA1
├── HDD3_DATA1
└── .duplicacy

with

gcd://Backup/nas/

as storage?

Thank you very much for the help.

First of all, it looks like you have three separate storages. Even with separate HDDs (multiple repositories), you could have pointed all three repositories at a single storage and benefited from greater de-duplication.

Not to worry, but you're going to have to pick one of those storages going forward (the largest?): unless you created the others with add -copy (copy-compatible), you won't be able to merge them.
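
(For context, a copy-compatible storage is created with something like the following, run inside an already-initialised repository; the storage name, snapshot ID and URL here are just placeholders:)

# Add a second storage that is copy-compatible with the existing "default" one
duplicacy add -copy default offsite myrepo gcd://Backup/offsite

# Snapshots can then be transferred between the two with:
duplicacy copy -from default -to offsite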

But with this new-found knowledge that you can have multiple repositories (from multiple computers, even) backing up to a single storage, you can see that it doesn’t really matter how you organise your source data. Yes, you can have a single repository backing up all three HDDs (or pool), if that’s what’s easier. Or you can keep the current one repository per drive (or current structure), or create as many base directories for as many repositories as you want.
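
For example (the paths and snapshot IDs below are made up), each drive could be its own repository backing up to the same storage under its own snapshot ID:

cd /mnt/hdd1/data && duplicacy init nas-hdd1 gcd://Backup/nas && duplicacy backup
cd /mnt/hdd2/data && duplicacy init nas-hdd2 gcd://Backup/nas && duplicacy backup
cd /mnt/hdd3/data && duplicacy init nas-hdd3 gcd://Backup/nas && duplicacy backup

Chunks shared between the drives are only stored once on the storage.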

It will all de-duplicate, with minimal re-upload. (Aside from the fact that you can't merge those storages, so you'll have to re-upload two of the drives after picking one of them as your main storage going forward.)

You can and should, of course, keep the deprecated storages around until you don’t need the historic snapshots any more.


Ok, now I get it. The storage is the determining part.

So without having used -copy when creating the storages, I can’t merge them by just putting all chunks from all 3 storages together?

But I can simply copy the storage I want to go forward with to another folder on my GDrive and use it as storage for all my repos?

Yup.

You can even just move/rename, instead of copying, the storage on Google Drive if you want to reuse it. Snapshots from old repository IDs will remain in there until they’re actively pruned. Best practice imo would be to use new repository ID(s) for your new backups.

To be completely sure, what I have to do is the following:

  1. Move the biggest existing storage: gcd://Backup/nas/data01 -> gcd://Backup/newstorage
  2. Init a new repo using the existing storage: duplicacy init newrepo gcd://Backup/newstorage
  3. The config gets pulled from the storage and the now-missing files get uploaded
  4. When uploading has finished, prune the snapshots from the repos I no longer use

Is this correct?
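
In commands, I imagine it would look roughly like this (the pool path and the newrepo ID are just placeholders):

# 1. Rename the storage folder in the Google Drive web UI:
#    Backup/nas/data01 -> Backup/newstorage
# 2. Initialise the merged repository against the existing storage:
cd /mnt/pool/data
duplicacy init newrepo gcd://Backup/newstorage
# 3. Run the first backup; chunks already on the storage should be skipped:
duplicacy backup -stats
# 4. Later, prune the snapshots of the old per-drive repository IDs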

And one follow-up question: if multiple repos can use the same storage, where is the information about which files belong to which repo saved?

  2. When you issue this command, it should tell you that the storage has already been initialised, and it will create a .duplicacy directory at the root of the repository.

If you happen to already have a .duplicacy directory, you can re-use it, but then you'd need to edit the preferences file to point to the new storage location, or give the storage a different name (with `-storage-name`). (Or re-use the repository ID, in which case it's already initialised and ready to back up.)
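
(The preferences file is JSON; a trimmed sketch of the relevant fields, which may vary a little between versions, with placeholder values:)

[
    {
        "name": "default",
        "id": "newrepo",
        "storage": "gcd://Backup/newstorage",
        "encrypted": false
    }
]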

Personally, I would rename it out of the way (to .duplicacy.old) and use new repository IDs to distinguish between the old and new setup.

If you do this, note that Duplicacy will rehash everything and take a bit longer, since it's the first backup of the new repository, but it will skip chunks that already exist on the storage.

  3. Minor point: the config stays on the storage and gets pulled on each backup, but yes, anything missing gets uploaded.

  4. Completely up to you. Or use the same retention rules (prune -keep) for the whole storage, and old snapshots will get removed based on age, the same way newer snapshots do.
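
For instance, a retention policy like this (the numbers are just an example) applies to every snapshot ID on the storage via -a:

# Drop snapshots older than a year, keep one per month after 180 days,
# one per week after 30 days, and one per day after 7 days:
duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7 -a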

To answer your final question: the storage has a snapshots folder, and each snapshot file (usually numbered 1, 2, etc.) records which chunks it uses.

Multiple snapshots can reference the same chunks (de-duplication), and the genius part is how it prunes unused chunks through the lock-free de-duplication algorithm that @Christoph linked to. :slight_smile:
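
Simplified, the storage layout looks something like this (the snapshot IDs are just examples):

gcd://Backup/newstorage
├── config
├── snapshots
│   ├── nas-hdd1
│   │   ├── 1
│   │   └── 2
│   └── newrepo
│       └── 1
└── chunks
    ├── 00
    ├── 01
    └── ...

Each file under snapshots/<id> is one revision of that repository; the chunks folder is shared by all of them.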


Thanks a lot. It worked.

That’s pretty impressive!

For anyone: Feel free to use the :heart: button on the posts that you found useful.

For the OP of any #support topic: you can mark the post that solved your issue by ticking the :checked: under that post. That may, of course, include your own post. :slight_smile: