Question about backing up a large volume to the cloud

Let’s say I have a large amount of data and want to back it up to the cloud, over a residential broadband connection. As an example, say everything is in /My_Files/:

My_Files/Folder_A/  (0.5 TB)
My_Files/Folder_B/  (0.5 TB)
My_Files/Folder_C/  (2.0 TB)
My_Files/Folder_D/  (2.0 TB)
My_Files/Folder_E/  (2.0 TB)
My_Files/Folder_F/  (2.0 TB)
My_Files/Folder_G/  (1.0 TB)

Backing up a large dataset like this for the first time is naturally going to take some time.

Let’s say I want to back up the most important data first (e.g., the small folders, A & B), and then add other, bigger folders later.

My question…

Would it be best to:

  1. Create a new backup in the backup tab (Duplicacy GUI)
  2. Use /My_Files/ as the source folder to back up
  3. Add folders C, D, E, F, G to the “exclude” list for the backup job
  4. Run the backup
  5. Wait days/weeks/months for it to finish
  6. Remove C, D, E, F, G from the “exclude” list for the backup job
  7. Run the backup again

Is this the right approach, or do I need to back up everything all at once?

Any disadvantages?
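
For what it’s worth, here’s roughly what I imagine the exclude list in step 3 would look like in Duplicacy’s filters syntax (just a sketch on my part; the folder names are the ones from the example above):

    # a leading "-" excludes a pattern; a trailing "/" matches a directory
    -Folder_C/
    -Folder_D/
    -Folder_E/
    -Folder_F/
    -Folder_G/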

Honestly, this is a good strategy, but you could do it another way…

To start, you could set up a backup repository at the root of each folder: My_Files/Folder_A/, _B, etc…

Once all your folder backups complete, you could create a new backup repository from the My_Files root and run it again. Since most of the chunks already exist in the backup storage, this is just about creating a single backup tree with new metadata. Then you could delete the old _A etc. IDs.
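
To make that concrete, here’s a rough CLI sketch of the first stage (the Web UI does the equivalent under the hood; the storage URL is just a placeholder, and the backup IDs are named after the folders):

    # hypothetical B2 bucket as the storage destination
    cd /My_Files/Folder_A
    duplicacy init Folder_A b2://my-backup-bucket    # repository root = Folder_A, backup ID = Folder_A
    duplicacy backup -stats

    cd /My_Files/Folder_B
    duplicacy init Folder_B b2://my-backup-bucket    # same storage, different backup ID
    duplicacy backup -stats
    # ...repeat for the remaining folders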

Personally, I’d probably break up all my data into separate repositories anyway, and keep them that way. Depending on how many files are in each folder, that might also be easier on RAM usage.

Another approach might be to break the data up into even smaller folders/sub-folders so your initial backups complete sooner, and then you can gradually add more folders and run incremental backups.

Remember, you won’t be able to do a restore unless you have at least one snapshot to restore from, and if it literally takes days/weeks/months to finish the initial backup, that’s a dangerous position to be in. So, let a backup complete early, add more data later.


To start, you could set up a backup repository at the root of each folder: My_Files/Folder_A/, _B, etc…

I’m still new to Duplicacy, so if you don’t mind me clarifying a few things…

When you use the term “repository”, how do you relate that to the Web UI? I did a CTRL-F on the Web GUI Guide and got no results for the term “repository”.

Repository == a backup task on the backup tab?

For example, are you suggesting:

FIRST,

  • Create a backup task for Folder_A/
  • Run it to completion
  • Create a backup task for Folder_B/
  • Run it to completion
  • Repeat for all folders, C, D, E, F, G.

THEN,

  • Create a backup task for the root folder of folders A, B, C, D, E, F, G – i.e., /My_Files/
  • Run to completion
  • (should be fast because all the chunks were already uploaded by the individual backups – is that right?)
  • Delete all the individual backup tasks for A, B, C, D, E, F, G
  • Rock and roll with 1 backup task forever (i.e., /My_Files/ backup task)

Did I get that right?

A repository in Duplicacy terminology is the backup source files for a single backup ID (yeah, I know, maybe a bit confusing, but I gather it’s based on a naming convention similar to git’s?).

That’s essentially it - unless you wanna faff around with the alternative of filters, but the end result is identical.

Again, personally, I might leave individual backup IDs for each folder but that depends on how organised your directory structure is and if you’ll be adding new folders. If you have a lot of files and not much RAM, breaking it up may help.
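
To put the second stage in CLI terms (again, placeholder storage URL; in the Web UI it’s just a new backup with /My_Files/ as the source and a new ID):

    cd /My_Files
    duplicacy init My_Files b2://my-backup-bucket    # new backup ID, same storage
    duplicacy backup -stats    # mostly metadata – existing chunks are deduplicated, not re-uploaded
    duplicacy list -a          # should show My_Files snapshots alongside the old per-folder IDs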


I always prefer a more granular configuration for backups (which in your case would be a backup per folder A, B, etc.)

Some advantages:

  • better control over schedules (I have a configuration similar to yours, with a smaller, more important folder that is constantly updated and gets daily backups; the others, which don’t change as often, are backed up at longer intervals (weekly), and some are even triggered manually);
  • better control over prune settings (see the sketch after this list);
  • fewer concerns about filters;
  • easier/targeted restores (in case I ever need them);

and others.
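
For example, each backup ID can have its own retention; a rough sketch with CLI options (the Web UI prune job accepts the same options, as far as I know):

    # important, frequently-changing folder: keep everything for a week, then roughly
    # one per day up to a month, one per week up to 6 months, one per month up to a year,
    # and nothing older than a year
    duplicacy prune -id Folder_A -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
    # rarely-changing folder: a much simpler policy
    duplicacy prune -id Folder_C -keep 0:360 -keep 30:30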

About terminology, take a look at the terminology guide here on the forum.


A repository in Duplicacy terminology is the backup source files for a single backup ID (yeah, I know, maybe a bit confusing, but I gather it’s based on a naming convention similar to git’s?).

That’s essentially it - unless you wanna faff around with the alternative of filters, but the end result is identical.

Perfect! Thank you.

Again, personally, I might leave individual backup IDs for each folder but that depends on how organised your directory structure is and if you’ll be adding new folders. If you have a lot of files and not much RAM, breaking it up may help.

I have TBs of small files (< 20 MB), and they’re well organized. As for RAM, I just have a basic Synology DS1819+ with 4 GB RAM. I have the Duplicacy Web UI running on a spare laptop with 8 GB RAM.

Is that enough? (Were you speaking of the memory on the NAS or the laptop?)

I always prefer a more granular configuration for backups (which in your case would be a backup per folder A, B, etc.)

Thank you. As I think about this, it really sounds like the way to go. The advantages you point out are compelling!

One concern…

My current daily, local Duplicacy backups (where the source is a local NAS and the destination is another local NAS) take a very long time. Sure, the first backup took over a week, but every subsequent backup takes about 5 hours. Even if nothing has changed with my data, daily backups reliably take almost exactly 5 hours. Is this expected?

Now, 5 hours to run a daily backup hasn’t been a problem, but what happens when a scheduled backup starts before another finishes? Isn’t this going to be a performance nightmare? I imagine this could happen with a dozen different scheduled backups, as is the plan. Is my solution to precisely schedule each of the dozen or so backups so the timing never overlaps?

More info about my current setup: my primary NAS (local) is being backed up to another NAS (local). Both have 4 GB RAM. The Duplicacy Web UI is running on a spare laptop (local) with 8 GB RAM. The local network is wired 1 GbE. The future setup will include all of that, plus regular backups to the cloud.

Now I have a better understanding of your setup.

First point: your memory concerns should be focused on your laptop. The two NAS are acting just like… NAS. 8 GB seems to me more than enough. You can find some topics related to this subject here on the forum.

It really depends on the number of files:

Are we talking about millions of files? This will really take time…

As for overlapping backups being a performance problem: probably yes…

There was a post here from someone in a similar situation to yours – I searched but couldn’t find it. If I remember correctly, the solution was to create a script that runs the backups in sequence.
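
Something along these lines, I believe – a sketch assuming CLI-initialized repositories, with placeholder paths (with the Web UI, the schedule option in the EDIT below is the simpler route):

    #!/bin/sh
    # run each backup one after the other so they never overlap
    for repo in /My_Files/Folder_A /My_Files/Folder_B /My_Files/Folder_C; do
        cd "$repo" && duplicacy backup -stats
    done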

EDIT: I just remembered that in the web version you have the option of not checking the “parallel” option on jobs. So if you put all your backups in the same schedule, they will run in sequence and you shouldn’t have problems:

[screenshot: duplicacy_web_job_added]

In short, points to evaluate in your setup:

  • laptop memory (8 GB seems ok to me, but you should evaluate the number of files)
  • network speed (1 GbE wired seems to me more than enough)
  • speed of the disks of the two NAS

I just wanted to say thanks again. You helped me really think through my backup strategy and optimize it. I’ve since reorganized my folders on disk so that I can both have one Duplicacy backup per folder (as you recommended) AND maintain granular scheduling of the folders for optimal performance. (E.g., the most frequently used folder(s) back up daily, and the others weekly or monthly.)

Are we talking about millions of files? This will really take time…

I checked the storage manager, and yeah, about a million, so I suppose 5 hours to run a backup despite no changes is to be expected. (Given 8 GB of laptop memory, wired 1 GbE, and Synology Hybrid RAID 1 with 4x IronWolf drives in one NAS and 3x in the other.)

I signed up with Google Workspace recently with the intent of using it as Duplicacy storage, but I just found an old thread (with helpful remarks from you and Saspus) indicating it was a poor choice. Cancelling and going with something like B2 sounds like the best bet at this point, but one query: how should I think about the upload/download fees?

What is Duplicacy actually doing – from a technical perspective – that takes 5 hours on my local NAS? I assume it’s doing some kind of comparison – but how?

And more to the point, what will this mean in terms of upload/download fees on a “true” cloud provider like B2/S3/Azure/etc.?

If I go with, say, B2, are daily incremental backups going to cost me a fortune in fees?