Question about backing up a large volume to the cloud

Let’s say I have a large amount of data and want to back it up to the cloud, over a residential broadband connection. As an example, say everything is in /My_Files/:

My_Files/Folder_A/  (0.5 TB)
My_Files/Folder_B/  (0.5 TB)
My_Files/Folder_C/  (2.0 TB)
My_Files/Folder_D/  (2.0 TB)
My_Files/Folder_E/  (2.0 TB)
My_Files/Folder_F/  (2.0 TB)
My_Files/Folder_G/  (1.0 TB)

Backing up a large dataset like this for the first time is naturally going to take some time.

Let’s say I want to back up the most important data first (e.g., the small folders, A & B), and then add other, bigger folders later.

My question…

Would it be best to:

  1. Create a new backup in the backup tab (Duplicacy GUI)
  2. Use /My_Files/ as the source folder to back up
  3. Add folders C, D, E, F, G to the “exclude” list for the backup job
  4. Run the backup
  5. Wait days/weeks/months for it to finish
  6. Remove C, D, E, F, G from the “exclude” list for the backup job
  7. Run the backup again

Is this the right approach, or do I need to back up everything all at once?

Any disadvantages?
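
For what it’s worth, here’s roughly what I imagine the exclude list in step 3 would look like in Duplicacy’s filters syntax (just a sketch on my part; the folder names are the ones from the example above):

    # a leading "-" excludes a pattern; a trailing "/" matches a directory
    -Folder_C/
    -Folder_D/
    -Folder_E/
    -Folder_F/
    -Folder_G/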

Honestly, this is a good strategy, but you could do it another way…

To start, you could set up a backup repository at the root of each folder: My_Files/Folder_A/, _B, etc…

Once all your folder backups complete, you could create a new backup repository from the My_Files root and run it again. Since most of the chunks already exist in the backup storage, this is just about creating a single backup tree with new metadata. Then you could delete the old _A etc. IDs.
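
To make that concrete, here’s a rough CLI sketch of the first stage (the Web UI does the equivalent under the hood; the storage URL is just a placeholder, and the backup IDs are named after the folders):

    # hypothetical B2 bucket as the storage destination
    cd /My_Files/Folder_A
    duplicacy init Folder_A b2://my-backup-bucket    # repository root = Folder_A, backup ID = Folder_A
    duplicacy backup -stats

    cd /My_Files/Folder_B
    duplicacy init Folder_B b2://my-backup-bucket    # same storage, different backup ID
    duplicacy backup -stats
    # ...repeat for the remaining folders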

Personally, I’d probably break up all my data into separate repositories anyway, and keep them that way. Depending on how many files are in each folder, that might also be easier on RAM usage.

Another approach might be to break the data up into even smaller folders/sub-folders so your initial backups complete sooner, and then you can gradually add more folders and run incremental backups.

Remember, you won’t be able to do a restore unless you have at least one snapshot to restore from, and if it literally takes days/weeks/months to finish the initial backup, that’s a dangerous position to be in. So, let a backup complete early, add more data later.


To start, you could set up a backup repository at the root of each folder: My_Files/Folder_A/, _B, etc…

I’m still new to Duplicacy, so if you don’t mind me clarifying a few things…

When you use the term “repository”, how do you relate that to the Web UI? I did a CTRL-F on the Web GUI Guide and got no results for the term “repository”.

Repository == a backup task on the backup tab?

For example, are you suggesting:

FIRST,

  • Create a backup task for Folder_A/
  • Run it to completion
  • Create a backup task for Folder_B/
  • Run it to completion
  • Repeat for all folders, C, D, E, F, G.

THEN,

  • Create a backup task for the root folder of folders A, B, C, D, E, F, G – i.e., /My_Files/
  • Run to completion
  • (should be fast because all the chunks were already uploaded by the individual backups – is that right?)
  • Delete all the individual backup tasks for A, B, C, D, E, F, G
  • Rock and roll with 1 backup task forever (i.e., /My_Files/ backup task)

Did I get that right?

A repository in Duplicacy terminology is the backup source files for a single backup ID (yeah, I know, maybe a bit confusing, but I gather it’s based on a naming convention similar to git’s?).

That’s essentially it - unless you wanna faff around with the alternative of filters, but the end result is identical.

Again, personally, I might leave individual backup IDs for each folder but that depends on how organised your directory structure is and if you’ll be adding new folders. If you have a lot of files and not much RAM, breaking it up may help.
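
To put the second stage in CLI terms (again, placeholder storage URL; in the Web UI it’s just a new backup with /My_Files/ as the source and a new ID):

    cd /My_Files
    duplicacy init My_Files b2://my-backup-bucket    # new backup ID, same storage
    duplicacy backup -stats    # mostly metadata – existing chunks are deduplicated, not re-uploaded
    duplicacy list -a          # should show My_Files snapshots alongside the old per-folder IDs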


I always prefer a more granular configuration for backups (which in your case would be a backup per folder A, B, etc.)

Some advantages:

  • better control over schedules (I have a configuration similar to yours, with a smaller, more important folder that is constantly updated and gets daily backups; the others, which don’t change as often, are backed up at longer intervals (weekly), and some are even triggered manually);
  • better control over prune settings (see the sketch after this list);
  • fewer concerns about filters;
  • easier/targeted restores (in case I ever need them);

and others.
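
For example, each backup ID can have its own retention; a rough sketch with CLI options (the Web UI prune job accepts the same options, as far as I know):

    # important, frequently-changing folder: keep everything for a week, then roughly
    # one per day up to a month, one per week up to 6 months, one per month up to a year,
    # and nothing older than a year
    duplicacy prune -id Folder_A -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
    # rarely-changing folder: a much simpler policy
    duplicacy prune -id Folder_C -keep 0:360 -keep 30:30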

About terminology, take a look at the terminology guide here on the forum.


A repository in Duplicacy terminology is the backup source files for a single backup ID (yeah, I know, maybe a bit confusing, but I gather it’s based on a naming convention similar to git’s?).

That’s essentially it - unless you wanna faff around with the alternative of filters, but the end result is identical.

Perfect! Thank you.

Again, personally, I might leave individual backup IDs for each folder but that depends on how organised your directory structure is and if you’ll be adding new folders. If you have a lot of files and not much RAM, breaking it up may help.

I have TBs of small files (< 20 MB), and they’re well organized. As for RAM, I just have a basic Synology DS1819+ with 4 GB RAM. I have the Duplicacy Web UI running on a spare laptop with 8 GB RAM.

Is that enough? (Were you speaking of the memory on the NAS or the laptop?)

I always prefer a more granular configuration for backups (which in your case would be a backup per folder A, B, etc.)

Thank you. As I think about this, it really sounds like the way to go. The advantages you point out are compelling!

One concern…

My current daily, local Duplicacy backups (where the source is a local NAS and the destination is another local NAS) take a very long time. Sure, the first backup took over a week, but every subsequent backup takes about 5 hours. Even if nothing has changed with my data, daily backups reliably take almost exactly 5 hours. Is this expected?

Now, 5 hours to run a daily backup hasn’t been a problem, but what happens when a scheduled backup starts before another finishes? Isn’t this going to be a performance nightmare? I imagine this could happen with a dozen different scheduled backups, as is the plan. Is my solution to precisely schedule each of the dozen or so backups so the timing never overlaps?

More info about my current setup: my primary NAS (local) is being backed up to another NAS (local). Both have 4 GB RAM. The Duplicacy Web UI is running on a spare laptop (local) with 8 GB RAM. The local network is wired 1 GbE. The future setup will include all of that, plus regular backups to the cloud.

Now I have a better understanding of your setup.

First point: your memory concerns should be focused on your laptop. The two NAS are acting just like… NAS. 8 GB seems to me more than enough. You can find some topics related to this subject here on the forum.

It really depends on the number of files:

Are we talking about millions of files? This will really take time…

As for overlapping backups being a performance problem: probably yes…

There was a post here from someone in a similar situation to yours – I searched but couldn’t find it. If I remember correctly, the solution was to create a script that runs the backups in sequence.
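
Something along these lines, I believe – a sketch assuming CLI-initialized repositories, with placeholder paths (with the Web UI, the schedule option in the EDIT below is the simpler route):

    #!/bin/sh
    # run each backup one after the other so they never overlap
    for repo in /My_Files/Folder_A /My_Files/Folder_B /My_Files/Folder_C; do
        cd "$repo" && duplicacy backup -stats
    done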

EDIT: I just remembered that in the web version you have the option of not checking the “parallel” option on jobs. So if you put all your backups in the same schedule, they will run in sequence and you shouldn’t have problems:

[screenshot: duplicacy_web_job_added]

In short, points to evaluate in your setup:

  • laptop memory (8 GB seems ok to me, but you should evaluate the number of files)
  • network speed (1 GbE wired seems to me more than enough)
  • speed of the disks of the two NAS

I just wanted to say thanks again. You helped me really think through my backup strategy and optimize it. I’ve since reorganized my folders on disk so that I can both have one Duplicacy backup per folder (as you recommended) AND maintain granular scheduling of the folders for optimal performance. (E.g., the most frequently used folder(s) back up daily, and the others weekly or monthly.)

Are we talking about millions of files? This will really take time…

I checked the storage manager, and yeah, about a million, so I suppose 5 hours to run a backup despite no changes is to be expected. (Given 8 GB of laptop memory, wired 1 GbE, and Synology Hybrid RAID 1 with 4x IronWolf drives in one NAS and 3x in the other.)

I signed up with Google Workspace recently with the intent of using it as Duplicacy storage, but I just found an old thread (with helpful remarks from you and Saspus) indicating it was a poor choice. Cancelling and going with something like B2 sounds like the best bet at this point, but one query: how should I think about the upload/download fees?

What is Duplicacy actually doing – from a technical perspective – that takes 5 hours on my local NAS? I assume it’s doing some kind of comparison – but how?

And more to the point, what will this mean in terms of upload/download fees on a “true” cloud provider like B2/S3/Azure/etc.?

If I go with, say, B2, are daily incremental backups going to cost me a fortune in fees?