Diabolical download speeds off OneDrive using Duplicacy

saspus · 21 August 2022 03:59

You need to run it in the folder where the storage is initialized. Ideally, run it on your desktop, to remove Unraid from the picture entirely.

Otherwise, open the backup or check logs in the WebGUI; the first few lines will show a path to a temp folder.
You need to cd to that folder and run duplicacy benchmark (with the full path) in there.

rjcorless · 21 August 2022 08:47

I’m only running Duplicacy on the Unraid at the moment, not on my PC. Because I want a quick-as-possible restore process, I’m using a combo of EaseUS ToDo for the ‘bare metal’ backup, and I’ve also set it up to run a ‘smart’ backup which runs every 30 minutes across all my documents and stores locally on another (non-OS) drive in my PC. (This runs incremental backups throughout the day, then as midnight rolls over it creates a differential backup for the entire previous day and restarts incrementals for the next day.) Then I use Syncbackpro to transfer to my Unraid NAS at the end of the day. I was just going to use Duplicacy to provide that cloud-based backup from the Unraid server.

Syncbackpro can save directly to the cloud as well, including S3 (which ToDo can’t).

As you may tell, my ‘backup’ strategy is a bit of a patchwork quilt precipitated by me building my first NAS with Unraid on an old repurposed Sandy Bridge MB with an Intel 2500K and 16GB RAM. An oldie but a goodie! It has a very stable overclock of up to 4.5GHz, great cooling and 8 SATA ports. Perhaps my strategy is more a reflection of ignorance than of necessity!

rjcorless · 21 August 2022 10:39

Eventually got the benchmark to run. Tried a few different thread counts. This was 50 down and 8 up:

Generating 256.00M byte random data in memory
Writing random data to local disk
Wrote 256.00M bytes in 0.22s: 1183.77M/s
Reading the random data from local disk
Read 256.00M bytes in 0.11s: 2389.16M/s
Split 256.00M bytes into 52 chunks without compression/encryption in 1.30s: 196.19M/s
Split 256.00M bytes into 52 chunks with compression but without encryption in 1.94s: 132.24M/s
Split 256.00M bytes into 52 chunks with compression and encryption in 2.13s: 120.03M/s
Generating 64 chunks
Uploaded 256.00M bytes in 74.31s: 3.44M/s
Downloaded 256.00M bytes in 15.87s: 16.13M/s
Deleted 64 temporary files from the storage

Looks fine to me!

Droolio · 22 August 2022 00:01

Says who? It functions perfectly fine as such! I’ve used it successfully with Duplicacy (and Rclone) for personal, company, and clients - for many years. I’ve not come across any users - some who store petabytes of stuff in there - who complain of scaling issues, data loss or corruption… other than the well-known, quite reasonable, rate limits - i.e. 750GB/day U/L, 10(!)TB/day D/L, 10 transactions/sec - fair, ample, and doesn’t cost a penny more for the privilege.

Now, if you want to claim it wasn’t designed that way; citation needed. Let’s see what Google says here:

Perhaps because the userbase is tiny and other, much less costly, solutions exist and are used instead?

Perhaps they reasoned the pricing structure for GCS et al isn’t exactly straightforward to predict against unknown de-duplication efficiency and data growth.

Perhaps they want to adopt good backup practice by regularly testing restores?

Even IF Duplicacy supported archival level storage, Workspace Enterprise has unlimited storage and is already better value than anything demanding more than a few TBs.

You’re screwed when you actually have to do a restore though.

saspus · 22 August 2022 04:05

My personal experience. It’s an extra layer between me and GCS that I don’t need. I had issues with it. I did not have issues with GCS.

And that’s what distinguishes google from others, like dropbox and onedrive. Still, does not change anything in my reasoning.

I don’t have access to statistics.

Good one. With AWS you can restore a certain amount of data each month for free or at very low cost, which covers this use case.

The problem with this is that “unlimited” won’t last forever. They will eventually fix the quota issues and start enforcing them. When will this happen? No idea. But I’d rather not be left hanging, I would like to pay for what I use, and I don’t want to participate in some shady averaging and “unlimited” claims. Actually, they never use the word unlimited. They use “as much as needed”. And when they decide to cut down on abusers like your clients – it’s anybody’s guess.

I choose not to play this game and just pay for what I use. It’s transparent, straightforward, and fair.

Home users rarely need to restore everything at once. Restoring slowly in small chunks costs anywhere from nothing to very little.

Droolio · 24 August 2022 00:00

The change in terminology isn’t surprising. “Unlimited” has always been a misnomer - implies infinite, which of course, is impossible. Obviously there’ll be a hidden, fair use, limit with anything - yet they’d have to work hard to justify encroachment on the “as much as you need” marketing. A dozen or so TB isn’t gonna fuss Google when they haven’t fussed previously over many 100TB+ and even peta users, and it’s not like Google is suddenly desperate to reclaim storage space. They also know they’re unlikely to make much revenue from these users by bringing it down significantly, so why risk a good selling point?

I, like everyone else, was migrated from G Suite to Workspace, for marginally more cost per month. Even the full whack Enterprise Plus would only cost me £23 from next April (after my 50% discount expires). That’s still quite a bargain for what I need it for.

Incidentally, neither I nor my clients are abusers - I use what I need, and our clients’ use comes under the 5TB you get with Business Plus for £13.80pm. Which still works out to be less than Glacier without restores. Need more? Add more users and pool data. Still cheaper, free restores.

The colder the storage, the more expensive that restore process is.

IMO, ‘archival’ tier storage should be for… archives. Not continuous backups.

Though I’m eager to see Duplicacy support separate metadata (primarily so I can duplicate that directory in my pools for added redundancy), I strongly suspect those intrepid arctic explorers will be disappointed at the cost savings when all’s factored in.

saspus · 24 August 2022 05:15

Understood. This is not unreasonable; there will be some users that are happy with the service (I was one of them myself, and was saying the same things almost verbatim). It’s probably also possible to find other corner cases and niches to get the best deals in every area. They may change from time to time – for example, before Google there was unlimited Amazon Drive. It was perfect in the same way – use as much as you want, it works, based on AWS, with a thin layer. It went away. People found hubiC, which also went away. Now people found G Suite. While S3 kept working just the same.

I thought hunting bargains was worth my time. Now I think it isn’t. I guess this is the underlying reasoning.

It seems that way on the surface. But on a closer look, hot storage is suitable for fast access. That’s what it is good at. Backups don’t benefit from that performance at all. In fact, Amazon themselves suggest Glacier for backups here:

**S3 Glacier Flexible Retrieval (Formerly S3 Glacier)** - For long-term backups and archives with retrieval option from 1 minute to 12 hours

Yes. Backup is like insurance. One should plan for the risk-adjusted cost. The expectation is to never have to restore the whole thing, while occasionally restoring a few things here and there. Archival tiers are the most cost-effective for these circumstances.

I.e. you are not optimizing (cost of storage * years + cost of restore), you need to optimize cost of storage * years + risk of loss * cost of restore. And this makes restore cost pretty much irrelevant.
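To make that formula concrete, here is a minimal Python sketch. Every number in it (yearly storage prices, the restore price, the 5% chance of a full restore) is a made-up assumption for illustration, not a real quote from any provider.

```python
def naive_cost(storage_per_year, years, restore_cost):
    """Total cost if you assume a full restore will certainly happen."""
    return storage_per_year * years + restore_cost


def risk_adjusted_cost(storage_per_year, years, restore_cost, risk_of_loss):
    """Expected cost: the restore is only paid with probability risk_of_loss."""
    return storage_per_year * years + risk_of_loss * restore_cost


# Hypothetical tiers: "hot" has pricey storage and free restores,
# "cold" has cheap storage and an expensive full restore.
years = 5
risk = 0.05  # assumed chance of needing one full restore in that period

hot = risk_adjusted_cost(240.0, years, restore_cost=0.0, risk_of_loss=risk)
cold = risk_adjusted_cost(48.0, years, restore_cost=200.0, risk_of_loss=risk)

print(f"hot tier, risk-adjusted:  {hot:.2f}")
print(f"cold tier, risk-adjusted: {cold:.2f}")
print(f"cold tier, naive:         {naive_cost(48.0, years, 200.0):.2f}")
```

With these assumed inputs the restore price barely moves the risk-adjusted total, because it is multiplied by a small probability; the storage term dominates.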

I calculated the cost. It can be very high, if you want everything at once, or very low, if you don’t.
Here is a back of the napkin old calculation Arq posted: AWS Glacier Pricing - How to Calculate the Real Cost | Arq Backup. Today it’s cheaper.

And here is where I disagree with the approach: all data needs to be guaranteed immutable. Not just metadata. If I doubted my storage, I would not duplicate metadata, I would replace the storage.

Droolio · 24 August 2022 12:22

ISTM these object storages are designed around the premise of being used with cloud compute, where egress to the internet is not the usual case. I doubt such long wait times for object retrieval would work too well with a backup tool like Duplicacy - even where metadata was hot. Supporting archival tier may involve quite a bit more work than isolating metadata. Maybe a copy to local might be most feasible.

And what to do meanwhile… Duplicacy doesn’t support it yet.

Nothing is guaranteed immutable - that’s the point. There really is nothing wrong with adding redundancy when it’s extremely minimal effort and cost (metadata amounts to very little). At the flick of a couple check boxes, I could have Stablebit DrivePool duplicate that directory on however many drives I want - a configuration similar to SnapRAID’s content files.

Given that Duplicacy backup storage is easy to ‘repair’ by copying missing chunks from offsite storage, duplicating metadata is just another layer of redundancy that expedites a DR scenario.

rjcorless · 26 August 2022 11:23

Final Update:
Couldn’t put up with the thought of a ~2MB/s restore speed off OneDrive (even if it was just once per year), so I’ve used my Amazon AWS account to use their S3 storage offering.

So far, weekly full backups and daily differentials on Windows documents files are being uploaded automatically at the end of each day (even though there are many incrementals being saved to Unraid every 30 minutes that I can also use quickly if I need to).

Currently have a full weekly backup and daily incrementals on a two week rotation for a bare metal restore. Given the daily differentials and Unraid incrementals during the day appear to work OK, and given I don’t update/install new updates or programs to my Windows PC, I may review this further and just run a full backup weekly and, say, keep a month’s worth, which is probably about 350GB in total. May just leave it at two though. Not decided yet.

Looks like this will cost me about $3 per month which is nothing really. But really happy with both upload and download speeds that I can achieve with AWS. Will just do me fine.

Thanks for the insight and discussion! Very interesting and something I have never bothered with before!

rjcorless · 13 October 2022 14:22

UPDATE:
Well, while the actual ‘storage’ cost for my 600GB rolling backup resulted in a nominal monthly charge of about £5, the charge for getting the data onto AWS within the month cost me another £79. What the heck? As I cannot see my strategy changing in backing up a ‘bare bones’ image, there is no way I am going to pay £84 per month for uploading and storing just 600GB. I’m out!!

saspus · 14 October 2022 02:15

Not sure how you got these numbers; check the bill.

My upload of about 2TB to AWS cost me about $20 and then I pay $3 monthly for storage and incremental updates.

The idea being upload and download cost are irrelevant, you do it only once, but storage cost matters — you pay it forever.

Your numbers are however crazy high. What tier have you used? Glacier Instant Retrieval is what you want, as there is no thawing involved, until Duplicacy supports archival storage properly.

Edit: since you are backing up an image you might have very high data turnover. This will be very expensive as low cost tiers penalize early deletion. Set chunking to fixed to combat it somewhat. But I would reconsider backing up images, even if you think it won’t work for you.
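To see why high turnover hurts on tiers with a minimum storage duration, here is a minimal sketch. The 90-day minimum and the 14-day rotation are assumptions for illustration; check the current S3 pricing page for the tier you actually use.

```python
def billed_gb_days(size_gb, days_kept, min_days):
    """GB-days billed for data deleted after days_kept on a tier that
    charges for at least min_days of storage (early-deletion rule)."""
    return size_gb * max(days_kept, min_days)


size_gb = 600      # size of the rolling image backup
rotation = 14      # assumed: chunks are pruned after two weeks
min_days = 90      # assumed minimum storage duration of the tier

actual = size_gb * rotation
billed = billed_gb_days(size_gb, rotation, min_days)

print(f"stored: {actual} GB-days, billed: {billed} GB-days "
      f"({billed / actual:.1f}x)")
```

Data kept longer than the minimum is billed for what it actually used, so the penalty only bites when chunks are replaced quickly, which is what an image backup tends to do.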

rjcorless · 14 October 2022 21:43

Seems a tad excessive for < 700GB upload. Will need to rethink this one!

saspus · 15 October 2022 09:30

So, your storage cost is $9.65/month, likely because you are using the very expensive “S3 Standard” storage tier. Instead, you should use “S3 Glacier Instant Retrieval” or “S3 Intelligent-Tiering”, with a cost structure that is better aligned with the backup use case.

The source of the problem however is this:

It’s a download from S3. It’s expected that this is expensive; it’s also expected that you should never need to do it, until your machine and all local copies of the data go up in flames and you need to do a full restore.

So, what caused 684GB to be downloaded from S3?
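For a sense of scale, here is a rough estimate of what that much egress costs. The $0.09/GB rate and the optional free allowance are assumptions; the real figure depends on region, currency, and tax, so treat this as a sketch and not as a reconstruction of the bill.

```python
def egress_cost(downloaded_gb, price_per_gb=0.09, free_gb=0.0):
    """Estimated charge for downloading data from S3 to the internet.

    price_per_gb and free_gb are assumed values, not taken from the bill.
    """
    billable_gb = max(downloaded_gb - free_gb, 0.0)
    return billable_gb * price_per_gb


print(f"684 GB out: ${egress_cost(684):.2f}")
print(f"684 GB out, 100 GB free: ${egress_cost(684, free_gb=100):.2f}")
```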

sevimo · 16 October 2022 00:04

check --chunks?

rjcorless · 16 October 2022 08:44

Dunno! I ran a ‘prune’ a couple of times to check how it would work going forward, but other than that I did not download anything. Just ran backups.

Anyway, I’ve extricated myself from the AWS service. If I cannot understand/control my spend, then I need to look for alternatives. Recently (via Windows admittedly) I was backing up using Syncbackpro and writing direct to an online OneDrive account. It was managing up to 10MB/s so I may just go with that as it’s no extra cost to me.

saspus · 16 October 2022 23:10

Picking alternative storage is not a solution for unexplained egress. Regardless of which storage provider you end up using, there should be no mysterious traffic, even if it’s not explicitly charged.

I would review your logs, and while I too think the root cause @sevimo suggested is a likely culprit, you need to get to the bottom of it. 600GB is not a bag of peanuts to get lost in the shuffle.

doher · 6 February 2023 19:07

How would you avoid this charge? Can you run a check without --chunks to avoid downloading?

sevimo · 6 February 2023 19:21

You don’t need the -chunks flag to run check. Without it, check only verifies that the referenced chunks exist, and assumes that if a chunk file exists it has the right content. This avoids most of the download activity on check.
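The difference between the two modes can be illustrated with a toy model. This is a conceptual sketch only, not Duplicacy’s actual implementation: the in-memory dict standing in for the storage, the SHA-256 chunk naming, and both function names are assumptions made up for the example.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Toy chunk name: the SHA-256 of the content (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def check_existence(storage, referenced):
    """List the storage and report referenced chunks that are absent.
    A listing returns names only, so nothing is downloaded."""
    listed = set(storage)
    missing = [cid for cid in referenced if cid not in listed]
    return missing, 0


def check_content(storage, referenced):
    """Download every referenced chunk and re-hash it."""
    bad, downloaded = [], 0
    for cid in referenced:
        data = storage.get(cid)
        if data is None:
            bad.append(cid)
            continue
        downloaded += len(data)
        if chunk_id(data) != cid:
            bad.append(cid)
    return bad, downloaded


good = b"a" * 1000
corrupt_id = chunk_id(b"b" * 1000)
storage = {chunk_id(good): good, corrupt_id: b"x" * 1000}  # second chunk is damaged
referenced = [chunk_id(good), corrupt_id, "missing-id"]

print(check_existence(storage, referenced))  # finds only the missing chunk, 0 bytes read
print(check_content(storage, referenced))    # also finds the damaged chunk, reads 2000 bytes
```

The existence check is cheap but cannot see the damaged chunk; the content check catches it at the price of downloading everything that is referenced.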

doher · 6 February 2023 19:43

Got it, I just looked up the documentation and posts about the chunks and files flags. Thanks!

rpendleton · 26 February 2023 07:30

On the note of checking chunk content, I think it’s also worth mentioning that AWS doesn’t charge for data transfer between an S3 bucket and other AWS services in the same region.

So, if you really want (or need) to check the content of chunks stored in S3 for some reason, it might be cheaper and/or faster to perform the check using an EC2 instance. (This is especially true if you already have an EC2 instance that you’re willing to use.)

To be clear, there are still some costs associated with checking chunk content with S3:

You’ll still be charged for S3 GET and LIST requests.
If you’re using S3 Intelligent Tiering, reading older chunks will still transition them back to the more expensive frequent access tier.
If you’re using other storage classes that have retrieval fees, you’ll still be billed those retrieval fees.

With that said though, if you already have your mind set on checking chunk content, these costs are relevant regardless of whether you use an EC2 instance or not.

This means the main thing to consider is whether the cost of the EC2 instance would be cheaper than the S3 data transfer costs.

I haven’t personally tried this, so it’s hard for me to say for sure whether it’s actually cheaper or not. I figured I’d at least point it out for those interested in checking chunk content with S3 storage though.
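For anyone who wants to run the numbers for their own setup, the comparison could be sketched like this. All prices and the throughput figure are assumptions, and request and retrieval fees are left out because, as noted above, they apply either way.

```python
def hours_needed(data_gb, throughput_mb_s):
    """Hours to read data_gb at a sustained throughput in MB/s."""
    return data_gb * 1000 / throughput_mb_s / 3600


def internet_check_cost(data_gb, egress_per_gb):
    """Cost of pulling every chunk out of S3 over the internet."""
    return data_gb * egress_per_gb


def ec2_check_cost(data_gb, throughput_mb_s, instance_per_hour):
    """Cost of checking from an EC2 instance in the same region:
    transfer is assumed free, so only instance time is counted."""
    return hours_needed(data_gb, throughput_mb_s) * instance_per_hour


data_gb = 684
via_internet = internet_check_cost(data_gb, egress_per_gb=0.09)    # assumed rate
via_ec2 = ec2_check_cost(data_gb, throughput_mb_s=50,              # assumed speed
                         instance_per_hour=0.05)                    # assumed price

print(f"over the internet: ${via_internet:.2f}")
print(f"from EC2:          ${via_ec2:.2f}")
```

Under these assumptions the instance time is small next to the transfer charge, but a slower instance, a pricier instance type, or a storage class with retrieval fees would change the picture.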