Preferred storage backend?

Is there a preferred storage backend for use with duplicacy?

I have about 1 TB (uncompressed) of stuff I definitely want to back up, another 1 TB I would prefer to back up, and another ~4 TB I would back up if it didn’t cost me extra. My current ISP’s maximum upload speed is about 10 Mbps. My dataset is mostly append-only, with occasional manual pruning.

It seems that for 2 TB and under my best choice would be Dropbox, as it’s $10/month with no additional fees, although their upload performance and storage redundancy aren’t really clear to me. B2 is similarly priced as long as I don’t download anything, but I’m not sure how chatty Duplicacy would be during normal operation. I assume there are also some storage backends with fewer Duplicacy users that haven’t been as thoroughly vetted?

Basically, while having the choice of storage backend is nice, since I don’t have any current preference I’d really just like to be nudged in a particular direction. :slight_smile:

Thank you!

B2 is pretty popular here (you can get free egress through Cloudflare, and Duplicacy supports that). Google Workspace is another great option, albeit with a 750 GB/day ingress cap. With Dropbox I personally had issues; YMMV.
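In case it helps, pointing Duplicacy at a B2 bucket is mostly just a matter of the storage URL. A minimal sketch (the bucket name and snapshot ID below are placeholders; Duplicacy will prompt for the B2 key ID, application key, and a storage password):

```
cd /path/to/repository
duplicacy init -e my-backups b2://my-duplicacy-bucket   # -e encrypts the storage
duplicacy backup -stats                                 # first backup, with statistics
```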

2 Likes

Awesome. I was a bit scared of B2 because of the “corrupted chunk” thread, but it seems like Wasabi had a similar issue, so maybe it’s just par for the course with these cheaper storage services.

To get free egress through Cloudflare, I assume that means a public endpoint serving the data?

I’d rather not use Google if avoidable, due to privacy concerns, but I suppose it doesn’t matter if I encrypt the data first. Are Google One storage and Google Workspace the same thing? It looks like Google One is two bucks a month cheaper for the 2 TB tier.

Do you know if the 90-day minimum retention policy at Wasabi is a concern? I assume the only thing normally being updated would be metadata, so there won’t be a lot of churn on the bulk of the data?

Thanks so much!

No, you can use public and private buckets:

This is correct.

Google One accounts are accounts for home users, while Workspace accounts are professional office packages made up of collaborative productivity apps that feature advanced administrator controls and personalization features.

It is not actually a 90-day retention policy, but a “minimum billing period” policy (of 90 days). It only has an impact from a cost point of view: if you delete a file or bucket before it has been stored for 90 days, you continue to be charged for it until those 90 days are up.

1 Like

After looking at their docs, the authorization definitely takes place between Cloudflare and B2, so even though the bucket endpoint is private, the Cloudflare endpoint is not… potentially someone could download your data and brute-force it if they were so inclined.

But from a storage standpoint: can Duplicacy connect to both similarly, and do they provide the same durability?

I was mostly worried about Duplicacy metadata churn increasing the bill because of that billing policy, but after digging into Duplicacy more, it seems unlikely to be a concern in that regard.

Brute-forcing RSA encryption is not feasible. Besides, Cloudflare has a firewall; you can configure it to only allow access to yourself.
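For context, the RSA support works by encrypting chunks with a public key so that only the private key (which never needs to live on the backed-up machine) can decrypt them. A rough sketch, with the file and bucket names made up and the Duplicacy flags quoted from memory (double-check them against the docs for your version):

```
# Generate a key pair with OpenSSL; keep private.pem (and its passphrase) offline.
openssl genrsa -aes256 -out private.pem 2048
openssl rsa -in private.pem -pubout -out public.pem

# Initialize encrypted storage with the public key; backups need only public.pem.
duplicacy init -e -key public.pem my-backups b2://my-duplicacy-bucket

# Restores are the only operation that needs the private key.
duplicacy restore -r 1 -key private.pem
```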

Probably, although I personally have never tested it.

I always recommend using storage services (Google Cloud, Amazon S3, Backblaze B2, etc.), although I understand that there is a (small) difference in cost. They have the advantage that you only pay for the space you use. And most importantly, they are much more reliable.

1 Like

… explicitly designed for bulk usage (as opposed to storing and sharing documents), performant, and robust, because the incentives are aligned correctly

1 Like

I didn’t know that Duplicacy supported RSA or that you could put a WAF in front of the Cloudflare CDN. So, thank you!

It does seem like the space provided with Google Workspace is Google Drive (document sharing, etc.)… so probably the same product included with Google One, rather than the Google Cloud object store.

… explicitly designed for bulk usage (as opposed to storing and sharing documents), performant, and robust, because the incentives are aligned correctly

Sorry to resurrect this old thread from March; I wish I had noticed it earlier. I just signed up for Workspace thinking I could use the storage for Duplicacy backups. I’m still within the 14-day trial period. Should I even attempt to back up several TBs, or cancel immediately and go with Backblaze B2/Amazon S3/etc.?

I was under the impression that the only drawback to Workspace was the 750 GB daily ingress limit, which is fine given my bandwidth, but it sounds like there’s more to it than that.

I use Workspace and it works well. I have about 1.5 TB in one dataset and about 4 TB in the other. Works great.

There are, however, caveats, though fortunately none are dealbreakers. If you review recent topics you’ll find a reference to an issue where check may fail and chunks may appear missing because there are two subfolders with the same name inside the chunks folder. This can happen because each folder still has a different ID, and Google Drive storage is eventually consistent, so if you back up from multiple machines it is not impossible for a race condition to result in duplicate folders.

You can fix that with rclone as described here, and we should ask @gchen to implement auto-healing or handling of this scenario in Duplicacy.
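The rclone fix boils down to its dedupe command, which merges identically named directories before it even looks at duplicate files. A rough sketch, assuming you have an rclone remote (here called gdrive) pointing at the same Drive folder that holds the Duplicacy storage:

```
# Preview what would be merged first.
rclone dedupe --dry-run gdrive:duplicacy-storage/chunks
# Then merge the duplicate chunk subfolders for real.
rclone dedupe gdrive:duplicacy-storage/chunks
```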

Another limitation is that a shared drive on Google Workspace is limited to 400,000 items, which is easy to exceed. The solution is simple: don’t back up to a shared drive.

I personally back up to an app folder using a service account, as described here: Duplicacy backup to Google Drive with Service Account | Trinkets, Odds, and Ends. Alternatively, just back up into a My Drive folder.
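For the simple My Drive route, the storage URL is just a path inside your Drive. A sketch of what that looks like (folder name and snapshot ID are placeholders; Duplicacy will prompt for the Google Drive token file from its OAuth flow, or the service-account credentials if you follow the guide above):

```
cd /path/to/repository
duplicacy init -e my-backups gcd://duplicacy-backups   # folder under My Drive
duplicacy backup -stats
```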

3 Likes

This is really great information, thank you.

In my case, I’ll need a few more TBs, and I have 1 million+ files. I read your (very cool and detailed) guide for backing up to Google Drive with a service account, but alas, I don’t think that level of complexity is for me. My data is important, irreplaceable stuff, so I’m apprehensive about relying on experimental methods to work around policy. A “true” storage service like B2 seems like a safe, reliable bet for someone at my skill level, and the value proposition of $400+ per year (ouch) in B2 storage, while expensive, seems compelling. Not to mention the incentive alignment you mentioned.

Oh, just to be clear: all that dance was needed simply to keep the Duplicacy datastore from appearing in My Drive, so it doesn’t pollute Recent Files and can’t be accidentally deleted in a fat-finger accident.

If you have a dedicated account just for Duplicacy, then this is not a factor and it will “just work” out of the box, as long as you place the datastore in My Drive (as opposed to a Shared Drive).

B2 is also not without issues though… nothing is infallible :slight_smile:

1 Like

B2 is also not without issues though… nothing is infallible

Gah! I was afraid you’d say that. Just when I thought I was starting to understand things, the plot thickens… :joy:

1 Like

In with a late comment…

Chunk churn’s going to depend on how much the data you’re backing up changes. The cost of storing the metadata is noise within rounding error.

The pruning strategy to get the most for your money with Wasabi is to keep any data you’re still paying to store, which is every snapshot younger than 90 days. Beyond that, keep whatever works for you. I keep all snapshots for 90 days, one per month out to a year and nothing older than that. Restores for anything other than testing purposes are (fortunately) a rarity for me and I can’t think of an instance where I’ve gone after anything in a snapshot older than a week.
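For what it’s worth, that retention maps almost directly onto Duplicacy’s prune options; roughly this (the -keep rules are listed from the longest horizon down, per the prune docs):

```
# Keep everything younger than 90 days (no rule needed for that range),
# one snapshot every 30 days for anything older than 90 days,
# and nothing older than a year.
duplicacy prune -keep 0:365 -keep 30:90
```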

What endpoint are you using, and how has reliability been so far? I rage-quit Wasabi because, over the course of a year, us-west-1 was unusable and us-east-1 was better but still too flaky. If a home user like myself notices instability, I wonder how they manage to sell storage… In addition, during the same year they hiked the price from $3.99 to $5.99 per TB, killing any advantage over B2…

1 Like

The bulk of my data is in us-east-1, has been for almost four years, and it’s been rock-solid reliable. There have been a small handful of backup failures that weren’t Wasabi’s problem, and restore tests have always come up correct. I’ve done some small-scale experimentation with us-west-1 that went without problems, but not enough to get a bead on whether there are any reliability issues.

I’ll apologize in advance for the third degree here; I make my living developing systems to measure and troubleshoot network and system performance:

What do “unusable” and “flaky” mean? What kind of problems were you experiencing? What steps did you take to verify that the problems were actually Wasabi’s fault and not caused by your systems or the network between them?

I was one of their early customers, so I’m still in at $3.99. Egress is $40/TB, which makes a full restore or verification expensive. I’m not worried about verification beyond a chunk inventory because Wasabi verifies the integrity of each object every 90 days and I’m satisfied that the steps they take to store the data will result in high durability.

To be fair, the higher price includes unlimited egress, which is a good deal for some applications. My backup arrangements should make having to do a full restore from the cloud a rare-enough event that the expense will be worth it. Even so, if I switched to the current pricing, the increase on my monthly bill wouldn’t be enough to squawk about.

I tried B2 as part of an extensive survey of storage providers. Unfortunately, the day I started my evaluation, they did maintenance that went very badly. The planned window was four hours; they ended up down hard for eight and didn’t get everything back up until nine had passed. They did it during business hours, didn’t have a plan to abort and roll back when things got bad, and were doing something that should have been transparent to the storage system. That last item hints that they may have architectural problems. Needless to say, I noped right on outta there.

As far as I remember (and I don’t have logs from that time), it was API timeouts or 500 errors, several times a week. I’m pretty sure it wasn’t my systems or connection; backups from the same machine with the same software to AWS were working fine. It was on us-east-1. I live on the West Coast, so when that datacenter was brought online I tried to switch; it had similar-sounding issues even more frequently, so I switched back to us-east-1. I thought they would eventually fix those early bugs, but over a few months nothing changed, so I switched to B2 and closed the account; otherwise they charge you for a terabyte anyway.

A year later I tried it again (in spite of the higher cost) and it wasn’t any different.

Wait, they verify, and if corruption is detected, then what? Do they have any redundancy to recover from?

My idea was that I would likely never need to restore, and if I do, IIRC you can switch to a new plan (twice a year or so?) and download for free. With B2 there is the Cloudflare integration (Bandwidth Alliance?) to get free egress. Or just pay the fee. The cost-to-risk tradeoff justifies that.

Yeah, they recently had some issues returning corrupted data too. While issues can happen with any product, having these basic bugs at such a late stage is bizarre. The system should be designed so that it either returns good data or none at all. Go figure what they have concocted there.

I don’t trust them either (I had a chat on Reddit with one of their folks about the client-side encryption they advertise in their Personal/Business backup products; it’s very misleading, to say the least. TL;DR: if you have to give your keys to Backblaze “for a very short time, and we promise we don’t save them” so they can decrypt your data, it’s not client-side encryption). It’s a really strange company :). I too switched away.

Hi, which other provider did you switch to?

I’m physically and topologically close to us-east-1; the path is Me -> Verizon IAD8 -> Hurricane Electric ASH1 -> Wasabi. I work with a few of their big customers, who aren’t shy about sharing their opinions. If Wasabi were unreliable, I’m pretty sure it would have come up.

Of course. It’s erasure-coded across drives in a diverse set of chassis and racks; Wasabi has a whitepaper on it. IIRC, it’s a 20-drive stripe that can survive the simultaneous loss of any four, i.e. roughly 16 data shards plus 4 parity shards (like RAID6, but with four parity shards instead of two, at about 25% storage overhead). If they’re running the place as I think they should, the verification is not that the object as a whole can be read from the front end, but that the data and parity blocks underpinning it are present and correct. It’s a good hedge against problems they can fix before they become customer-affecting.

1 Like