I’ve read a TON of older threads on this topic, and I’m now convinced by @saspus and his argument for ‘aligned incentives’. (Also, I tried backing up to my OneDrive Business account and got all sorts of rate-limit and throttling issues.) Therefore I want to use AWS, Google Cloud, Wasabi, B2, etc. But all the previous topics seem to be a few years old.
I only have about 175GB of data, which will change only in very small amounts, and which I hope to never need to restore (this is an off-site backup for potential disaster recovery). Since providers like Wasabi and B2 want to charge me for a full 1TB, they look relatively expensive at $7/month.
So then I turn to AWS and GCS, but I immediately find myself lost trying to figure out:
A) which storage tier
B) how much is it actually going to cost
Can anyone with recent experience provide me with some guidance on those two questions?
I’ve read a lot, but I still don’t really understand what will happen if I choose the AWS “Glacier Flexible Retrieval” option, or how it would compare to other AWS or GCS options. Or if there’s some other backend I should be considering…
For about 200GB of data, cost probably would not matter much if you were charged for actual storage used (unlike Wasabi). BTW, I don’t think Backblaze B2 has a minimum 1TB charge. Unless something changed recently?
Glacier Flexible Retrieval won’t work – duplicacy cannot deal with storage that requires thawing.
Duplicacy will either not be able to see the files, or see the files but fail to download them: to restore, archived objects need to be “thawed” into hot storage first and read from there afterwards (see the sketch below).
Glacier Instant Retrieval is $4/TB/month just for storage, but you also have overhead costs; at that price you can use Storj or B2 and not be nickel-and-dimed.
The Google Archive tier will work – but it’s ridiculously expensive for what it is, with a 365-day minimum retention period(!).
Glacier Deep Archive would be the best – but duplicacy does not support thawing… The only solution I know of (and use as a second backup) is Arq.
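To illustrate what “thawing” involves, here is a rough sketch with the AWS CLI (the bucket and key names are made up): a restore request has to be issued per object, out of band, and the object only becomes readable hours later – duplicacy has no step that does this and waits.

```
# Sketch only: "thawing" one archived object via the AWS CLI (bucket/key are placeholders).
aws s3api restore-object \
  --bucket my-backup-bucket \
  --key chunks/00/0a1b2c3d4e \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'

# Hours later, check whether the temporary hot copy is ready before downloading:
aws s3api head-object --bucket my-backup-bucket --key chunks/00/0a1b2c3d4e
# look for: "Restore": "ongoing-request=\"false\", expiry-date=..."
```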
My personal choice for some time now has been Storj, with the default chunk size overridden (set the average to 32 or 64MB from the default 4MB, to minimize the number of chunks, which improves performance and lowers cost – a sketch follows below). It appears to be the best in terms of bang for buck right now – you get geographic redundancy but pay less than what B2 and Wasabi charge for single-location data, and it performs very well (if you have a beefy internet connection – i.e. the bottleneck is your internet and CPU, not the service).
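For reference, the chunk size is fixed when the storage is initialized; a minimal sketch (the snapshot ID is made up, and the storage URL is a placeholder for whichever backend you configure):

```
# Sketch: initialize a repository with a 32MB average chunk size instead of the default 4MB.
# -e enables encryption; -c sets the average chunk size (must be a power of two).
cd /path/to/repository
duplicacy init -e -c 32M my-documents "<storage-url>"
```

Note that an existing storage keeps the chunk size it was created with; to change it you have to initialize a new storage and back up to it from scratch.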
Thank you – that’s super helpful. I had completely misread the B2 costs as being based on a minimum of 1TB (like Wasabi). I had realized that Glacier Deep Archive wasn’t an option because Duplicacy doesn’t support thawing, but didn’t quite understand how Google Archive works – sounds like that’s best avoided for my case.
At $0.006 per GB, my B2 costs would be less than $1/month, with Storj being closer to $0.50/month. Both very reasonable. I guess the deciding difference between the two is probably which one is more likely in the coming years to increase prices dramatically or have a major drop in reliability (or just shut down entirely), thus forcing me to set up something new.
I have been using Arq for many years, and will continue to do so for at least the next year, but I really like the de-duplication, and I’ve spent the last few weeks knocking the rust off my command-line skills and figuring out how to get launchd working on my Mac. So I plan to implement a duplicacy remote backup system in parallel with Arq, and then see where I am a year from now. Arq has been great for me, but I realize I’m relying on one single person.
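In case it’s useful to anyone else going down the launchd route, here’s roughly what I’m putting together – a sketch only, with a made-up label, paths, and schedule:

```
# Sketch: a LaunchAgent that runs the backup nightly (label, paths and times are placeholders).
cat > ~/Library/LaunchAgents/com.example.duplicacy-backup.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.example.duplicacy-backup</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/duplicacy</string>
    <string>backup</string>
    <string>-stats</string>
  </array>
  <!-- duplicacy must run from the repository root -->
  <key>WorkingDirectory</key><string>/Users/me</string>
  <key>StartCalendarInterval</key>
  <dict><key>Hour</key><integer>2</integer><key>Minute</key><integer>0</integer></dict>
  <key>StandardOutPath</key><string>/tmp/duplicacy-backup.log</string>
  <key>StandardErrorPath</key><string>/tmp/duplicacy-backup.err</string>
</dict>
</plist>
EOF

launchctl load ~/Library/LaunchAgents/com.example.duplicacy-backup.plist
```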
Nothing is forever, and it’s best to deal with this when it happens.
Same logic here: neither Arq nor duplicacy (also a one-person team) is forever. When either of them decides to close up shop, the code does not turn into a pumpkin and can still be used for years, until the OSes diverge enough to break it. Only then would I consider a replacement.
Personal opinion:
I never had a good interaction with Backblaze, and I don’t trust them with my data. Their user-facing apps are unmaintained garbage (stale data in the config file that misses the point, a backup app that by design does not back up large files for up to 48 hours, backing up terabytes of transient garbage, etc.), and their responses to bug reports are inconsistent and mostly CYA (“Why didn’t you file a support request?” I did, and you ignored it two years prior). And while one could argue that the quality of their user-facing apps is not necessarily indicative of their backend software quality – that argument is defeated by them having had more than one recent failure where the B2 endpoint was returning bad data without reporting errors (you can search this forum for details). Fortunately, the data at rest was intact; it was just the code that reads and assembles shards that was broken. But at this stage of product maturity this kind of nonsense is absolutely unacceptable. So they are either negligent or incompetent – in either case I would not want anything to do with them.
With Wasabi I had persistent reliability issues on their us-west-1 endpoint, after which I gave up and switched to us-east-1; that was a bit better, but with much higher latency. Then they jacked up prices within 2 years from $3.90/TB to $6 (or $7?), and they will ban your account if you download more than you upload. On top of that, they have a 3-month minimum retention policy. All around a greedy company. I believe their founders are the people (or are related to the folks) who started Carbonite – a garbage “unlimited” backup service that would by default silently skip large files. Genius.
Storj is the new kid on the block, and so far it’s going fine. Due to its end-to-end encrypted architecture it cannot return bad data, and mathematically the durability is quite good; their white paper is quite convincing. I’ve been using them for everything for the past few years – no problems so far. What will happen in 10 years – no idea. But we’ll cross that bridge when we get there.
Per my understanding, Storj is not optimized for small files. If one uses Duplicacy + Storj to back up small files, would the pricing still be competitive compared to B2?
Yes, I’ve just now been trying to figure out what my actual costs would be for the segment charges on Storj. If I exclude my media files (which I might just rclone copy to ODB or AWS Glacier) then I’ve got more like 100GB to back up, but it’s mostly small files – certainly more than 100,000 files.
That’s why you set the chunk size to something close to 64MB. The sizes of the files Duplicacy backs up and the sizes of the objects Storj sees don’t correlate much.
This is an excellent approach. There is no need to attempt to compress and deduplicate incompressible data. Glacier Deep Archive with Object Lock is perfect for disaster recovery scenarios.
I would not worry about a few bucks a month. Pick by quality, durability, performance, and reputation. Even if it were 10x more expensive, it’s still going to be under $10/month.
You can set just the average chunk size to 32MB. That’s an order of magnitude fewer chunks compared to the default, which is a “good enough” improvement. You don’t want to go too high because then you start wasting space. But there is really no need to stress over these minor details. If you had several hundred terabytes to back up, then maybe it would make sense to optimize more.
fwiw, i switched to Google Cloud Storage from Wasabi around the start of last year. I set it to ‘Autoclass’ (transitions the storage class based on access patterns). Most of the chunks have settled in ‘Archive’ and I’m paying £1.20 or so per month for 535GB.
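for the curious, turning it on was basically a one-liner; something like the below (the bucket name and location are made up, and i believe newer gcloud versions also let you choose Archive as the terminal class – check the gcloud docs for the exact flag):

```
# Sketch: create a bucket with Autoclass enabled (bucket name/location are placeholders).
gcloud storage buckets create gs://my-duplicacy-backups \
  --location=europe-west2 \
  --enable-autoclass
```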
Oh, great – I’d love to hear more about this. Questions:
Were your costs higher for the first 30 days, first 90 days?
Everything is working fine with duplicacy backups, checks, prunes?
My previous understanding (see @saspus comment above) was that basic duplicacy operations would end up leading to higher costs when interacting with google cloud archive storage. I’m curious to know what happens when duplicacy runs a check or a prune operation, in terms of how it interacts with stuff that’s been moved to archive.
At £1.20/month for 535GB, you are spending significantly less than Wasabi or B2, but with the reliability of Google.
EDIT: It seems your costs must have been significantly higher for the first year, as the GCS Autoclass docs explain that nothing will be transitioned to Archive until 365 days have elapsed. So I wonder if you could report what it cost to store that much data for the first 12 months?
i may have slightly different requirements, but yes, the costs were higher initially. from what I recall you get a 3 month trial anyway, so it was free the first 3 months and by the time I started paying it was £2.50 per month, which as you say is more or less where it sat until 12 months (i paid for 9), and then it dropped to around £1.20.
it wasn’t super intuitive to work out how much it ‘would’ have cost in the first 3 months (i was trying to work it out at the time) because I find their UI slightly impenetrable - it’s not really designed for this small-scale use. iirc it was around £2.50 per month as well.
i kinda took @saspus’ advice: why would i use a storage provider if i don’t trust them to store my data? it’s why i dropped wasabi and went to GCS. if i can’t trust google’s reliability, then whose? so I never run a full -chunks check on GCS, because that would indeed mess with the ‘Archive’ class, and it would be expensive. I guess if i were super paranoid i could at some point spin up some compute and run a -chunks check from there.
i run a normal check job daily, and run a prune once per week (this was mainly for duplicacy’s benefit, because i found after a while that having thousands of revisions really slowed the UI down - and it’s also kinda pointless, i don’t need 4-hourly backups from 4 years ago).
GCS is also not my ‘only’ backup, it’s the off-site backup. i back up to an in-machine regular HDD with EC (not the best, i know) and also to a RAID-1 NAS with EC (also not the best - i’d zfs it but i’m using the built-in QNAP firmware for now). for the local ones i do daily checks and weekly prunes, and also weekly -chunks checks.
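in script form, the daily/weekly jobs above are roughly this (storage names and the -keep values are placeholders, not my exact settings):

```
# daily: metadata-only check – confirms every chunk referenced by the snapshots
# exists on the storage, without downloading chunk contents
duplicacy check -storage gcs

# weekly: prune old revisions so the revision list stays manageable
duplicacy prune -storage gcs -keep 30:360 -keep 7:90 -keep 1:7

# weekly, local storages only: full verification that downloads every chunk and
# verifies its hash – never run against GCS, as it would pull chunks out of Archive
duplicacy check -storage local-hdd -chunks
```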
Yes, this is where I am too.** If possible, I’d really just rather use GCS or AWS. The tricky bit, of course, is that those services are either super expensive (standard storage classes) if they work easily with duplicacy, or not obvious fits with duplicacy if they are affordable (archive storage classes).
It’s possible that the GCS “autoclass” is a good compromise, assuming there aren’t some expensive gotchas that you haven’t encountered (or that others would encounter with different data and backup patterns).
I can’t get the 90-day trial as I signed up for Google Cloud Console long ago, but assuming no major unexpected costs for operations, it looks like for my 100GB of data I’d pay:
$2 for the first month ($2)
$1 for months 2 and 3 ($2)
$0.40 for months 4–12 ($3.60)
$0.12/month after year 1
That’s $7.60 for the first year, and then less than a quarter a month after that.
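Here’s the back-of-the-envelope version of that estimate. The per-GB rates are my assumptions from the GCS pricing page (roughly $0.020 / $0.010 / $0.004 / $0.0012 per GB-month for Standard / Nearline / Coldline / Archive), and it ignores the Autoclass management fee and operation charges:

```
# Rough first-year cost sketch for 100GB under Autoclass, assuming untouched data
# moves Standard -> Nearline -> Coldline -> Archive at roughly 30/90/365 days.
awk 'BEGIN {
  gb = 100
  month1     = gb * 0.020       # Standard, month 1
  months2_3  = gb * 0.010 * 2   # Nearline, months 2-3
  months4_12 = gb * 0.004 * 9   # Coldline, months 4-12
  archive    = gb * 0.0012      # Archive, per month after year 1
  printf "first year: $%.2f, then $%.2f/month\n", month1 + months2_3 + months4_12, archive
}'
# -> first year: $7.60, then $0.12/month
```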
From what I’m reading in the opaque documentation, Autoclass seems to be set up in such a way that you avoid the high fees for Class A and Class B operations, but will often pay for higher storage classes for some period of time.
I’m very curious to know how often an operation from duplicacy would cause data to be moved back up to the more expensive standard class.
I might try it out just to get better info on what it costs and how it works.
**FWIW, I know that @saspus and others have reported good current experiences with Storj, but for me “not trusting” the storage provider may also extend to Storj. It’s not clear to me how either the initial financing of this project (through an ICO) or the current payment to people who run nodes (in the STORJ token) is viable long term. Especially because it seems the company has sold almost all of their own coins (their stock), and because almost no one who runs a node finds it financially viable. (Storj is able to charge as little as they do because they are paying so little to the individuals who actually store the data on their homelab hard drives.) If the token price drops, or if people shut down their homelabs, then my data is no longer in good hands.
tbh i was prepared to pay at least as much as wasabi (was about $8 p/m when i stopped it) for GCS, if not more, in return for the increased reliability/trust. the fact it’s ended up being significantly cheaper is a bonus.
at the end of the day, trying it out is not going to cost a lot of money for 100GB, provided you don’t go crazy with a lot of ingress/egress.
It’s important to separate the reliability and viability of a provider going forward from the durability of the data they store.
Any provider can close up shop at any time, some in the more near or distant future, but nothing is forever. At that point one should be able to move the data elsewhere. This means the provider should not be the only backup destination, in case they just disappear overnight.
How the data is stored is important. The Amazon and Google prices quoted here are for a single datacenter. If that datacenter fails (whether consumed by the San Andreas fault or simply offline for a significant period during which you suddenly need your data back), your data will not be accessible. To get some geo-redundancy you need to pay for multi-region tiers, at least doubling the already “super expensive” cost.
Absolutely, as it should – they too can and will disappear at some point in the future. So let’s focus on data durability while they do exist.
I studied their model extensively before trusting them with my data, so let me clarify a few common misconceptions. (And if I sound like a fanboy as a result – well, so be it. I like the project, and I use them a lot, not just for backup.)
The case of the company running out of funding is discussed above – we’ll just move elsewhere. They are, however, doing fine, acquiring new companies, and the number of node operators is increasing in spite of payouts being decreased (the original unsustainable payouts existed to bootstrap the network); see below.
The point is that it should not be financially viable, and isn’t, by design. The payment is low on purpose, to make the scenario you described (running hardware just for Storj, and shutting it down on a whim if something does not go their way) not viable.
Then why do people bother running nodes in the first place? Because the choice is not “run hardware and get payment from Storj VS not run hardware”. The choice is “run hardware and get payment from Storj [by lending them unused capacity] VS run hardware and not get payment from Storj [let that extra capacity go to waste]”.
The point is that Storj pays you for your already-online capacity that is underutilized. Everyone – home labbers and datacenters alike – always does and will have an ever-increasing amount of unused capacity, because you expand according to anticipated future needs, not current utilization (otherwise you’ll need to expand again tomorrow). E.g. if you anticipate adding 2TB of data annually, you won’t buy a 2TB hard drive now. You will buy a 16–20TB hard drive, so it lasts you a few years. And most of those years you will have way too much empty space.
Only people who did not bother to understand the project think that, and they are very vocal on their forum. They built hardware thinking they’d get rich quick, believing this is another “crypto project”, and then leave disappointed when reality strikes.
The vast majority of the rest understand the project and quietly pocket the free money: the hardware is already online anyway, therefore anything above zero is pure profit. Datacenters also participate (that’s how they can offer SOC 2 compliance), not just Joes with a Synology NAS in the closet.
My personal anecdote, a sample size of one:
I run nodes too, and I feel Storj pays too much; I often say so on their forum, urging them to reduce the payouts. Details? Well, I have a home server, another home server at my brother’s house, and yet another one at a friend’s house, all a few states away. We back up to each other (zfs send/receive, three-way). That’s one of my backups. We cumulatively have about 200TB of free space. On each server I started nodes. Last year Storj paid $700. This pretty much covers the electricity cost of running those servers (except for mine – where I live, rates are obnoxious). Will I ever shut down the nodes? No – why would we refuse a free $700/year?! After the initial configuration the nodes have required zero intervention all this time.
The token price is irrelevant. It’s a utility token: it’s not an investment vehicle, it is not a share in the company, and it’s in no way tied to company operations. Operator payments are calculated in USD on the day of payment, converted to tokens, and sent to operators, who immediately convert them into their currency of choice. When Storj runs out of token reserves, they too will buy them on the open market and send them to operators, who will immediately sell them.
The token’s purpose is to avoid dealing with complex international taxes and payments; it becomes the node operators’ responsibility to report these payments. The token also served another purpose originally – to bootstrap everything – but no longer.
And that’s why I trust Storj more than AWS at this point – data is erasure-coded and distributed to a massive number of nodes; I think it’s 29:80 (i.e. any 29 pieces out of 80 are sufficient). You can read more here (Understanding File Redundancy: Durability, Expansion Factors, and Erasure Codes - Storj Docs) or in their whitepaper. If 51 of the nodes holding a segment shut down overnight, my data still lives.
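A quick sanity check on what 29:80 implies (just arithmetic on the numbers above):

```
# 29-of-80 erasure coding: any 29 of the 80 pieces reconstruct a segment.
awk 'BEGIN {
  k = 29; n = 80
  printf "expansion factor: %.2fx (n/k)\n", n / k       # ~2.76x stored overhead
  printf "pieces that can be lost: %d (n-k)\n", n - k   # 51 - hence the claim above
}'
```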
And again, node churn is minimized by keeping payouts small, to discourage people from building hardware, harvesting payouts, and leaving – that will never be profitable. Hence, only long-term operators remain, and this also helps network performance – traffic is not wasted on repairs when nodes leave. There is of course some node churn, but the company can adjust the erasure coding at any time to maintain the stated durability. (I think one can request custom erasure coding levels if needed – but I haven’t tried that, so I can’t comment.)
This is also not really true – node operator payments are a very small fraction of their operating expenses. They are cheap now because they are still the new kid on the block. If they charged the same as Amazon (for a single-region tier), I would still pick them, because you get geo-redundancy and end-to-end encryption for free.
Well, you certainly make a strong case, and I hear you. I especially take the point about the integrity of the data.
The only point where I might come at it slightly differently would be in terms of the value I place on the time it would take to switch to a new cloud provider in the event of failure. You are clearly a 1000x more advanced user than I am, and it’s probably nothing for you to set up new cloud storage if something goes awry with Storj. Whereas I would much prefer, after this period of experimentation and time spent, to set this up and not have to worry about it for a long time.
And in that light, I just think there are lots of reasons to believe that Google is a more reliable long-term provider than Storj. It’s beyond the scope of a technology discussion, but I also think that Storj’s ICO funding model is not totally irrelevant to gauging how long they will be around. Most stable companies do not unload all of their stock for cash, nor pay their suppliers in stock. I take your point that Storj will eventually have to buy the Storj token; it’s just that I think at that point the end might be near for their viability as a company.
But we agree: nothing is guaranteed about the future (certainly not these days) so we can only make choices in the present. I’m very intrigued by Storj, and also by the GCS autoclass options. Maybe I set up both and see how it goes…
That was never stock to begin with; the company’s value has nothing to do with the token’s value. It always was a utility token.
Agree with the rest though, it will take a lot more to bring google down than storj, just based on the size.
So perhaps dual-region Google or AWS is the way to go – the least probability of needing to change things in the future. But then there is the cost of that risk reduction – paying 3x now for storage to save some migration time in the future seems like too much money.
There is a subtle distinction between backup and archive. The archive is never expected to be needed, and almost definitely not in its entirety. The pricing reflects that – cheap to get data in, cheap to store, but very expensive to retrieve, with penalties for early deletion.
Google punishes more than Amazon, so given the choice, I would go with Amazon: they also provide 100GB of free egress monthly – and that should cover the occasional “oops, deleted the wrong file” use case. The problem is that because duplicacy does not support data thawing, you are limited to instant-retrieval classes, and this alone brings the cost to over $4/TB/month plus overhead, and still provides only a single geo location.
At this point storj becomes very attractive.
I think aversion to a future migration should not be a consideration: instead, use whatever is best for your data today. In the case of Storj – if they jack up prices or disappear – I’d just migrate. Switching is not a big deal regardless of technical ability; if you can run duplicacy, you can migrate your data: set up another destination, run duplicacy copy, done (rough sketch below). This can even be done in the Web UI.
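A minimal sketch of that migration with the CLI (storage names, snapshot ID, and the URL are placeholders; -bit-identical is optional – it just lets you also copy chunks between the storages with tools like rclone later):

```
cd /path/to/repository

# 1. add a second, copy-compatible storage to the same repository
duplicacy add -copy default -bit-identical new-offsite my-documents "<new-storage-url>"

# 2. copy all existing revisions from the old storage to the new one
duplicacy copy -from default -to new-offsite

# 3. point future backups at the new storage
duplicacy backup -storage new-offsite
```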
My recommendation would be to try a few targets for a few months and see what works best. It does not need to be the cheapest – I’d argue your time setting things up is not free and must be factored into everything. And if it turns out Amazon hot storage is the most user-friendly – maybe the extra cost is worth it. Especially if you have under 1TB, cost is irrelevant. It’s all under $10/month anyway.
Somewhat relevant article on long term data preservation:
I have 2.5TB backed up to GCS and my monthly cost is ~$4 USD. I do not perform chunk checks on the uploaded data. I’m using the Archive storage class, so if I ever need to restore from GCS it will be very expensive. The initial upload was a bit more expensive because it generated more “operations”: the first invoice was for $35 USD.
I have the same 2.5TB backed up to iDrive also, and there the monthly bill is ~$9 USD. I check all uploaded chunks, and the initial upload did not result in any extra cost.
I started using GCS as an extra layer after reading about people not trusting iDrive (which is my main offsite backup target). However, after 2 years of completely trouble-free iDrive usage I think I will cancel the GCS backup.
I would not – specifically because you don’t know whether it’s trouble-free until you need to restore. I.e., if the data has silently rotted, you won’t detect it until you try to restore your files.
This seems too high. What is the breakdown of charges? Is that API cost, early deletion penalties, or…?
For comparison, with another backup tool that supports Glacier Deep Archive, last month I paid $3.56; I have 3.0TB there in total, part of which is kept in hot storage (indexes).
Amazon charges $1/TB/month for storage, Google $1.20, but that won’t explain an almost 3x difference.
possibly you’re more familiar than i am and i’m misreading the charges, but afaik Autoclass doesn’t charge separately for retrieval or deletion from the ‘archive’ class. it’s all covered in the ‘management’ charge that you pay for the bucket ($0.0025 per 1000 objects per month). my usage is so low it’s hardly worth my time to work out if it would be cheaper to self-manage the classes.