Google Cloud Storage options

I’ve happily been storing my backups on Backblaze B2 for years now. When I started, it was the cheapest option available. But now I pay ~$7.50 per month. Not a whole lot, but still.

Someone suggested Google Cloud Storage as a cheaper alternative. But it’s not entirely clear to me how this storage service works. I thought I’d post here, as I expect a lot of like-minded people frequent this forum. (Mods: feel free to delete this post if it’s too far off-topic; apologies.)

Google Cloud Storage has 4 different storage classes. All I do is upload encrypted Duplicacy backups to it, and once per month I prune old versions. The marketing material talks of accessing data “once per month” or “once per quarter”, and this confuses me. So I’m wondering: is the “Archive” class appropriate for my use case? If it is, I’ll reconfigure Duplicacy and save $72 per year!
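
Here is the rough arithmetic behind that $72 figure, assuming my roughly 1.25TB of backups and Archive’s advertised price of about $1.20/TB/month (numbers I’ve pieced together from the pricing pages, so please correct me if they’re off):

    # Back-of-the-envelope savings estimate.
    # Assumptions: ~1.25 TB stored, B2 at ~$7.50/month (what I pay today),
    # GCS Archive at roughly $1.20/TB/month (list price as I understand it),
    # ignoring API and retrieval fees.
    stored_tb = 1.25
    b2_monthly = 7.50
    archive_monthly = stored_tb * 1.20            # ~$1.50/month
    print((b2_monthly - archive_monthly) * 12)    # ~$72/year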

Can someone share their thoughts?

The Archive class on GCS has a 365-day minimum retention charge. It is suitable for archiving: storing stuff you never expect to delete or download. The pricing structure reflects that: doing anything with the data is very expensive (deleting early, uploading, egressing), but leaving it alone is cheap. That is ideal for archiving, which is what the name of the tier reflects. Backup is a bit different from archiving, though.

That “once per quarter” access language is a ballpark for the usage pattern that makes the tier financially sensible.

I would not use an archival tier for a primary backup, especially if you expect you might need to restore from it.

Coldline storage would be more appropriate; it’s $4/TB/month, but there are still API fees and egress fees.

At this point you may want to consider STORJ – it’s also $4/TB/month with a modest egress fee, but there are no API costs, and you get geo-redundancy for free.
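
To put rough numbers side by side (list prices and minimum retention periods as I remember them; double-check the current pricing pages before committing):

    # Ballpark storage prices and minimum retention periods, from memory;
    # verify against current pricing pages. Retrieval, API, egress and
    # per-segment fees are extra where applicable.
    tiers = {
        # name            ($/TB/month, minimum retention in days)
        "GCS Archive":    (1.20, 365),
        "GCS Coldline":   (4.00, 90),
        "STORJ":          (4.00, 0),
        "Backblaze B2":   (6.00, 0),
    }
    for name, (price, min_days) in tiers.items():
        retention = f"{min_days}-day minimum" if min_days else "no minimum"
        print(f"{name:13} ${price:.2f}/TB/month, {retention}")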


Great answer. Thanks very much for pointing out the minimum retention charge, and the alternative option.

I also think STORJ is a fantastic service, but there is another cost that has bothered me in the past: they charge per “segment”.

If you, like me, have a very large number of small chunks, it really isn’t worth it.

I think I currently use the default chunk size which I believe is 50MB. I’m guessing this cost could be reduced somewhat by changing the chunk size to 64MB or just below.

This would result in a total segment charge of less than $0.20 per month for me.
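
For concreteness, here is the segment-fee arithmetic as a function of average chunk size, assuming STORJ still charges $0.0000088 per segment per month with a 64MiB maximum segment size, and my roughly 1.25TB of backup data (all of these are worth double-checking):

    # Storj segment-fee estimate. Assumed rates (check the current pricing page):
    # $0.0000088 per segment per month, 64 MiB maximum segment size.
    SEGMENT_FEE = 0.0000088           # USD per segment per month
    SEGMENT_SIZE = 64 * 1024**2       # 64 MiB in bytes

    def monthly_segment_fee(total_bytes, avg_chunk_bytes):
        # A chunk smaller than 64 MiB still occupies one whole segment;
        # larger objects are split into multiple segments.
        segments = total_bytes / min(avg_chunk_bytes, SEGMENT_SIZE)
        return segments * SEGMENT_FEE

    data = 1.25e12                                    # ~1.25 TB of backups
    print(monthly_segment_fee(data, 64 * 1024**2))    # ~$0.16/month with 64 MiB chunks
    print(monthly_segment_fee(data, 8 * 1024**2))     # ~$1.31/month with 8 MiB chunks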

I’ve never considered changing it to a smaller size. Do you do this because your dataset has small files that change, and by reducing the chunk size you’re transferring less data on every increment and reducing the overall storage being used?

The default is 4MiB. It cannot be 50; it must be a power of two.

But I agree, 32-64MiB is likely better for most users, even outside of Storj: fewer chunks, faster enumeration, smaller overhead. And over time the average file size has grown too: more pixels in photos, larger datasets.

Perhaps the default needs to be updated as well, to keep up with the times.

This is a behavioral charge: they want people to increase object sizes to reduce the impact of fixed per-object overhead (latency, time to first byte, accounting, etc.) and improve performance, and financial incentives are the best type of incentives.

All cloud storage providers do that. Sometimes it’s rolled into the cost of API calls (some calls are more expensive than others), sometimes it’s punitive in some other way, like minimum retention fines on colder tiers.

Storj does not charge any of those, and does not play games burying those costs in API calls that don’t map 1:1 to actual costs. Instead they add a per-segment fee, which is insignificant in the optimal use case.

That optimal use case also happens to work better with other storage types: local servers, Amazon AWS, Blu-ray archival discs, etc.

Thanks for clarifying. My config files all contain what I think are the default settings …

    "average-chunk-size": 4194304,
    "max-chunk-size": 16777216,
    "min-chunk-size": 1048576,

But inspecting the files on my storage backend, I’m not seeing what I expected. I thought Duplicacy would be ‘collecting files into a chunk until adding the next file would overshoot the pre-defined size’ and that a new version of that chunk would then be uploaded if any of the files within had changed. Given how much the chunk files vary in size, I’m now not so sure that is how it works! I’ll be reading this when I have a minute and the right mindset to better understand.

Is there a way for me to query my backups for how many chunks are uploaded in a specific storage location? Migrating from $6/TB B2 to $4/TB Storj could be worth spending a few minutes on reconfiguring Duplicacy, maybe.

That’s not my use case. My files are 90% “office” type (spreadsheets, texts, etc). My large media files are synced with Rclone.

If I set a 64MB chunk and change a few KB spreadsheet, I’ll have a new 64MB chunk.


You can use the check command with the -tabular option:

| rev |                          |  files |    bytes | chunks |    bytes | uniq |   bytes |    new |    bytes |
| 217 | @ 2024-09-30 01:35       | 448961 | 490,805M |  92418 | 613,082M |    3 |  4,436K |      7 |  10,996K |
| 218 | @ 2024-10-01 01:44       | 448994 | 491,027M |  92458 | 613,272M |    1 |    778K |     49 | 211,708K |
| 219 | @ 2024-10-02 01:35       | 448995 | 491,027M |  92461 | 613,272M |    6 |  3,191K |      6 |   3,191K |
| all |                          |        |          | 112978 | 741,966M | 7530 | 40,116M |        |          |

Do you have terabytes worth of spreadsheets that change daily? If not, there is no issue. You could use Amazon S3 hot storage too – the cost would still be minuscule.

Don’t use a fixed chunk size. You can keep the minimum at e.g. 1MiB and the maximum at 64 or 128MiB.
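
Chunk boundaries in Duplicacy are content-defined, not fixed offsets: the data stream is cut wherever a rolling hash hits a target value, within the configured min/max sizes. A toy sketch of the idea (not Duplicacy’s actual algorithm; the hash and parameters here are made up for illustration):

    # Toy content-defined chunking, for illustration only. Duplicacy's real
    # hash function and parameters differ; the principle is the same: cut
    # wherever a rolling hash hits a target, within min/max bounds.
    MIN_SIZE = 1 * 1024**2     # 1 MiB
    AVG_SIZE = 4 * 1024**2     # 4 MiB (target average)
    MAX_SIZE = 16 * 1024**2    # 16 MiB
    MASK = AVG_SIZE - 1        # boundary when hash & MASK == MASK, ~once per AVG bytes

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of variable-size chunks."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) ^ byte) & 0xFFFFFFFF      # toy stand-in for a rolling hash
            size = i - start + 1
            if (size >= MIN_SIZE and (h & MASK) == MASK) or size >= MAX_SIZE:
                yield start, i + 1
                start, h = i + 1, 0
        if start < len(data):
            yield start, len(data)

Because the cut points depend on the data itself, an edit re-uploads only the chunks that actually contain changed bytes, and the boundaries elsewhere don’t shift, so the rest of the backup is untouched.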

With my current configuration, I’d spend ~$1.70 on segment fees over at STORJ. A perfectly tuned configuration for my 1.25TB of stored data would cost $0.17 in segment fees. Let’s say some playing around with chunk size gets me to somewhere in the middle, at about $1 per month in segment fees. The total monthly bill at STORJ would then be $6.

At Backblaze, I currently pay $7.50.

After the transition period where I run up a double bill, I’d be saving $18 per year. Obviously savings would be bigger for those with larger storage needs! :wink:
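
In numbers (assuming STORJ at $4/TB plus roughly $1/month in segment fees after tuning, and B2 at $6/TB):

    # Monthly bill comparison for ~1.25 TB (assumed prices: Storj $4/TB plus
    # ~$1/month in segment fees after tuning chunk size, Backblaze B2 $6/TB).
    stored_tb = 1.25
    storj = stored_tb * 4 + 1.00      # ~$6.00/month
    b2 = stored_tb * 6                # ~$7.50/month
    print((b2 - storj) * 12)          # ~$18/year saved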

I’ve not given up on Google Cloud “archive” storage though. I’m wondering if I should split my cloud backups across two providers: the stuff like my media library, which has not changed in years, could happily sit on Google Cloud while the more dynamic stuff sits on STORJ. But I’m not sure it’s a good idea to split things across multiple providers, with all the administrative hassle that involves.

Here is an improvement on this idea:

Media files are immutable, incompressible, and non-deduplicable. Therefore, there is not much sense in wasting time and resources attempting to compress, deduplicate, or version them: they only change when they are corrupted.

Hence, there is no need to use any versioning backup software; instead, create a bucket with object lock on Amazon Glacier Deep Archive (it’s cheaper than Google, has a shorter minimum retention, and you can restore a few hundred gigabytes per month for free) and rclone copy all your media there. (If you want to encrypt your media, that’s fine too: you can use the rclone crypt backend, but I myself don’t bother.)

And for a tiny subset of active data you can use literally any provider; the difference will be pennies.

I use a very similar approach – the bulk of my media (photos, videos, movies, etc.) goes to Glacier Deep Archive, and the proverbial ~/Documents folder goes to Storj. I have about 3TB on Glacier and about 400GB on Storj. I pay about $3.60 a month to Amazon (this also includes a few other cloud services, I don’t remember the breakdown) and Storj is $1.50.

Storing on AWS Glacier Deep Archive costs $2/TB, with a minimum 180-day retention period. That would work for things like archives and media files, like you said. However, retrieving data costs $90/TB. Of course I hope to never need that, as it would mean several other backup layers have failed and I likely have really big problems. But still, that is quite the price tag!

Using Duplicacy for this archive, the only thing it would do is create some metadata files for each revision (the /snapshots/ files?), but as all chunks will remain the same (or perhaps some are added when I add stuff to the archive), the additional cost would be fairly minimal, even if each snapshot file (<1kB in my case) gets stored.

I was not familiar with the rclone software, but from what I gather (and if unencrypted), using specific tools (like Cyberduck) I could access the S3 GDA bucket and basically see a browsable filesystem, as if it’s on an FTP server (am I dating myself here?).

I’m not sure where you are getting these numbers. Storage is $1/TB/month, and retrieval cost depends on how quickly you need the data back. Bulk retrieval costs $0.0025/GB, which is $2.50/TB. You can restore up to 100GB/month for free.

But yes, this is an insurance policy. Restore cost does not really matter, you never expect to need it.
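
As a worked example, using us-east-1 list prices as I recall them (verify before relying on this), for roughly 3TB of media:

    # Glacier Deep Archive ballpark, us-east-1 list prices as I recall them:
    # storage ~$0.00099/GB/month (~$1/TB), bulk retrieval ~$0.0025/GB,
    # plus internet egress (~$0.09/GB) beyond the monthly free allowance.
    stored_gb = 3000                                  # ~3 TB of media
    storage_per_month = stored_gb * 0.00099           # ~$3/month to keep it
    full_bulk_restore = stored_gb * 0.0025            # ~$7.50 one-off retrieval fee
    full_egress = stored_gb * 0.09                    # ~$270 if downloaded over the internet
    print(storage_per_month, full_bulk_restore, full_egress)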

Duplicacy cannot use Amazon Deep Archive: it has a thawing requirement. There was talk about adding support but it went nowhere. Google Archive is the only archival tier you can use with Duplicacy that is not outrageously expensive. Amazon Glacier Instant Retrieval will also work – but it costs the same as STORJ to store, plus retrieval and transfer fees, so it makes no sense in this scenario.

Yes, both rclone and Cyberduck do similar things: you can see the file list, but to restore your data you need to first submit a thawing request, wait some time (up to 12 hours, depending on the cost you are willing to pay), and then restore from the S3 Standard tier the data gets moved into.

Some tools automate that, e.g. Amazon S3 — Cyberduck Help documentation
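
If you’d rather script the thaw step than click through a GUI, a minimal boto3 sketch looks roughly like this (bucket and key names are placeholders):

    # Minimal thaw request with boto3; bucket and key names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.restore_object(
        Bucket="my-archive-bucket",
        Key="media/photos-2020.tar",
        RestoreRequest={
            "Days": 7,                                  # keep the thawed copy around for 7 days
            "GlacierJobParameters": {"Tier": "Bulk"},   # cheapest (and slowest) retrieval tier
        },
    )

    # Later, check whether the thaw has finished before downloading as usual.
    head = s3.head_object(Bucket="my-archive-bucket", Key="media/photos-2020.tar")
    print(head.get("Restore"))   # e.g. 'ongoing-request="true"' while still thawing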

AWS pricing isn’t the same globally. I picked Singapore as the datacenter nearest to me.

Ah, ok.

Thanks for all the comments. I’m going to mull this over a bit before I make a final decision on how I’ll proceed. It might delay me saving a few bucks per month, so I’ll not spend that just yet! :wink:

Right, but why pick the closest one, which also happens to be double the cost? They clearly don’t want you to – that’s what that price means :slight_smile:

You shall pick the cheapest from the ones furthest away from you: I’m on the west coast of the US and I back up to Ireland. Why? Because if Yellowstone wakes up, my data is safe across the pond (for a few days – we are all screwed in that scenario anyway, it’s just a thought experiment, but this accomplishes geo-redundancy). I don’t care about the speed. It’s a last-resort backup. It will take 12 hours to thaw the data – there is no hurry.

Sure, think it over – the idea being that the whole setup shall survive decades, so it shall be portable and simple. Both Duplicacy and rclone are open source, Go isn’t going anywhere, nor is AWS. But if they do – you can build everything from scratch for future CPU architectures, or use existing binaries to migrate to new storage.

I’ve been testing the following setup for a while: I back up to Backblaze B2 buckets in the US with Duplicacy, and then copy those buckets to AWS Deep Archive in Europe using Rclone. The idea of the second copy in AWS is just to be that insurance if all else fails, i.e. I hope I never have to use it.

Ideally, I would be able to do the copy with Duplicacy as well (they are backups with RSA keys), as this would also include a built-in check.

This is quite an expensive insurance – if you need to restore even a single file, you’ll have to unfreeze pretty much the entire dataset.

Copy decrypts the chunks, so you need to provide the password and the key.

related tangent

On a related tangent, I don’t encrypt my backups: my backups are for me. It’s personal data, not blueprints for a military-grade time machine. I don’t want anything to reduce data availability; having another key I can lose just adds an unnecessary point of failure. Any storage account is already access-controlled.

I know some people are way more paranoid than me — but I realized that five levels of security is excessive and counterproductive. In my thought experiment where all my data is suddenly posted publicly — I will be mildly inconvenienced, not enough to suffer through password management for a lifetime.

I also strongly believe passwords are evil and must go. I’m happy to see the progress in this area — more and more services implement passkeys now.

I guess I could use a filter with restore if needed.

However, keep in mind that this backup is not intended for restoration. Instead, the one from B2 should be used. This backup exists solely for the extremely unlikely scenario (0.0000001% chance) that B2 encounters a catastrophic issue with my data, or in case the “Yellowstone” event you mentioned happens.

That’s why I use Rclone—so I can perform the copy without needing to provide the password.

That’s probably me :joy:

I’m simply aiming to fulfill the “2” in the “3-2-1-1-0” backup principle. By encrypting the backups, I can store them virtually anywhere without concerns about unauthorized access. This is one of Duplicacy’s key strengths: it’s agnostic to storage location.
