S3 Glacier Deep Archive?

aweber · 8 December 2020 14:14

Has anyone tested using an S3 ‘Glacier Deep Archive’ bucket as the storage target? It appears to be ultra-cheap for storage, but a bit “difficult” to restore from… which, for long-term backups, might be OK.

But I’m not clear on what the API restrictions are on that storage type and whether Duplicacy can use it. Also, to restore, it appears you have to explicitly use the AWS API to restore the “file” (maybe chunks in this case) with a retrieval tier (“Bulk Retrieval” or “Standard Retrieval”), and then the actual file/chunk may not be available for 12-48hrs.
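For reference, the explicit restore step described above corresponds to the S3 `RestoreObject` operation. Below is a minimal sketch in Python of the request arguments only, assuming they would be handed to boto3's `restore_object` (boto3 is not imported or called here). The bucket and key names are made up, and the restriction to the Standard and Bulk tiers for Deep Archive is stated as an assumption worth checking against the AWS documentation.

```python
# Sketch: build the arguments for an S3 RestoreObject request on an
# archived object. Nothing here talks to AWS; the dict is what one would
# pass to boto3's S3 client as client.restore_object(**kwargs).

# Assumption: Deep Archive offers only these two retrieval tiers.
DEEP_ARCHIVE_TIERS = {"Standard", "Bulk"}


def build_restore_kwargs(bucket, key, tier="Bulk", days=7):
    """Return keyword arguments for a RestoreObject call.

    `days` is how long the temporary restored copy stays readable.
    """
    if tier not in DEEP_ARCHIVE_TIERS:
        raise ValueError(f"unsupported retrieval tier for Deep Archive: {tier}")
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Bucket": bucket,
        "Key": key,
        "RestoreRequest": {
            "Days": days,
            "GlacierJobParameters": {"Tier": tier},
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket and chunk key, for illustration only.
    print(build_restore_kwargs("my-backup-bucket", "chunks/ab/cdef0123"))
```

A backup tool would have to issue one such request per chunk it needs and then poll until the restored copies become readable, which is the part that makes restores slow.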

So maybe this isn’t a good target/backend? Maybe GC’s Archive storage class is a better fit, but more expensive?

andybao · 12 December 2020 03:11

Very much hoping for support for this too. There was a discussion here previously about this and I think it ended on the note that Duplicacy does “backups” but Glacier is for “archiving” so the two aren’t a good fit for each other.

Personally, I believe Duplicacy should still look towards implementing backups to S3 Glacier. The main reason in my eyes is cost. Archival storage is around 10x cheaper than non-archive storage. That’s the difference between paying $100 per year for backups v.s. paying $1000 per year for backups.

My backups are essentially archives, I basically never touch them. My system is running on a RAID array so the only time I would need my backups is if I’ve been hit by a ransomware attack, house burns down, my RAID array somehow completely fails or I accidentally delete some critical file. None of these events happen frequently to me so most years I never use the restore functionality of my backup software (had a BackBlaze subscription for 2 years before switching to Duplicacy and never used the restore button even once).

Currently the only reason I can afford my backups is because I’m using unlimited storage GDrive…

aweber · 14 December 2020 13:47

I am using unlimited GDrive too. That option runs out next year (technically you have two years grace period, I think…depends on how you read it).

I have a similar use-case. My offsite backups are secondary. My freenas box gets first backups and I’d restore from there if I had to. Offsite is for the obvious reasons.

My question was on the basis of how I understand (roughly) duplicacy works. Given it would delete chunks on a prune, Glacier could get annoyed. Though I think what would happen is that they’d just charge you the difference to 180 days (or 365 days, again, depending on what you read). If chunks, once written, are immutable until deleted, it might not be a stretch to support it.

Droolio · 14 December 2020 15:04

If you’re referring to the changes on 1st June, this is only relevant for consumer products. G Suite etc. is unaffected.

aweber · 14 December 2020 15:19

@Droolio that is indeed what I was referring to. I thought I read that too, but the way I’m seeing the changes reported, it’s very confusing.

Thanks.

saspus · 14 December 2020 17:05

I don’t see why you wouldn’t want to continue with Google Drive?

Google Workspace unlimited is $20/month. But even business basic does not seem to enforce a limit.

andybao · 14 December 2020 19:30

With the introduction of Google Workspace, Google made a change in the fine print, changing the “Unlimited storage” stipulation to “5TB x # of End Users, with more storage available at Google’s discretion upon reasonable request to Google”.

This change is making me nervous, as I doubt that Google likes to change fine print for fun. I’m worried that they are making this change to prepare to kick out the unlimited users…

saspus · 14 December 2020 20:51

The $20/month account is unlimited. Even in the fine print. Literally says “as much as you need”

The $12/month account is de-facto unlimited: there is no quota defined and I’ve personally already uploaded over 2TB there. (TBH I subscribed intending to pay $20 for enterprise but then noticed there is no quota set at business standard… and I don’t think google makes “mistakes” like that. I’m pretty sure that is intentional. Effectively they just got rid of 5-user requirement that only existed on paper anyway, and dropped price)

I guess they likely are not going to be enforcing it just as they were not enforcing the 5-minimum user requirement.

Again, if you don’t want to rely on “likely” — $20/month is unlimited even on paper.

And finally, even if they force you to pay $20 per 5TB — it’s still cheaper than anything else. But we’ll deal with that when and if that happens.

andybao · 14 December 2020 21:11

The pricing page is not the “fine print” that I was referring to. I was referring to this page: Google Workspace Terms of Service – Google Workspace

Again, I would love to rely on hopes and dreams but I don’t believe they changed the wording on that page by accident. I’m sure there was some reason they changed “Unlimited” to “5TB x # of End Users, with more storage available at Google’s discretion”.

This is also not true. Glacier is much cheaper at only around $1 per TB. Of course this doesn’t account for data retrieval costs but my point is that many people rarely need to use their backups (and I believe the fact that most people don’t even have backups supports this point).

saspus · 14 December 2020 21:41

This clause (which reads “with more storage available at Google’s discretion upon reasonable request to Google”) is to prevent some dude with 15PB of data from abusing the service. And what exactly is a reasonable request? “— More data please! — here you go!” ?

Anyway, this all does not matter. Today there is no cap. Once they start enforcing caps (they haven’t done it on 1TB g-suite accounts and instead they dropped price) — then we can look for alternatives.

This is indeed storage cost only. There are also traffic charges. And API costs. It gets very, very expensive even just to back up unless you optimize for Glacier specifically.

And then, a backup that you don’t plan to restore is called an archive. And I’m sure there are archival tools that already include all those optimizations. But they are not backup tools.

It’s so cheap precisely because it’s unusable for anything else. Backup tools must be extremely simple, minimal, and easy to understand to be trusted. Building complex contraptions just to use wrong service for the job would be highly counterproductive and instantly negate any perceived cost savings at the first restore.

Maybe Acrosync can come up with archiving tools later. But that’s an entirely different industry.

Anyway, this is all to say that in my opinion:

archival storage shall not be supported.
Google is the best solution today for people with 2+TB of data; there is no reason not to use it.
the very best approach would be to pay for actual commercial cloud storage: Google Cloud Storage, B2, S3, Azure, etc. It will be more expensive but that’s what it costs. Storing data is expensive. Still cheaper than a NAS on-premises. Backup of your data is not a place for compromises.

andybao · 14 December 2020 22:18

That’s a lot of claims without any evidence. Let’s take a look at the S3 Glacier Deep Archive pricing page. It costs $0.05 per 1k PUT/LIST requests. Duplicacy chunks are ~4MB on average, so it costs us $5 to upload 400GB of data (100,000 chunks) and ~$125 to upload 10TB of data. That’s not very expensive in my eyes.
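The arithmetic above can be checked with a few lines of Python. This is only a sketch under the post's own assumptions: $0.05 per 1,000 PUT requests, an average chunk of about 4 MB (decimal units), and one PUT per chunk. The function and constant names are made up for illustration.

```python
import math

PUT_PRICE_PER_1K = 0.05       # USD per 1,000 PUT requests (figure quoted in the post)
AVG_CHUNK_BYTES = 4 * 10**6   # ~4 MB average chunk, decimal units as in the post


def upload_request_cost(total_bytes, chunk_bytes=AVG_CHUNK_BYTES,
                        price_per_1k=PUT_PRICE_PER_1K):
    """Estimated PUT cost in USD, assuming one PUT request per chunk."""
    chunks = math.ceil(total_bytes / chunk_bytes)
    return chunks / 1000 * price_per_1k


if __name__ == "__main__":
    for label, size in (("400 GB", 400 * 10**9), ("10 TB", 10 * 10**12)):
        print(f"{label}: ${upload_request_cost(size):.2f}")
```

With these inputs the script reproduces the $5 and $125 figures. A smaller average chunk size raises the request cost proportionally.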

Note that we don’t have to worry about upload traffic as ingress traffic is free.

What about cost to “check”/“prune” the storage? Well, taking into account nesting, we need 1 + 26 * 26 = 677 LIST requests to discover every chunk in storage. Since we pay $0.05 per 1k LIST requests, it costs us around $0.05 per “check” pass. EDIT: Realized that I might have to account for pagination but that shouldn’t change this number by much.
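To see how much pagination matters, here is a rough Python sketch. It keeps the post's 26 * 26 subdirectory count and $0.05 per 1,000 LIST requests, and additionally assumes that chunks are spread evenly across subdirectories and that each LIST response carries at most 1,000 keys (the S3 page size). The function names are hypothetical.

```python
import math

LIST_PRICE_PER_1K = 0.05   # USD per 1,000 LIST requests (figure quoted in the post)
MAX_KEYS_PER_PAGE = 1000   # S3 returns at most 1,000 keys per LIST response


def list_requests(num_chunks, num_dirs=26 * 26, page_size=MAX_KEYS_PER_PAGE):
    """LIST requests needed to enumerate chunks spread evenly over `num_dirs`.

    One request lists the top level; each subdirectory then needs at least
    one request, plus extra pages when it holds more than `page_size` keys.
    """
    per_dir = math.ceil(num_chunks / num_dirs)
    pages_per_dir = max(1, math.ceil(per_dir / page_size))
    return 1 + num_dirs * pages_per_dir


def list_cost(num_chunks, **kwargs):
    """Cost in USD of one full enumeration pass."""
    return list_requests(num_chunks, **kwargs) / 1000 * LIST_PRICE_PER_1K


if __name__ == "__main__":
    for chunks in (100_000, 2_500_000):
        print(chunks, list_requests(chunks), f"${list_cost(chunks):.3f}")
```

Under these assumptions, 100,000 chunks still need 677 requests, while 2.5 million chunks (roughly 10TB) need 2,705 requests, or about $0.14 per pass. Pagination raises the number several-fold, but it stays far below a dollar.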

So all-in-all, really not that bad. Definitely not “very very expensive”.

I’m not saying that this is trivial, but I do not think this is rocket science. The only thing preventing us from backing up to Glacier is that the metadata chunks and the file chunks are indistinguishable. We literally just need some way to distinguish them or an option to store them in a separate storage location.

Also, the reason I’m pushing for this so hard is because Duplicacy is simply the best backup tool I have ever used and it’s really close to being my dream backup tool.

towerbr · 14 December 2020 23:12

I have ~2TB of backup, and I just don’t trust any “drive”-type solution, simply because they weren’t made for that.

I totally agree, so I use B2, the cheapest storage solution.

About the price discussion (Glacier vs. the others), mathematics favors Glacier, but there is a more difficult component to value: time. The time it will take to retrieve Glacier files to hot storage in the event of a restore. The time you will spend managing costs, API calls, etc. And - for my level of use - the price difference just doesn’t pay for my time.

forresthopkinsa · 30 May 2022 23:32

Brief, simplistic comparison I threw together that y’all might find interesting. Not comprehensive by any means but it absolutely convinced me to use Glacier for backups.

I really love Duplicacy and I’m hoping I can use it for Glacier backups but based on this thread it sounds like it isn’t supported? What work needs to be done to change that? Can Duplicacy calculate deltas just from LISTs or does it need to learn how to use S3 Object metadata?

sevimo · 31 May 2022 17:03

I think supporting archival storage types is a very worthwhile project goal for the next big thing. As other people mentioned, storage costs in something like S3DG are significantly lower than any of the alternatives, which would allow storing older/more granular revisions for longer, instead of going with more aggressive pruning policies. The tradeoff is indeed retrieval speed/cost, but this should be the target of last resort in a multi-tier backup scheme, meaning you’d be retrieving from here only when all other options have failed. And for that very infrequent event I’d gladly pay the time/money cost of archival retrieval, especially considering you can still do partial retrievals, or spread it out over time if needed.

Given limitations of how archival storage works, I strongly suspect that the optimal solution will involve native support for multi-tier storage. Basically, operate on metadata stored in hot(ter) storage like regular S3, with the data chunks stored in S3DG. And given low costs these chunks would not need to be removed for a very long time, perhaps ever.
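A minimal Python sketch of that routing idea follows. It is not how Duplicacy works today: the `Chunk` type, the `is_metadata` flag and the function names are all hypothetical, and the storage-class strings simply reuse S3's identifiers.

```python
from dataclasses import dataclass

# Storage-class names reuse S3's identifiers; the routing itself is hypothetical.
HOT_CLASS = "STANDARD"
COLD_CLASS = "DEEP_ARCHIVE"


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    is_metadata: bool  # True for snapshot/metadata chunks, False for file data


def storage_class_for(chunk: Chunk) -> str:
    """Route metadata to hot storage and file data to archival storage."""
    return HOT_CLASS if chunk.is_metadata else COLD_CLASS


def plan_uploads(chunks):
    """Group chunk IDs by the storage class they would be written to."""
    plan = {HOT_CLASS: [], COLD_CLASS: []}
    for chunk in chunks:
        plan[storage_class_for(chunk)].append(chunk.chunk_id)
    return plan


if __name__ == "__main__":
    demo = [Chunk("a1", True), Chunk("b2", False), Chunk("c3", False)]
    print(plan_uploads(demo))
```

The key prerequisite is the one raised earlier in the thread: the tool must know at upload time whether a chunk holds metadata or file data.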

This gets us the best of both worlds: it removes the limitations on metadata operations associated with glaciers (and is still cheap because metadata is small), and the bulk of the data would be in the glacier and in all likelihood will never be touched again.

Would really love to see such support on roadmap.

towerbr · 1 June 2022 10:52

This would be a very interesting architecture.

bkeeper · 1 June 2022 13:22

Yes, we would need to separate metadata chunks and code 2 separate pathways for data and metadata.

By default it would all go to the same storage and tier (unless a cold storage is selected), but it would also allow us to choose not only different tiers but also different storage providers entirely.

The most difficult thing would be to code a completely different restore logic for cold storage.

In terms of cost, the savings would depend on the nature of the data.

For fast-changing data without dedup there are no savings at all.

Other needed improvements:

Automate redundant copies of the storage config for security.
Automate local config backup just for convenience tied to the hostname.

sevimo · 1 June 2022 14:27

? The savings are based on the volume of stored data, so if you have the same amount of data stored in S3 vs S3DG, S3DG storage would be significantly cheaper. Whether or not it is fast-changing data is irrelevant, chunks are immutable until pruned (which is a separate decision).

bkeeper · 1 June 2022 18:10

Again, it depends on the data type.

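S3 Glacier storage classes have a minimum storage duration charge (90 days for the regular Glacier classes, 180 days for Deep Archive), so if you have fast-changing data that you can’t dedupe at all, you would still be charged for that minimum.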

If you didn’t need 90-day retention then it would be more expensive than B2, for example.

1,5 TB of new monthly data becomes almost 4,5 TB
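The figure above follows from a simple steady-state model, sketched here in Python. It assumes every chunk is pruned after a fixed number of days, that early deletions are billed up to the minimum duration, and a 30-day month. The 90-day default is the number used in this thread; it is a parameter because the minimum differs by storage class. The function name is made up.

```python
def billed_volume_tb(new_tb_per_month, retention_days, min_days=90,
                     days_per_month=30):
    """Steady-state billed volume in TB.

    Every chunk is assumed to be deleted after `retention_days`, but chunks
    deleted before `min_days` are still charged for `min_days`.
    """
    billed_days = max(retention_days, min_days)
    return new_tb_per_month * billed_days / days_per_month


if __name__ == "__main__":
    # 1.5 TB of new, non-deduplicable data per month, pruned after 30 days.
    print(billed_volume_tb(1.5, 30))                # 90-day minimum
    print(billed_volume_tb(1.5, 30, min_days=180))  # 180-day minimum
```

With a 90-day minimum the billed volume is 4.5 TB, matching the post; with a 180-day minimum it would be 9 TB. Once retention exceeds the minimum, the minimum no longer affects the bill.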

sevimo · 1 June 2022 18:24

Fair enough, though if your data velocity + pruning strategy leads to a chunk lifetime of less than 90 days, glacier-type storage is clearly the wrong choice; you really do need pretty hot storage.

forresthopkinsa · 3 June 2022 19:35

If you’re looking at keeping data in two different storage tiers depending on frequency of access, I would highly recommend checking out S3 Intelligent Tiering. It would probably do most of this work automatically – though I don’t know enough about how Duplicacy works to say for sure. You could also achieve this multi-tier storage a little more manually using tags and lifecycle rules.
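As an illustration of the tags-plus-lifecycle-rules route, the Python sketch below builds a lifecycle configuration in the shape S3 expects for a bucket lifecycle policy. The tag key `chunk-type` and its value are hypothetical: nothing in this thread suggests Duplicacy tags its objects, so the uploader would have to add such a tag first. The script only prints JSON and makes no AWS calls.

```python
import json


def lifecycle_config(tag_key="chunk-type", tag_value="data", days=1,
                     storage_class="DEEP_ARCHIVE"):
    """Build a lifecycle configuration that transitions tagged objects.

    The tag key and value are hypothetical; objects must already carry
    the tag for the rule to match them.
    """
    return {
        "Rules": [
            {
                "ID": f"archive-{tag_value}-chunks",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                "Transitions": [
                    {"Days": days, "StorageClass": storage_class}
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_config(), indent=2))
```

A tag filter is used here rather than a key prefix because, as noted earlier in the thread, metadata chunks and file chunks are otherwise indistinguishable in storage.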

I maintain that the ideal solution would be to store chunk metadata as object metadata in S3 and use LIST/HEAD operations to calculate deltas.

Disclaimer - I work for S3, but I’m commenting here in a purely personal capacity. I use Duplicacy myself and I want to back up my data as cheaply as I can (without sacrificing durability). I don’t often delete this data so minimum storage durations aren’t relevant to my use case.