What does it mean to turn on Erasure Coding for Online Storage?

Hi. I’m currently in the process of doing a complete backup of my Linux system to Backblaze B2. I’ve been out of the loop with Duplicacy for quite a while, so Erasure Coding was a new feature for me to get my head around upon returning to it.

I have read the Backblaze Blog regarding EC, and watched some YouTube vids on it too, and I think I have wrapped my head around it now.

My question is, what are the implications of enabling it on Online Storage? What are the ideal situations in which you would enable it? For example, say you had a Local NAS, would the NAS be aware of Duplicacy’s EC setting? And would the Data:Parity ratio set on Duplicacy need to match the number of drives in your NAS?

Given that I wanted to get my large Linux system backup underway ASAP, I turned on EC upon storage creation in B2 before I fully understood what it was, since I’d gathered enough to understand it would be more difficult to add later.

However, now I’m wondering if it was somewhat pointless to do on B2, given they already use EC on their side. I don’t understand what an EC setting of 5:2 in Duplicacy would mean when that data arrives at Backblaze. Are they aware of that setting, and can they split the data across 5 drives?

Am I using more online storage than I needed to, or less?

Someone please help me fully understand the implications of using EC :sweat_smile:

Thanks

With EC enabled in duplicacy, some of the data will be written with redundancy. This increases the likelihood that the backup survives certain specific types of media failure.
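To make that concrete with the 5:2 data:parity setting mentioned above (a rough sketch assuming the usual any-k-of-n property of erasure codes, not duplicacy’s actual code): each chunk is split into 5 data shards plus 2 parity shards, and the chunk can be rebuilt from any 5 of those 7 shards.

```python
# Toy illustration of the fault tolerance implied by a 5:2 setting.
# Assumption: "5:2" means 5 data shards + 2 parity shards per chunk, and the
# chunk is recoverable as long as at least 5 of the 7 shards survive
# (the standard property of erasure codes; duplicacy's internals aren't modeled).

DATA, PARITY = 5, 2
TOTAL = DATA + PARITY

for lost in range(TOTAL + 1):
    recoverable = (TOTAL - lost) >= DATA
    print(f"lose {lost} of {TOTAL} shards -> {'recoverable' if recoverable else 'chunk lost'}")

# lose 0, 1 or 2 shards -> recoverable; lose 3 or more -> chunk lost
```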

The only use case for it is if you back up to a single local HDD.

In my opinion, this is a useless gimmick: the storage should guarantee data integrity, and it’s not the application’s job to verify it. The line has to be drawn somewhere: you already have to trust the filesystem, RAM, and CPU to be correct; otherwise the application would have to become a full-fledged OS.

So, if the storage is unreliable, the solution is not to add crutches to make it slightly more reliable; the solution is to switch to reliable storage that guarantees data integrity mathematically, not merely on a “best effort” basis.

Pretty much every cloud storage provider, including B2, Storj, and many others, already provides such guarantees. Enabling additional overhead in duplicacy just increases your spending on storage and provides no benefit.

For local backup to a NAS, use a filesystem such as ZFS or Btrfs that supports data checksumming and healing. If your NAS does not support this, replace the NAS.

An increase in the amount of data stored at the destination (that’s where the redundancy comes from), and therefore in cost.

No. The NAS provides storage; duplicacy writes and reads files from it. That’s where their relationship ends.

These are not related in any way.

Yes, B2 already guarantees data integrity. It’s pointless to do it again in duplicacy; you just end up paying more for bandwidth and storage.

If EC were some magic that allowed data to be compressed beyond what zip can do, everyone would be doing it :)) EC increases storage requirements: it stores more data than strictly needed so that it can tolerate the loss of some of it. It’s as simple as that.
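Putting rough numbers on it for the ~2 TB backup discussed in this thread (illustrative only; compression and deduplication will change the absolute figures): a 5:2 setting writes 7 shards for every 5 shards’ worth of data, i.e. roughly a 1.4x expansion.

```python
# Back-of-the-envelope storage estimate for a 5:2 data:parity setting.
# Assumption: stored bytes scale by (data + parity) / data; compression and
# deduplication are ignored, so the absolute numbers are illustrative only.

data_shards, parity_shards = 5, 2
backup_size_tb = 2.0  # the roughly 2 TB initial backup from this thread

factor = (data_shards + parity_shards) / data_shards
print(f"expansion factor: {factor:.2f}x")                               # 1.40x
print(f"~{backup_size_tb * factor:.1f} TB stored instead of ~{backup_size_tb:.1f} TB")
```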

TL;DR: don’t use it in duplicacy.


Thank you for such a comprehensive answer!
Appreciate it :+1:

You make some interesting points, and kinda confirmed my suspicions as I read and understood more about it. Feel a bit silly enabling it on B2 now…

Would you go so far as to stop the 2 TB initial backup my system is 75% of the way through right now, and start again on B2 without EC?

I read in numerous places that EC is a more efficient use of storage, but at the same time it protects against data loss by increasing the amount of data stored, hence my confusion over whether it uses more or less data.

So do you think it was pointless for Duplicacy to add this feature? Is it literally only useful for backing up to a single local drive?

Glad my ramblings are helpful :smiley:

Yes, I would definitely do that. You’re paying extra not only for storage but also for egress.

Yes, it’s a more efficient way to accomplish fault tolerance than the trivial approach of keeping another full copy of the data. Erasure coding lets you get away with less extra data.

It’s not unlike RAID 5 vs a mirror: with a simple mirror your overhead is 100%, but with RAID 5 it’s 1/n. And yet each arrangement can tolerate the loss of one block.
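As a toy illustration of that analogy (plain XOR parity, which is what RAID 5 uses per stripe; settings with more than one parity shard, like 5:2, need a Reed-Solomon-style code instead, but the principle is the same):

```python
# RAID 5-style single parity in miniature: one parity block protects n data
# blocks, so the overhead is 1/n instead of a mirror's 100%, yet any single
# missing block can still be rebuilt from the survivors.
from functools import reduce

def xor_blocks(blocks):
    # XOR the blocks together column by column
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # n = 4 data blocks
parity = xor_blocks(data)                      # 1 parity block -> 25% overhead

# Simulate losing one data block and rebuilding it from the rest + parity.
lost_index = 2
survivors = [block for i, block in enumerate(data) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])

assert rebuilt == data[lost_index]
print("rebuilt block:", rebuilt)               # b'CCCC'
```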

Yes. Or perhaps a local conventional RAID array that cannot recover from bit rot, until one procures a proper appliance. Also see this: Is erasure coding for Online storage providers recommended? - #2 by gchen

It’s one of those things that looks cool on paper and is fun to implement, but is useless and/or counterproductive in real-life scenarios. But I’m just another user, so my opinion does not have to agree with anyone else’s, including the duplicacy developers’. I’m also of the opinion that fewer features are better than more (look at Kopia: this is what happens when developers are allowed to run unchecked), that support for *drive storage endpoints (OneDrive, Dropbox, Google Drive, etc.) should be dropped by duplicacy, that hot storage is a bad (expensive) choice for backup, and that support for archival storage (Amazon Glacier Deep Archive) is long overdue, etc. :smiley:


Extremely helpful, thank you!

I did as you advised: cancelled that initial backup and started fresh on new B2 storage in a new bucket, this time without EC.

It’s been interesting to learn more about EC, but yes, like you say, in this context, where most people are either backing up to a NAS with redundancy likely built in or to cloud storage with redundancy definitely built in, EC seems like quite a niche use case.

Thanks for imparting your wisdom :slight_smile:
