Erasure Coding for Dummies?

Hello, I’ve been away for a bit but stopped back in and noticed the new feature called Erasure Coding.

I think this is something I am going to like… but I haven’t seen a sort of “basic” explanation of the feature. Can someone provide that? Is it basically parity applied to the chunks?

And is the best way to enable it to delete all my data, create a new storage with it enabled, and re-upload? That’s fine… I just want to know a bit more.

Thanks!!!

This post on Backblaze’s blog explains Erasure Coding really well: Erasure Coding: Backblaze Open Sources Reed-Solomon Code
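To make the “parity applied to the chunks” intuition concrete, here is a toy sketch of the Reed-Solomon idea described in that post. This is emphatically *not* Duplicacy’s implementation — real erasure codes work over finite fields on raw bytes, and the `encode`/`decode` names here are just illustrative — but the core idea is the same: k data values define a polynomial, parity shards are extra evaluations of that polynomial, and *any* k of the k+m shards are enough to recover the data.

```python
from fractions import Fraction

def interpolate(points, x):
    """Evaluate at x the unique polynomial passing through the given
    {xi: yi} points, using Lagrange interpolation with exact rationals."""
    total = Fraction(0)
    for xi, yi in points.items():
        term = Fraction(yi)
        for xj in points:
            if xj != xi:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

def encode(data, m):
    """Turn k data values into k + m shards: shard i is P(i), where P is
    the degree-(k-1) polynomial with P(j) = data[j] for j = 0..k-1.
    The first k shards are the data itself (a 'systematic' code)."""
    k = len(data)
    points = {i: Fraction(v) for i, v in enumerate(data)}
    shards = dict(points)
    for x in range(k, k + m):
        shards[x] = interpolate(points, x)  # parity = extra evaluations
    return shards

def decode(surviving, k):
    """Rebuild the k original data values from any k surviving shards."""
    pts = dict(list(surviving.items())[:k])
    assert len(pts) == k, "need at least k shards to recover"
    return [int(interpolate(pts, x)) for x in range(k)]

data = [10, 20, 30, 40, 50]       # k = 5 data values
shards = encode(data, m=2)        # 7 shards total, like the 5:2 default
del shards[1], shards[4]          # lose any two shards...
print(decode(shards, k=5))        # -> [10, 20, 30, 40, 50]
```

With a 5:2 geometry, any two of the seven shards can be lost or rotten and the original five values are still recoverable; lose a third and recovery fails, which is why the answers below stress that erasure coding only *improves the odds*, it doesn’t guarantee anything.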

You can create a new storage with Erasure Coding enabled and then copy your old backups over. There’s no need to delete the data first, and I wouldn’t suggest it.

@gchen, you might remember our discussion some time ago about having some way to verify data integrity on the server/NAS side. You had some ideas on how to implement this, but I think you had second thoughts at some point. I guess this Erasure Coding goes a bit in that direction, right…? Putting some redundancy on the server side.

So can we “check” the data consistency somehow without re-downloading all the chunks, or is that not available?

Thanks!

Borrowing this thread.

Do people here use Erasure Coding themselves? I’m backing up ~200 GB to an external BitLocker-encrypted hard drive. Does it make any sense for me to use this feature?

Thanks!

No. If the data gets corrupted, BitLocker will fail to decrypt it, and Duplicacy will get no data.

Instead, you can back up to an unencrypted drive and enable erasure coding.

Erasure coding, however, is not a panacea and does not guarantee anything; it just slightly improves the chances of recovery from some disk failures.

If you want reliability, back up to storage that provides data integrity and availability guarantees. Good examples would be a ZFS checksummed array, Amazon S3, Storj, etc. A single hard drive is not such storage.

Answering your question: personally, I don’t. If I’m tempted to enable erasure coding, that means I don’t trust the storage, in which case I pick a different storage rather than trying to slightly improve the broken one with workarounds.


Thank you for the advice!

What settings should I choose? The default seems to be 5 data and 2 parity shards (-erasure-coding 5:2). Unfortunately, the GitHub wiki isn’t very helpful. I think the devs should document the software better (perhaps add script examples and so on) and put it all in one place, instead of half being on the website and the other half on GitHub.

This is something only you can decide. As suggested above, I would not use erasure coding at all and would instead change the target media to one that can provide data integrity guarantees.

Or you can enable it with the default, “good rule of thumb” parameters. This will tolerate up to two bad/rotten block groups within a single chunk. If you think you need more than that, the solution is not to keep cranking up parity blocks; the solution is to use storage that provides data integrity as a guarantee, not on a “best effort” basis.

I cannot stress this enough: don’t back up to a solitary HDD. A backup should be more reliable than your source, to save the day when everything else fails, and a single HDD is anything but.

Yes, I understand that backing up to a RAID or the cloud is preferable from a storage-safety perspective. Many of us simple users, however, find it more convenient (and cheaper) to back up to an external USB hard drive. I use two external hard drives, with one being off-site.

I use Erasure Coding on external drives. Obviously, keep in mind that it’ll bloat the storage — e.g. the 5:2 ratio suggested by the GUI (which I use) will use 40% more storage — but on these occasions I tend to combine it with -zstd-level best, so it’s a good compromise.
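As a quick sanity check of that 40% figure: the space overhead of a data:parity geometry is simply parity divided by data (ignoring any small per-shard framing overhead, which this back-of-the-envelope sketch does not account for):

```python
# Approximate storage overhead of a k-data, m-parity erasure coding setup:
# every k data shards gain m parity shards, so the extra space is m/k.
def overhead(k, m):
    return m / k

print(f"{overhead(5, 2):.0%}")  # 5:2 as suggested by the GUI -> 40%
```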

saspus is correct in that an encrypted drive will hamper recovery efforts, making the Erasure Coding worthless. However, without an encryption layer like that on top, it can be effective in situations where you might develop a handful of bad sectors. Duplicacy has robust encryption of its own that works with this feature.

Since no storage is ‘guaranteed’, the really important takeaway is to practice 3-2-1: multiple copies, in multiple places. And external drives are just fine — if you assume they may suddenly die one day.

The beauty of this option is that you can copy between storages that have different parameters. Cloud/ZFS storage doesn’t need erasure coding, so you can copy snapshots between such storage (which might have no erasure coding and light compression) and external drives (which might have erasure coding and heavier compression, or whatever you prefer).


Thanks for sharing, Droolio!