New Feature: Erasure Coding

This feature is available as of CLI version 2.7.0.

Usage

To initialize a storage with erasure coding enabled, run this command (assuming 5 data shards and 2 parity shards):

duplicacy init -erasure-coding 5:2 repository_id storage_url

Then you can run backup, check, prune, etc. as usual.

When a bad chunk is detected, you’ll see log messages like this:

  Restoring /private/tmp/duplicacy_test/repository to revision 1
  Recovering a 1824550 byte chunk from 364910 byte shards: ***--**
  Downloaded chunk 1 size 1817347, 1.73MB/s 00:00:11 9.0%
  Recovering a 6617382 byte chunk from 1323477 byte shards: **--***
  Downloaded chunk 2 size 6591322, 8.02MB/s 00:00:02 42.0%
  Recovering a 5136934 byte chunk from 1027387 byte shards: --*****
  Downloaded chunk 3 size 5116593, 12.90MB/s 00:00:01 67.6%
  Recovering a 2515494 byte chunk from 503099 byte shards: -*****-
  Downloaded chunk 4 size 2505558, 15.29MB/s 00:00:01 80.1%
  Recovering a 3984934 byte chunk from 796987 byte shards: --*****
  Downloaded chunk 5 size 3969180, 19.07MB/s 00:00:01 100.0%
  Downloaded file1 (20000000)

To check whether a storage is configured with erasure coding, run duplicacy -d list and it will report the number of data and parity shards:

Data shards: 5, parity shards: 2

Encoding Format

The encoded chunk file starts with a 10-byte unique banner, then a 14-byte header containing the chunk size and parity parameters, followed by the hash of each shard, then the shard contents, and finally the same 14-byte header again for redundancy:

-----------------------------
| duplicacy\0003 (10 bytes) |
-------------------------------------------------------------------------------------------------
| chunk size (8 bytes) | #data shards (2 bytes) | #parity shards (2 bytes) | checksum (2 bytes) |
-------------------------------------------------------------------------------------------------
| hash of data shard #1 (32 bytes)   |
--------------------------------------
...
| hash of parity shard #1 (32 bytes) |
--------------------------------------
...
| data shard #1   |
-------------------
...
| parity shard #1 |
-------------------
...
-------------------------------------------------------------------------------------------------
| chunk size (8 bytes) | #data shards (2 bytes) | #parity shards (2 bytes) | checksum (2 bytes) |
-------------------------------------------------------------------------------------------------
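
For illustration, here’s a rough sketch of how a chunk might be serialized in this layout, using the github.com/klauspost/reedsolomon library. The banner value, the byte order, the shard hash function (SHA-256 here), and the checksum function are all placeholders standing in for whatever the actual implementation uses:

    package main

    import (
        "bytes"
        "crypto/sha256"
        "encoding/binary"
        "fmt"

        "github.com/klauspost/reedsolomon"
    )

    // checksum16 is a stand-in for the real 2-byte header checksum;
    // the post doesn't specify the actual algorithm.
    func checksum16(b []byte) uint16 {
        var sum uint16
        for _, c := range b {
            sum = sum<<5 ^ sum>>11 ^ uint16(c)
        }
        return sum
    }

    // encodeChunk lays a chunk out as banner | header | shard hashes |
    // shard contents | header, as diagrammed above.
    func encodeChunk(chunk []byte, dataShards, parityShards int) ([]byte, error) {
        enc, err := reedsolomon.New(dataShards, parityShards)
        if err != nil {
            return nil, err
        }
        shards, err := enc.Split(chunk) // pads the data and allocates parity shards
        if err != nil {
            return nil, err
        }
        if err := enc.Encode(shards); err != nil { // compute the parity shards
            return nil, err
        }

        // 8-byte chunk size + two 2-byte shard counts + 2-byte checksum = 14 bytes.
        header := make([]byte, 14)
        binary.LittleEndian.PutUint64(header[0:8], uint64(len(chunk)))
        binary.LittleEndian.PutUint16(header[8:10], uint16(dataShards))
        binary.LittleEndian.PutUint16(header[10:12], uint16(parityShards))
        binary.LittleEndian.PutUint16(header[12:14], checksum16(header[:12]))

        var buf bytes.Buffer
        buf.WriteString("duplicacy\003") // 10-byte banner (placeholder value)
        buf.Write(header)
        for _, s := range shards { // one 32-byte hash per shard, data then parity
            h := sha256.Sum256(s)
            buf.Write(h[:])
        }
        for _, s := range shards { // shard contents, data then parity
            buf.Write(s)
        }
        buf.Write(header) // trailing copy of the header for redundancy
        return buf.Bytes(), nil
    }

    func main() {
        encoded, err := encodeChunk(bytes.Repeat([]byte{0x42}, 1000000), 5, 2)
        fmt.Println(len(encoded), err)
    }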

This is neat.

A consideration I’d be interested in is separating parity from data entirely. This would allow adding parity to any existing storage.

The first idea I have, which seems reasonably feasible to implement, would be to treat a single chunk as X data shards. Y parity shards could then be added at any time. This allows parity to be pruned at the same time as chunks and maintains compatibility between storages and versions that do and don’t support parity. This wouldn’t protect against missing chunks very well, but that could be mitigated by also adding a second level of parity across multiple chunks.
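
Roughly, I’m imagining something like this, where each existing chunk file gets a separate parity file next to it (a sketch using the github.com/klauspost/reedsolomon library; the .parity sidecar name is just something I made up):

    package main

    import (
        "log"
        "os"

        "github.com/klauspost/reedsolomon"
    )

    // writeSidecarParity treats an existing chunk file as dataShards data
    // shards and writes parityShards parity shards to a separate file,
    // leaving the chunk itself untouched.
    func writeSidecarParity(chunkPath string, dataShards, parityShards int) error {
        data, err := os.ReadFile(chunkPath)
        if err != nil {
            return err
        }
        enc, err := reedsolomon.New(dataShards, parityShards)
        if err != nil {
            return err
        }
        shards, err := enc.Split(data) // pads the data, allocates empty parity shards
        if err != nil {
            return err
        }
        if err := enc.Encode(shards); err != nil { // fill in the parity shards
            return err
        }
        out, err := os.Create(chunkPath + ".parity") // hypothetical sidecar name
        if err != nil {
            return err
        }
        defer out.Close()
        for _, shard := range shards[dataShards:] { // keep only the parity shards
            if _, err := out.Write(shard); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        if err := writeSidecarParity(os.Args[1], 5, 2); err != nil {
            log.Fatal(err)
        }
    }

Pruning would then just delete a chunk and its sidecar together, and clients that don’t support parity could ignore the sidecars entirely.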

Actually, unless I’m misunderstanding the way you’ve diagrammed things here, it looks like the current scheme also only protects against corruption within a chunk, not against a missing chunk.


This looks great. Will it be possible to add erasure coding to an existing storage? I’m assuming it would require downloading and re-uploading all chunks, but it would be nice to be able to keep the existing revision history rather than blow it away and start over.

I’d be happy to help test btw; I’m on Linux.

I think this should be done by an independent tool which can take any directory and create a parity directory. Does such a tool exist?

This is correct.

It is possible to add the same option to the backup command to turn on erasure coding even if the storage is not configured for it. However, this would become messy later, as chunks in a single backup could be either unprotected (if they are inherited from an earlier backup) or protected (if they are new).

Could we perhaps achieve the same result by copying to a new storage with erasure coding?
I guess that would require an extension of the copy command, right?
But it would be useful, as it would only copy the necessary chunks.

Would that be possible?

I suppose that would indeed be an option. It’d be nicer if everything were integrated, but something like Parchive could add parity to the storage after the fact. A reason to prefer this being integrated is that the parity data can be calculated and uploaded to the remote storage at the same time. Computing it later would require downloading all the data again; otherwise it would mean processing a local storage and then copying all the data to the remote host. Also feasible, but not as ideal.

Anyway, it’s encouraging that parity could be added to future chunks on an existing storage. I originally assumed this would require creating a new storage destination.

I like this idea. Something something post must be at least 20 characters.

I did wonder myself whether this would be a concern, but the implementation complexity of adding cross-chunk parity would be a nightmare to deal with…

I personally don’t think missing chunks should, under normal circumstances, be a common failure mode. In fact, with the proper logic, Duplicacy should only upload a snapshot file when all the chunks have been uploaded. If chunks go missing later, that most definitely is an underlying storage/filesystem issue, which I don’t think parity should solve. The regular check command is inexpensive enough that it can detect such situations, and you can fix a storage in good time.

However, I do think Duplicacy needs extra checks to make sure at least the size of each uploaded chunk is correct, and that there aren’t situations where we end up with 0-byte files due to lack of disk space and similar failures.
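
Even something as simple as re-statting each chunk after upload would catch the truncated/0-byte cases. A sketch, where the stat callback is a stand-in for whatever size lookup the storage backend exposes:

    package main

    import "fmt"

    // verifyUploadedSize re-checks a chunk after upload; stat is a
    // stand-in for the storage backend's file-size lookup.
    func verifyUploadedSize(stat func(path string) (int64, error), path string, expected int64) error {
        size, err := stat(path)
        if err != nil {
            return err
        }
        if size != expected {
            return fmt.Errorf("chunk %s is %d bytes on the storage, expected %d", path, size, expected)
        }
        return nil
    }

    func main() {
        // A fake backend that reports a truncated (0-byte) upload.
        stat := func(string) (int64, error) { return 0, nil }
        fmt.Println(verifyUploadedSize(stat, "chunks/ab/cdef", 1824550))
    }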

A more suitable tool might be SnapRAID, because you don’t have to recalculate all the parity, only the parity for new data. Though you’d have to arrange the sharding through something like symlinks to a distribution of the chunk directories (unless it can be done through wildcards?).

But as you rightly point out, you could really only do this locally, on the storage itself…

I updated the post to include download links for compiled CLI binaries.

Yes, this is already supported.

So we initiate a new storage with erasure coding and then copy from a basic storage to the new storage, right?
I assume:

1- we have to initiate a copy-compatible storage, right?
2- once initiated, we add erasure coding?

How would that work exactly?

Yeah, the cross-chunk parity was a concern of mine as well, though it’s not too different from the current case where a chunk can’t be pruned because it contains a few bytes of referenced data. Cross-chunk parity would basically just be more data that hangs around a bit longer before it’s pruned. It would increase the amount of space consumed, but that’s what parity does.

Thinking about this more though, I have more questions about the parity being within the chunk. Sure, this protects against data loss within a chunk, but what if that corruption occurs in the headers? Would two lost bytes (one in the starting header and one in the ending header) mean the whole chunk is lost? How robustly does it try to reconstruct the header from the two parts? Duplicacy could try to mix and match the three pieces of the header to satisfy the checksum. Does there need to be a third header to build a consensus? :laughing:

I only ask because I actually just recently started using Reed-Solomon on a project and specifically made sure the parity was calculated to include the header data, so it could be reconstructed in the event of a lost packet. Granted, that’s a different use case than here, but it raised the concern in my mind. In my case I also kept the number of data and parity shards outside of the packets, but that is also less flexible.

Despite any critiques I may bring up, I really like the idea of being able to include parity to provide extra protection in places where the storage or transport might not always be reliable.

You raise a very good point actually, and one which did cross my mind (re consensus). How would you include the header (which contains a checksum) in the shards, if the shard contents dictate the checksum? :slight_smile:

Another thought is that the chunk size and parity parameters can be derived from other chunks, just not the checksum. And is a 2-byte checksum sufficient?

The header isn’t that important. Even if both copies get damaged beyond repair, you may still be able to guess the chunk size from the file size (assuming the file didn’t get truncated). The number of data/parity shards can be retrieved from the config file.
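
To spell that out with the layout diagrammed above (my arithmetic, not actual Duplicacy code):

    package main

    import "fmt"

    // With the shard counts known (e.g. from the config file), the shard size
    // falls out of the encoded file size: the layout is banner (10 bytes) +
    // header (14) + 32 hash bytes per shard + the shards + trailing header (14).
    func guessShardSize(fileSize int64, dataShards, parityShards int) int64 {
        total := int64(dataShards + parityShards)
        return (fileSize - 10 - 14 - 14 - 32*total) / total
    }

    func main() {
        // The first chunk in the log above: 5 data + 2 parity shards of
        // 364910 bytes each, so the file is 10+14+7*32+7*364910+14 = 2554632 bytes.
        shardSize := guessShardSize(2554632, 5, 2)
        fmt.Println(shardSize, "bytes per shard; chunk size at most", 5*shardSize)
    }

The exact chunk size still has to come from one of the two headers, since the last data shard is padded.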

The current implementation will just bail out if a corrupted header is detected. I was thinking of having a separate program that can recover the original chunk more aggressively.

What is more likely to happen is corrupted shard hashes, in which case we don’t know which shards are untouched. However, since we know the hash of the chunk, we can just brute-force all combinations of untouched shards and, for each combination, verify the recovered data against the chunk hash until we get the correct one.
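
In code, the brute force could look roughly like this (a sketch using the klauspost/reedsolomon library, with SHA-256 standing in for whatever hash identifies the chunk):

    package recovery

    import (
        "bytes"
        "crypto/sha256"
        "errors"
        "math/bits"

        "github.com/klauspost/reedsolomon"
    )

    // bruteForceRecover tries every subset of shards as the "untouched" set,
    // reconstructs the rest, and returns the first result whose hash matches
    // the known chunk hash.
    func bruteForceRecover(shards [][]byte, dataShards, parityShards int,
        chunkSize int, chunkHash [sha256.Size]byte) ([]byte, error) {

        enc, err := reedsolomon.New(dataShards, parityShards)
        if err != nil {
            return nil, err
        }
        n := dataShards + parityShards
        for mask := 0; mask < 1<<n; mask++ {
            if bits.OnesCount(uint(mask)) < dataShards {
                continue // too few shards kept; reconstruction is impossible
            }
            candidate := make([][]byte, n)
            for i := 0; i < n; i++ {
                if mask&(1<<i) != 0 {
                    candidate[i] = shards[i] // trust this shard
                } // the rest stay nil and get rebuilt
            }
            if err := enc.Reconstruct(candidate); err != nil {
                continue
            }
            var buf bytes.Buffer
            if err := enc.Join(&buf, candidate, chunkSize); err != nil {
                continue
            }
            if sha256.Sum256(buf.Bytes()) == chunkHash {
                return buf.Bytes(), nil // a corrupted "trusted" shard fails this check
            }
        }
        return nil, errors.New("no combination of shards matched the chunk hash")
    }

With 5 data and 2 parity shards that’s at most 2^7 = 128 combinations, so the brute force is cheap.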


In my case, I was concatenating multiple messages and then splitting the whole into equal-sized packets. The header consisted of a message index and how big the next message was. I realized that if I didn’t include that when calculating parity, I’d be able to recover the data, but I wouldn’t know how big the next message was.

Ah OK, that’s good. I thought you were allowing for setting parity independently during each backup. Storing it in the config file is good additional redundancy and would also mean it can be gathered from other chunks.

Given the multiple combinations of damage that could occur (corrupt header, corrupt shard, corrupt shard hashes, truncation, etc.) a separate tool that can perform multiple recovery steps would be an interesting project.

Thanks for the elaboration.

Yes, create a new storage with erasure coding, make it copy-compatible with the original one, then run the copy command. After that you can back up to the new storage instead, and every chunk there will be protected by erasure coding.
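
For example, with an existing storage named default (the names and URL below are placeholders, and I’m assuming the add command accepts the same -erasure-coding option as init):

duplicacy add -copy default -erasure-coding 5:2 secondary repository_id new_storage_url
duplicacy copy -from default -to secondary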


How can I test/use this with the web UI? If I add a new storage, I am not able to specify a command line parameter.

You’ll need to run the CLI to initialize the storage; the web GUI doesn’t support this option yet. You’ll also need to download the CLI to ~/.duplicacy-web/bin and restart the web GUI, which always uses the highest CLI version available.

Already got the new CLI version running, but never initialized a storage. I’ll try this today.