New Feature: Erasure Coding

I suppose that would indeed be an option. It’d be nicer if everything were integrated, but something like Parchive could add parity to the storage after the fact. A reason to prefer integration is that the parity data can be calculated and uploaded to remote storage at the same time; computing it later would require downloading all the data again. Otherwise it would mean processing a local copy of the storage and then copying all the data to the remote host, which is also feasible, but not as ideal.

Anyway, it’s encouraging that parity could be added to future chunks on an existing storage. I originally assumed this would require creating a new storage destination.

I like this idea. Something something, post must be at least 20 characters.

I did wonder myself whether this would be a concern, but the implementation complexity of adding cross-chunk parity would be a nightmare to deal with…

I personally don’t think missing chunks should, under normal circumstances, be a common failure mode. In fact, with the proper logic, Duplicacy should only upload a snapshot file when all the chunks have been uploaded. If chunks go missing later, that most definitely is an underlying storage/filesystem issue, which I don’t think parity should solve. The regular check command is inexpensive enough that it can detect such situations, and you can fix a storage in good time.

However, I do think Duplicacy needs extra checks to make sure at least the size of each uploaded chunk is correct, and that there aren’t situations where we end up with 0-byte files due to lack of disk space or similar failures.
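
For a local or locally mounted storage, a quick manual check along these lines would already catch the 0-byte case (a sketch only; the path is a placeholder for wherever the storage lives):

find /path/to/storage/chunks -type f -size 0   # list any empty chunk files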

A more suitable tool might be SnapRAID, because you don’t have to recalculate the parity for existing data - only for new data. Though you’d have to arrange the sharding through something like symlinks to distribute the chunk directories (unless it can be done through wildcards?).

But as you rightly point out, you could really only do this local to the storage…

I updated the post to include download links for compiled CLI binaries.

Yes, this is already supported.

So we initialize a new storage with erasure coding and then copy from a basic storage to the new storage, right?
I assume:

1- we have to initialize a copy-compatible storage, right?
2- Once initialized, we add erasure coding?

How would that work exactly?

Yeah, the cross-chunk parity was a concern of mine as well, though it’s not too different from the current case where a chunk can’t be pruned because it contains a few bytes of referenced data. Cross-chunk parity would basically just be more data that hangs around a bit longer before it’s pruned. It would increase the amount of space consumed, but that’s what parity does.

Thinking about this more though, I have more questions about the parity being within the chunk. Sure, this protects against data loss within a chunk, but what if that corruption occurs in the headers? Would two lost bytes (one in the starting header and one in the ending header) mean the whole chunk is lost? How robustly does it try to reconstruct the header from the two parts? Duplicacy could try to mix and match the three pieces of the header to satisfy the checksum. Does there need to be a third header to build a consensus? :laughing:

I only ask, because I actually just recently started using Reed Solomon on a project and specifically made sure the parity was calculated to include the header data so it could be reconstructed in the event of a lost packet. Granted, that’s a different use case than here, but it raised the concern in my mind. In my case I also kept the number of data and parity shards outside of the packets, but that is also less flexible.

Despite any critiques I may bring up, I really like the idea of being able to include parity to provide extra protection in places where the storage or transport might not always be reliable.

You raise a very good point actually and one which did cross my mind (re consensus). How would you include the header (which contains a checksum) in the shards, if the shard contents dictate the checksum? :slight_smile:

Another thought is that chunk size and parity parameters can be derived from other chunks. Just not the checksum. And is a 2 byte checksum sufficient?

The header isn’t that important. Even if both copies get damaged beyond repair, you may still be able to guess the chunk size from the file size (assuming the file didn’t get truncated). The number of data/parity shards can be retrieved from the config file.

The current implementation will just bail out if a corrupted header is detected. I was thinking of having a separate program that can recover the original chunk more aggressively.

What is more likely to happen is corrupted shard hashes, in which case we don’t know which shards are untouched. However, since we know the hash of the chunk, we can just brute force all combinations of shards that might be untouched, and for each combination verify the recovered data against the chunk hash until we find the correct one.
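
To give a rough sense of the search space: with 5 data and 2 parity shards there are at most C(7,5) = 21 ways to pick 5 candidate shards to decode from, and with 8 and 3 there are C(11,8) = 165, so trying every combination and checking the result against the chunk hash is cheap.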

In my case, I was concatenating multiple messages and then splitting the whole into equal-sized packets. The header consisted of a message index and how big the next message was. I realized that if I didn’t include that when calculating parity, I’d be able to recover the data, but I wouldn’t know how big the next message was.

Ah OK, that’s good. I thought you were allowing for setting parity independently during each backup. Storing it in the config file is good additional redundancy and would also mean it can be gathered from other chunks.

Given the multiple combinations of damage that could occur (corrupt header, corrupt shard, corrupt shard hashes, truncation, etc.) a separate tool that can perform multiple recovery steps would be an interesting project.

Thanks for the elaboration.

Yes, create a new storage with erasure coding, make it copy-compatible with the original one, then run the copy command. After that you can back up to the new storage instead, and every chunk there will be protected by erasure coding.
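
For reference, the steps would look roughly like this (storage names, the snapshot id, and the URL are placeholders, and the -erasure-coding option shown with add is assumed to behave the same way as with init). add creates the new copy-compatible storage, copy brings the existing backups over, and later backups can target the new storage directly:

duplicacy add -copy default -e -erasure-coding 5:2 ec-storage my-snapshot-id sftp://user@host/backup
duplicacy copy -from default -to ec-storage
duplicacy backup -storage ec-storage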

How can I test/use this with the web UI? If I add a new storage, I’m not able to choose a command-line parameter.

You’ll need to run the CLI to initialize the storage; the web GUI doesn’t support this option yet. You’ll also need to download the CLI to ~/.duplicacy-web/bin and restart the web GUI, which will always use the highest CLI version found there.
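
On Linux that would be roughly the following (using the binary name mentioned later in this thread as an example; adjust for your platform and version):

cp duplicacy_linux_x64_2.6.10 ~/.duplicacy-web/bin/
chmod +x ~/.duplicacy-web/bin/duplicacy_linux_x64_2.6.10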

I’ve already got the new CLI version running but never initialized a storage - I’ll try this today.

What would you suggest as a reasonable value for data and parity shards? How much extra space would you estimate it would use?

I guess 5/2 or 8/3 should be good. 5/2 can withstand a single bad block 1/5 the chunk size, or 2 individual bad bytes. 8/3 can withstand a single block 1/4 the chunk size, or 3 individual bad bytes. However, please keep in mind that no erasure coding can offer 100% protection, and a much more reliable method is to set up a second storage and copy backups between storages using the copy command.
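
As for extra space: the parity shards are pure overhead on top of the data shards, so 5/2 adds roughly 2/5 = 40% per chunk and 8/3 adds roughly 3/8 = 37.5%, plus whatever the per-chunk header takes.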

Sorry to ask, but how do I init for the web interface edition? I want to use OneDrive and tried:
/config/duplicacy_linux_x64_2.6.10 init -e -c 64M -min 32M -max 128M -erasure-coding 8:3 -storage-name QnapDuplicacy QnapDuplicacy odb://Duplicacy/QNAP

This generates a subfolder called “.duplicacy”, but it is not recognized by the web interface. So what do I need to do to make it work with the web interface? (Maybe an example would be great.)

Never mind, I added it using the CLI (which initialized the remote storage in my case). In the WebUI I then added the created storage path.

But one last question:
After creating this storage on my OneDrive, how can I check whether it is working? At the moment everything seems to be normal.

This is exactly what I need. I want to migrate from a Hetzner Storage Box (SFTP) to Backblaze B2 using the copy command and at the same time enable erasure coding.
After running the copy command once, can I just use the “duplicacy backup” command to back up to this new storage?
I am running this on a slow ARM device. Is this faster than running a brand-new backup of the same files to the new destination?

Yes, it should work this way. However, I’m not sure if you really need erasure coding for b2. They should already use erasure coding (and replication) to store your files.

There is definitely some overhead. They claim the encoding can be as fast as 1 GB/s, but with an ARM CPU it can be slower. Moreover, the parity shards are extra data, so if you’re using 8 data shards and 3 parity shards the overhead will be at least 3/8 = 37.5%.
