Erasure Coding and Copy Command

Solverz · 9 June 2021 21:40

I understand when erasure coding is enabled a file is broken into “pieces” and then parity “pieces” are added to this data.
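To make the idea of data “pieces” plus parity “pieces” concrete, here is a minimal sketch that uses a single XOR parity shard. This is a deliberate simplification and not Duplicacy's code: real erasure coding schemes typically use Reed-Solomon codes with several parity shards, and the function names `encode` and `recover` are made up for this illustration.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal data shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)              # ceiling division
    padded = data.ljust(shard_len * k, b"\0")   # pad so the shards line up
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]

def recover(shards: list, original_len: int) -> bytes:
    """Rebuild the data when at most one shard (marked None) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    shards = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the missing one.
        shards[missing[0]] = reduce(xor_bytes, present)
    return b"".join(shards[:-1])[:original_len]

data = b"contents of one backup chunk"
shards = encode(data, k=4)       # 4 data shards + 1 parity shard
damaged = list(shards)
damaged[2] = None                # simulate losing one data shard
restored = recover(damaged, len(data))
```

With one parity shard, any single lost shard can be rebuilt; schemes with more parity shards tolerate more losses at the cost of extra space.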

Now, if erasure coding is enabled on a storage, what happens when you use the copy command to copy this storage and all its backups to a second storage, for example for off-site replication? Are the “pieces” the files are broken into, along with the parity “pieces”, copied, or are just the normal files copied?

Additionally, how does it work when erasure coding is enabled on both storages? Are the “pieces” that the file is broken into, together with the parity, copied to the second storage and then broken down again into more “pieces”?

Apologies if I have understood this wrong!

saspus · 9 June 2021 22:03

You can only copy to and from a storage if it was created as copy-compatible to your other storage.

Nothing is assembled or disassembled during copy: chunks are copied as is (which also explains why you can’t copy between random storages, only between copy-compatible ones).

Solverz · 9 June 2021 22:06

Hi saspus,

I understand you are only able to copy to and from a copy compatible storage, but I still don’t understand how erasure coding works if enabled on both storages.

saspus · 9 June 2021 22:10

Erasure coding affects the content inside a chunk. During a copy the entire chunk is copied, so it does not matter what’s inside; erasure coding and copy operate at different logical levels and don’t affect each other.

Maybe I don’t understand your question…

Solverz · 9 June 2021 22:23

I understand what you mean. However, I don’t think I explained my question very well, so let me try to give you an example.

Storage A has erasure coding enabled which affects data inside the chunks.
Storage B also has erasure coding enabled which also affects data inside the chunks.

Now when backups from storage A (whose chunks have been affected by erasure coding) are copied to storage B, am I correct in understanding that the already affected chunks are affected again by storage B's erasure coding, or does storage B's erasure coding somehow reuse what storage A's erasure coding did to them?

saspus · 9 June 2021 22:43

Erasure coding makes sense inside the chunk; it is not affected by where that chunk resides.
A storage is simply a folder where a collection of chunks is kept, nothing else. It’s a passive directory structure.

Therefore this sentence

is meaningless: a chunk is a file; it cannot be affected by where it is stored. When you copy a file from folder A to folder B, nothing happens to the content of the file. It’s a copy; the content of the chunk file does not change when it is being copied between storages.

Therefore nothing changes with respect to erasure coding, because the file did not change.

A storage does not “use” erasure coding. Only chunks do. And if a chunk contained data written with redundancy, it will continue to contain that same data written with that same redundancy after it has been copied to another storage. You can in fact copy chunk files manually, without using duplicacy copy; this should demonstrate that there are no magic shenanigans going on when copying stuff. It’s literally just a file copy.

Solverz · 9 June 2021 23:04

That makes much more sense now thank you.

So if I understand correctly, another example…
I should be able to copy Storage A (Erasure Coding Enabled) to Storage B (Erasure Coding Disabled) and the erasure coding should still work on that data in Storage B even though it is disabled as the data has already had erasure coding applied when it was created in Storage A?

So in other words, the erasure coding happens in the chunks and not the storage? Which means the backups can be moved anywhere and erasure coding would still work?

Am I also correct in saying that even though Storage B has erasure coding enabled, it will see that the data being copied from storage A already has erasure coding and won’t try to reapply it to the data?

I also assume that the coding scheme (4+2, 8+3, 17+3, etc.) cannot be changed once applied to the data?

gchen · 10 June 2021 02:01

This is incorrect. I think @saspus only referred to the bit-compatible copy – if the second storage is created with the -bit-identical option then both share the same config file and chunks can be copied with third party apps such as rsync or rclone. But you can’t have one storage with Erasure Coding enabled and the other disabled if the -bit-identical option is used.

Edit: the option name is -bit-identical, not -bit-compatible.

saspus · 10 June 2021 02:19

Hmm. There are two terms around adding (copy|bit)-compatible storage:

bit-compatible. Most compatible and straightforward to understand, safe with identical configs, chunks can be copied as is.
Copy compatible, as described here:

-copy <storage name>                 make the new storage compatible with an existing one to allow for copy operations

How is the second case different from the first? Doesn't duplicacy add -copy imply -bit-compatible?

In other words, is the following statement correct: "unless storage was added as -copy or -bit-identical one cannot copy between storages, including with duplicacy copy"?

gchen · 10 June 2021 03:25

That is correct. -copy means the new storage will use the same chunking parameters as the old one so identical files will be split in the same way on both storages, but the resulting chunks can be different due to different encryption keys or erasure coding settings (what is stored inside each chunk is still the same though). Only -bit-identical can guarantee that resulting chunks are exactly the same.
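A small sketch of the distinction described above: two storages that split data identically but hold different bytes on disk. Everything here is an assumption made for illustration. Fixed-size splitting stands in for the real chunker, SHA-256 of the content stands in for a chunk ID, and a repeating-key XOR stands in for encryption (it is not secure). The `Storage` class is hypothetical and is not a Duplicacy API.

```python
import hashlib

class Storage:
    """Hypothetical model of a storage: chunking parameters plus a key."""

    def __init__(self, chunk_size: int, key: bytes):
        self.chunk_size = chunk_size
        self.key = key

    def split(self, data: bytes) -> list:
        # Fixed-size splitting as a stand-in for the real chunker.
        return [data[i:i + self.chunk_size]
                for i in range(0, len(data), self.chunk_size)]

    def seal(self, plain: bytes) -> bytes:
        # Toy "encryption": XOR with a repeating key (its own inverse).
        return bytes(b ^ self.key[i % len(self.key)]
                     for i, b in enumerate(plain))

    def unseal(self, stored: bytes) -> bytes:
        return self.seal(stored)

def chunk_id(plain: bytes) -> str:
    """Content-derived identifier (assumed here to be a plain SHA-256)."""
    return hashlib.sha256(plain).hexdigest()

data = bytes(range(256)) * 8

# Same chunking parameters (as with -copy), different keys.
a = Storage(chunk_size=512, key=b"key-of-A")
b = Storage(chunk_size=512, key=b"key-of-B")

ids_a = [chunk_id(c) for c in a.split(data)]
ids_b = [chunk_id(c) for c in b.split(data)]
blobs_a = [a.seal(c) for c in a.split(data)]
blobs_b = [b.seal(c) for c in b.split(data)]
```

In this model `ids_a` equals `ids_b` (the files are split the same way and each chunk holds the same content), while `blobs_a` differs from `blobs_b` (the stored bytes are not identical). Only giving both storages the same key as well would make the stored chunks byte-for-byte identical, which is the guarantee -bit-identical is described as providing.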

Solverz · 10 June 2021 05:38

Okay, I think I understand this.

However, I’m still confused about what actually happens when backups are copied (with copy-compatible storages) from storage A to storage B with both having erasure coding enabled. When storage B receives the data, does it try to apply erasure coding again or does it realise it has been applied already? Or something else?

Should I be using -copy or -bit-identical?

gchen · 11 June 2021 21:36

A copy job doesn’t simply move one chunk from the source storage to the destination storage. There is some processing involved. Specifically, a chunk downloaded from the source storage must be decoded, decrypted and then decompressed to recover the original data. The original data are then compressed, encrypted, and encoded depending on how the destination storage is configured.
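The processing described above can be sketched roughly as follows. This is only a model under stated assumptions, not Duplicacy's implementation: `zlib` stands in for the compressor, a repeating-key XOR stands in for encryption (not secure), and "erasure coding" is reduced to a toy layer that stores the payload twice behind a made-up marker. `StorageConfig`, `write_chunk`, `read_chunk` and `copy_chunk` are hypothetical names.

```python
import zlib
from dataclasses import dataclass

MARKER = b"EC1:"  # made-up marker for the toy redundancy layer

@dataclass
class StorageConfig:
    key: bytes            # toy encryption key (must be non-empty)
    erasure_coding: bool  # whether the toy redundancy layer is applied

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for encryption; applying it twice restores the input."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def write_chunk(plain: bytes, cfg: StorageConfig) -> bytes:
    """Compress, encrypt, then encode according to the storage's config."""
    body = xor_cipher(zlib.compress(plain), cfg.key)
    if cfg.erasure_coding:
        body = MARKER + body + body  # toy redundancy: payload stored twice
    return body

def read_chunk(stored: bytes, cfg: StorageConfig) -> bytes:
    """Decode, decrypt, then decompress to recover the original data."""
    if cfg.erasure_coding:
        body = stored[len(MARKER):]
        stored = body[:len(body) // 2]
    return zlib.decompress(xor_cipher(stored, cfg.key))

def copy_chunk(stored: bytes, src: StorageConfig, dst: StorageConfig) -> bytes:
    """Recover the original data from the source, re-write it for the destination."""
    return write_chunk(read_chunk(stored, src), dst)

plain = b"hello backup " * 100
a = StorageConfig(key=b"key-A", erasure_coding=True)
b_plain = StorageConfig(key=b"key-B", erasure_coding=False)
b_coded = StorageConfig(key=b"key-B", erasure_coding=True)

on_a = write_chunk(plain, a)
on_b_plain = copy_chunk(on_a, a, b_plain)
on_b_coded = copy_chunk(on_a, a, b_coded)
```

In this model the redundancy from storage A is removed before the chunk is re-written, so a destination with erasure coding enabled encodes the data once (`on_b_coded` has the same length as `on_a`), and a destination with it disabled stores no redundancy at all.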

saspus · 12 June 2021 02:06

Interesting. What is the -bit-identical flag needed for, then, if the copy operation does not require it?

And if copy does all that work decoding and re-encoding (which is effectively restore/backup) — why do we need to explicitly add storage as copy compatible in the first place? In other words, what prevents copy from working between any storages and what makes two storages not copy compatible?

gchen · 14 June 2021 03:51

The same files are split in the same way in copy-compatible storages, so copy-compatibility guarantees that if you ever need to run a backup against the destination storage then it will reuse existing chunks as much as possible.
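A rough way to see the reuse argument, assuming chunks are identified by a hash of their content and using fixed-size splitting as a stand-in for the real chunker (both are simplifications, and the helper names are invented for this sketch):

```python
import hashlib

def chunk_ids(data: bytes, chunk_size: int) -> set:
    """IDs of the chunks produced by splitting data at a given chunk size."""
    return {hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)}

def new_chunks_needed(existing: set, data: bytes, chunk_size: int) -> int:
    """How many chunks a backup would have to upload given what already exists."""
    return len(chunk_ids(data, chunk_size) - existing)

# 2048 bytes of pseudo-random sample data.
files = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))

# Chunks that a copy job placed on the destination (source chunk size: 256).
copied = chunk_ids(files, 256)

# A later backup of the same files, run directly against the destination.
compatible = new_chunks_needed(copied, files, 256)     # same chunking parameters
incompatible = new_chunks_needed(copied, files, 192)   # different parameters
```

With matching parameters the backup finds every chunk already present (`compatible` is 0). With different parameters the same files are split at different boundaries, so none of the copied chunks can be reused and `incompatible` is greater than zero.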

Droolio · 15 June 2021 12:43

Would it be feasible to implement a file-based copy job from non-copy-compatible storages? Perhaps still directly in RAM? Since Duplicacy already repacks chunks between storages, it doesn’t seem much of a stretch to process snapshots one by one and repack files.

I guess it wouldn’t be as efficient doing it that way, but it’d obviate the need for them to be copy-compatible, allow different chunk sizes, and maybe even remove redundant chunks due to more efficient packing.