RSA Encryption Use Case - what is RSA overhead?

ebenzacar · 28 March 2022 22:21

I’m new to Duplicacy and was reading through the docs trying to understand the use case of RSA encryption. While I understand that it adds an extra layer of protection (in that the private key can be kept entirely separate from the backup) I cannot seem to understand any other real use case for this behaviour.

Reading through the docs/posts, I ran across the use case where it would allow multiple untrusted sources to back up to the same bucket. While that is great, it also means that no untrusted source can actually restore any data themselves; only a trusted user could restore the data. Is this a real use case in larger organizations?

On the other hand, using RSA encryption to encrypt/decrypt the AES keys adds additional processing to the backup & restore process. Have there been any metrics/performance tests to identify the additional overhead that RSA adds?

Are there other use cases that I don’t see/understand?

Thanks,

Eric

bkeeper · 28 March 2022 23:12

This is exactly the use case.

I need a critical backup on infrastructure that I cannot trust. With RSA, backups can be initiated by a script or by untrusted third parties, but restores need my private key, so the data stays secure.

This is actually very common.

Hope this helps.
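To make the asymmetry concrete, here is a toy sketch of the idea described above: the backup side only ever holds the public key, and the key protecting the data can only be recovered with the private key. This is not Duplicacy's implementation. It is textbook RSA with a tiny modulus and no padding, every name (`wrap_key`, `unwrap_key`, `chunk_key`) is hypothetical, and the random integer is merely a stand-in for an AES key.

```python
import math
import secrets

# Toy textbook RSA, for illustration only: no padding, tiny key, NOT secure.
# P and Q are two well-known Mersenne primes; all names here are hypothetical.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                      # public modulus (about 150 bits)
E = 65537                      # public exponent
PHI = (P - 1) * (Q - 1)
assert math.gcd(E, PHI) == 1   # E must be invertible modulo PHI
D = pow(E, -1, PHI)            # private exponent (needs Python 3.8+)


def wrap_key(chunk_key: int) -> int:
    """Backup side: needs only the public pair (N, E)."""
    return pow(chunk_key, E, N)


def unwrap_key(wrapped: int) -> int:
    """Restore side: needs the private exponent D."""
    return pow(wrapped, D, N)


chunk_key = secrets.randbits(128)   # stand-in for an AES key
wrapped = wrap_key(chunk_key)       # safe to store next to the backup
recovered = unwrap_key(wrapped)     # only possible where D is available
```

A script running on the untrusted machine would only ever call `wrap_key`; `D` never has to leave the trusted side.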

ebenzacar · 29 March 2022 14:54

Actually that explanation helps a lot; thanks. At least I am starting to understand the benefits of RSA.

The biggest advantage that I saw was that sending multiple sources to the same bucket helps with storage costs: the more data you back up to a single location, the greater the chance of deduplication, at the cost of limiting the restore ability to a single trusted entity. Otherwise, you could simply back up to independent buckets, couldn’t you (at the cost of managing different buckets and encryption keys/passwords for each independent infrastructure)?

Thanks,

Eric

towerbr · 29 March 2022 16:23

You are mixing two features:

This is related to deduplication

and this is related to RSA encryption

Deduplication will occur regardless of whether you are using encryption or not.

And regarding deduplication: it works better on similar files. Suppose you have two copies of a file in two different sources (repositories), whether or not they are on the same computer. The chunks of these two copies will be essentially the same, and you will save storage space. Another example is versions of the same file with few changes: depending on the type of file, the chunks added by each backup will be practically just the modified parts. The latter case works great for virtual machines and databases backed up with fixed-size chunks.
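Both cases above can be sketched with a minimal content-addressed chunk store. This is only an illustration under simplifying assumptions, not Duplicacy's actual chunking: the chunk size is absurdly small so the example stays readable, and `split_fixed`, `backup`, and `store` are hypothetical names.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # unrealistically small, chosen only for readability


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data: bytes, store: Dict[str, bytes]) -> int:
    """Add the chunks of data to store; return how many chunks were new."""
    new_chunks = 0
    for chunk in split_fixed(data):
        digest = hashlib.sha256(chunk).hexdigest()  # chunk is named by its hash
        if digest not in store:
            store[digest] = chunk
            new_chunks += 1
    return new_chunks


store: Dict[str, bytes] = {}
original = b"AAAABBBBCCCCDDDD"
first = backup(original, store)               # first backup: every chunk is new
copy_elsewhere = backup(original, store)      # same file from another source
edited = backup(b"AAAABBBBXXXXDDDD", store)   # same file with one block changed
```

Here `first` is 4, `copy_elsewhere` is 0, and `edited` is 1: the identical copy costs nothing extra, and the edited version adds only the chunk that changed.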

ebenzacar · 29 March 2022 17:11

I always figured/assumed that the larger your sample size, the more likely you’ll have duplicated blocks between completely unrelated files. Meaning that if I have a huge media collection that I’m uploading, the chance of some smaller block of a video matching a block in another video or zip file is higher than with just a small set of data.

That said, I started another thread precisely on that point (https://forum.duplicacy.com/t/best-practice-for-b2-bucket-usage) - whether it is worth keeping buckets segmented by ‘types of files’ to increase the chance of dedup, or simply throwing everything into one giant bucket. And that is where I see RSA coming in handy - the ability to allow multiple users/systems to all upload to one giant bucket with the hope that you will increase the amount of data that gets dedup’ed. So essentially, you end up backing up more data, but at minimal storage cost.

Thanks,

Eric

saspus · 29 March 2022 21:46

While this is not wrong, practically the difference is zero. 100x zero is still zero.

It’s easy to verify: put a camera on a tripod and take two consecutive JPEG pictures of the same static thing, e.g. a vase. Then put those two files together and try to compress them with some lossless compression algorithm, like zip. You will likely see almost no space savings: modern media compression algorithms are lossy, and are walking the thin line between compressing the actual data and preparing the scene to reimagine it from scratch. On top of that, the tiny amount of noise present in each photo changes the data enough that the two files will not share identical blocks for (lossless) compression to exploit.

That is to say, the probability that you will have two identical blocks of data of any nontrivial length in any two media files is zero for all intents and purposes. The only exception – if one file is a copy of another file.

Thus keeping media data under any compressing and deduplicating version control system is counterproductive and only provides the convenience of having a backup of everything in one place. I’ve expanded on this in your new thread.
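For anyone without a tripod handy, the experiment can be roughly approximated in a few lines. This sketch rests on one loud assumption: seeded random bytes stand in for already-compressed media, which behave as high-entropy data to a lossless compressor. Real JPEGs also carry headers and metadata, so actual numbers will differ somewhat. `fake_media` and `ratio` are hypothetical helper names, and `lzma` is used instead of zip because its larger dictionary gives the exact-copy case a fair chance.

```python
import lzma
import random


def fake_media(seed: int, size: int = 200_000) -> bytes:
    """High-entropy bytes standing in for an already-compressed media file."""
    rng = random.Random(seed)
    return rng.getrandbits(size * 8).to_bytes(size, "big")


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (1.0 means no savings)."""
    return len(lzma.compress(data)) / len(data)


photo_a = fake_media(1)
photo_b = fake_media(2)   # the "second shot": different bytes throughout

two_shots = ratio(photo_a + photo_b)    # two unrelated high-entropy files
exact_copy = ratio(photo_a + photo_a)   # the one exception: a true copy
```

Under these assumptions `two_shots` stays at roughly 1.0 (no savings at all), while `exact_copy` drops to roughly half, which matches the point that only true copies deduplicate.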