Deduplication without access to each other's data

Hi all,

First and foremost, thank you for a great piece of software!

I would like to hear your opinions on an alternative way of encrypting the storage. It’s not necessarily a new idea, but it would enable a particular use-case I had in mind.

I searched for existing topics and could not find any.

What if every chunk were symmetrically encrypted with a key derived from the clear-text payload of that chunk, before being encrypted with the globally shared secret (the storage password)? Anyone who possessed the clear-text data could then generate the same clear-text chunk, the same symmetric key, and therefore the same encrypted chunk, so global deduplication would still work. A client who did not possess the clear-text data, however, could not decrypt the chunk, which effectively prevents one client from reading another client’s data.
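A minimal sketch of the scheme described above, in Python with only the standard library. Everything in it is an assumption for illustration rather than Duplicacy's actual code: the function names (`convergent_encrypt`, `convergent_decrypt`, `xor_cipher`) are hypothetical, SHA-256 of the chunk stands in for the key derivation, and the HMAC-based XOR keystream is a toy stand-in for a real cipher.

```python
import hashlib
import hmac


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def convergent_encrypt(chunk: bytes, storage_key: bytes) -> tuple[bytes, bytes]:
    # Inner layer: the key is derived from the clear-text chunk itself.
    chunk_key = hashlib.sha256(chunk).digest()
    inner = xor_cipher(chunk_key, chunk)
    # Outer layer: the shared storage secret. It must be deterministic,
    # otherwise identical chunks would not deduplicate. (This toy reuses
    # one keystream for every chunk; a real design would not.)
    outer = xor_cipher(storage_key, inner)
    return chunk_key, outer


def convergent_decrypt(stored: bytes, chunk_key: bytes, storage_key: bytes) -> bytes:
    inner = xor_cipher(storage_key, stored)
    return xor_cipher(chunk_key, inner)
```

Two clients holding the same chunk and the same storage password produce byte-identical output, which is what lets the storage deduplicate. A client holding only the storage password can strip the outer layer but is left with the inner ciphertext.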

Naturally, one would not need to decrypt a chunk if one already possessed the clear-text. To allow decryption when said data is lost (which is, after all, the purpose of backup), each client that generated a chunk (whether or not it already existed in storage) would have to keep that chunk’s symmetric key. The key would be encrypted with a private, non-shared secret (a per-client private password), along with other metadata, before itself being sent to storage.
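The per-client key storage could look roughly like this. It is a sketch under stated assumptions: `wrap_key` and `unwrap_key` are hypothetical names, the chunk key is assumed to be 32 bytes (e.g. a SHA-256 digest), and the HMAC-based pad plus tag is a toy substitute for a proper authenticated cipher. Unlike the chunk encryption, this layer can use a random nonce, because private metadata does not need to deduplicate across clients.

```python
import hashlib
import hmac
import os


def _pad(client_secret: bytes, nonce: bytes) -> bytes:
    # 32-byte one-time pad derived from the private secret and a fresh nonce.
    return hmac.new(client_secret, b"enc" + nonce, hashlib.sha256).digest()


def wrap_key(client_secret: bytes, chunk_key: bytes) -> bytes:
    """Encrypt a 32-byte chunk key under the client's private secret."""
    nonce = os.urandom(16)
    body = bytes(a ^ b for a, b in zip(chunk_key, _pad(client_secret, nonce)))
    tag = hmac.new(client_secret, b"mac" + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap_key(client_secret: bytes, wrapped: bytes) -> bytes:
    """Recover the chunk key; fails if the secret is not the one used to wrap."""
    nonce, body, tag = wrapped[:16], wrapped[16:48], wrapped[48:]
    expected = hmac.new(client_secret, b"mac" + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong client secret or corrupted entry")
    return bytes(a ^ b for a, b in zip(body, _pad(client_secret, nonce)))
```

Each client would upload its own wrapped keys next to its snapshot metadata. Another client sharing the storage can see those entries but cannot unwrap them without the private secret.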

If I’m not mistaken, functionality that doesn’t require decrypting the actual chunk payload data (checking, copying, pruning etc.) should still work.

This should allow several semi-trusted clients (repositories) to share a common storage, with deduplication, without anyone having access to each other’s data.

I understand that RSA would allow a similar use-case, where several clients who should not be able to decrypt each other’s data can share a storage. However, it would naturally allow the holder of the private key to decrypt all data.

Thank you for your time,
Alex

This is a really awesome idea. Correct me if I’m wrong, but I believe this is called Convergent Encryption, and really, the only weakness is a ‘confirmation’ attack which is mostly irrelevant for shared backup storages anyway.
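To make the 'confirmation' attack concrete, here is a small sketch. It assumes, purely for illustration, that a stored chunk's ID is a deterministic function of the chunk content and the shared storage password; `stored_chunk_id` and `confirm_guesses` are hypothetical names, not Duplicacy functions.

```python
import hashlib
import hmac


def stored_chunk_id(chunk: bytes, storage_key: bytes) -> str:
    # Hypothetical ID: depends only on the content and the shared secret,
    # so every client computes the same ID for the same clear-text chunk.
    chunk_key = hashlib.sha256(chunk).digest()
    return hmac.new(storage_key, chunk_key, hashlib.sha256).hexdigest()


def confirm_guesses(guesses, storage_ids, storage_key):
    """Return the guessed clear-texts that already exist in the storage."""
    return [g for g in guesses if stored_chunk_id(g, storage_key) in storage_ids]


storage_key = b"shared-storage-password"
# Another client has backed up this chunk; the attacker only sees its ID.
storage_ids = {stored_chunk_id(b"salary: 52000", storage_key)}

guesses = [b"salary: 50000", b"salary: 51000", b"salary: 52000"]
confirmed = confirm_guesses(guesses, storage_ids, storage_key)
```

The attacker never decrypts anything. They only learn that a chunk they could already produce themselves is present, which is why the attack matters mainly for low-entropy or widely known data.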

In fact, I’d say this could and should be combined with the RSA feature to shift the public key encryption to the metadata (the symmetric key used to pre-encrypt each chunk) and leave the chunk data to be encrypted by your symmetric key and then by the shared storage password / chunk key (which should be deterministic).

Doing this would allow clients to choose to use the same RSA public-private key (where the symmetric key metadata would itself get deduplicated) or not (and the metadata for decryption is stored for each public-private key pair).

Ah yes, thank you for linking to that, Droolio! That does sound exactly like what I meant. I’m glad to hear it doesn’t seem to have any obvious further weaknesses for this particular use-case.

I can’t comment on your suggested use of RSA in combination, as I can’t say I understand the details of that implementation. I think I understand that you would get further deduplication of otherwise client-private metadata by using RSA. However, I would think the holder of the private key would still be able to decrypt all data? I’m probably misunderstanding something.

I was simply envisioning a fundamentally per-client private system where, besides the confirmation attack, there wouldn’t be any further weakening of the encryption. All data would be client-private, except for multiple clients generating identical chunks when they had identical data.
I’m not sure exactly how much redundant metadata would have to be saved by each client for that to work though.

Thanks for commenting!

Yep, that’s right, but the current RSA (private-public key pair) implementation in Duplicacy encrypts all chunk data with a public key, which prevents a storage from being shared with other users who might want to use their own key pair and still benefit from de-duplication. (AFAIK, Duplicacy only allows one private-public key pair in a storage, but I could be wrong? Either way, no deduplication with RSA and different key pairs.)

With your convergent encryption suggestion, you’d have to add a second layer of encryption for the symmetric keys per client anyway, and IMO the private-public key pair is a perfect way to do that.

By only encrypting the symmetric key metadata with RSA and not the chunk data, each client could have their own private-public key pair and their own copy of metadata used to unlock the deduplicated chunks.

Individual clients can use the same key pair for multiple repositories, or even multiple computers owned by the same user. A family member could use the same storage but with their own key pair.
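The resulting layout can be sketched as a plain data structure. No real cryptography is done here: the encrypted chunks and RSA-wrapped keys are opaque placeholder bytes, and `SharedStorage`, `upload`, and the key-pair "fingerprint" strings are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStorage:
    # chunk_id -> convergently encrypted chunk, shared by every client
    chunks: dict = field(default_factory=dict)
    # key-pair fingerprint -> {chunk_id: chunk key wrapped with that public key}
    manifests: dict = field(default_factory=dict)

    def upload(self, fingerprint: str, chunk_id: str,
               encrypted_chunk: bytes, wrapped_key: bytes) -> None:
        # Chunk data is stored once, whoever uploads it first.
        self.chunks.setdefault(chunk_id, encrypted_chunk)
        # Key metadata is stored once per key pair.
        self.manifests.setdefault(fingerprint, {})[chunk_id] = wrapped_key


storage = SharedStorage()
# Two computers of the same user share one key pair.
storage.upload("alice-keypair", "chunk-1", b"<ciphertext>", b"<key wrapped for alice>")
storage.upload("alice-keypair", "chunk-1", b"<ciphertext>", b"<key wrapped for alice>")
# A family member uses the same storage with their own key pair.
storage.upload("bob-keypair", "chunk-1", b"<ciphertext>", b"<key wrapped for bob>")
```

After these three uploads the storage holds one copy of the chunk and two manifest entries for it, one per key pair. Only the holder of the matching private key can unwrap an entry and decrypt the shared chunk.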

Incidentally, the main benefit of RSA (private-public keys) is that a compromised client, with only a public key stored on it, wouldn’t be able to decrypt anything without the private key (which can be held off-site).

Thank you for explaining. I think I understand it better now. And yes, that does sound like a very interesting solution!