Per Snapshot ID encryption

I wonder if it would be possible to encrypt the data associated with each snapshot ID with a separate key.

Chunks would still need to be encrypted with a common key, as is currently the case, because they are shared between all revisions across all snapshot IDs.

The use case I’m thinking of is a shared storage used by many clients from different organisations, where we still want to take advantage of “global” deduplication across all clients.

Each org would have their own key plus the common key, and then only be able to “see” their own snapshot IDs. They would be able to decrypt arbitrary chunks using the common key, however that is a much smaller issue than being able to enumerate, restore or even delete/prune any revision from any snapshot ID.

Such a design would mean that any prune across all IDs would need to know all keys in order to decide which chunks could be deleted. In the use cases I have in mind, though, this would be run by a provider who manages all access and client setup, so they would have knowledge of the keys anyway.
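To illustrate why a global prune needs every key, here is a minimal Python sketch. The data model is purely hypothetical (plain dicts and sets, not Duplicacy's real snapshot format or API): it assumes each snapshot ID's revisions have already been decrypted into lists of chunk IDs, and it refuses to continue if the key for any ID is missing, since that ID's chunk references would then be unknown.

```python
def unreferenced_chunks(stored_chunks, snapshots_by_id, keys_by_id):
    """Return the chunk IDs that no revision of any snapshot ID references.

    stored_chunks   -- iterable of chunk IDs present in the storage
    snapshots_by_id -- {snapshot_id: [list of chunk IDs per revision, ...]}
    keys_by_id      -- {snapshot_id: key}; hypothetical per-ID keys
    """
    referenced = set()
    for snapshot_id, revisions in snapshots_by_id.items():
        # Without the key for this ID we could not read its chunk list,
        # so deleting anything would be unsafe.
        if snapshot_id not in keys_by_id:
            raise KeyError(f"no key for snapshot ID {snapshot_id!r}")
        for chunk_list in revisions:
            referenced.update(chunk_list)
    return set(stored_chunks) - referenced


stored = {"c1", "c2", "c3", "c4"}
snapshots = {
    "org-a": [["c1", "c2"], ["c2"]],
    "org-b": [["c2", "c3"]],
}
keys = {"org-a": b"key-a", "org-b": b"key-b"}
deletable = unreferenced_chunks(stored, snapshots, keys)
```

A single organisation holding only its own key could list the chunks it references, but it could never safely conclude that a chunk is unused by everyone else.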

Apologies if this has been asked previously, however I couldn’t find it on the forums when I searched.

Possibly bad form to reply to my own post, but while thinking about ways to achieve the desired goal (enabling global deduplication across many users/clients while only allowing clients access to their own backups), I came up with an idea that may or may not work.

  1. Create separate storages for each customer/user on the same system, i.e. /backups/customer1, /backups/customer2, etc.
  2. Have an additional “master” location, eg /backups/master
  3. Run a script/tool that scans each customer storage and hard-links the files under /chunks/* and /snapshots to the equivalent location under /backups/master
  4. If a file already exists under /backups/master/chunks, then instead of hard-linking into that location, the customer's “real” chunk would be replaced with a hard link to the master copy.

In the end this (in theory) would have all chunks and snapshots under /backups/master with individual contents under /backups/customerX storage…but this has basically been “single instanced” via use of hard-links.
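The steps above could be sketched roughly as follows. This is only an illustration in Python, not an existing tool: the function name `link_into_master` and its `dedupe` flag are made up for the example, it assumes all storages live on the same filesystem (hard links cannot cross filesystems), and it assumes that two chunk files with the same relative path really do hold identical content.

```python
import os


def link_into_master(customer_root, master_root, subdir, dedupe):
    """Hard-link files from customer_root/subdir into master_root/subdir.

    If the file is missing in the master, link the customer's copy there.
    If it already exists and dedupe is True, replace the customer's copy
    with a hard link to the master copy. Returns (linked, replaced).
    """
    linked = replaced = 0
    for dirpath, _dirs, files in os.walk(os.path.join(customer_root, subdir)):
        rel_dir = os.path.relpath(dirpath, customer_root)
        dst_dir = os.path.join(master_root, rel_dir)
        os.makedirs(dst_dir, exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            if not os.path.exists(dst):
                os.link(src, dst)  # step 3: first time this file is seen
                linked += 1
            elif dedupe and not os.path.samefile(src, dst):
                # Step 4: link to a temporary name first, then rename over
                # the customer's copy so the path never disappears.
                tmp = src + ".linktmp"
                os.link(dst, tmp)
                os.replace(tmp, src)
                replaced += 1
    return linked, replaced
```

It would be called once per customer, for example `link_into_master("/backups/customer1", "/backups/master", "chunks", dedupe=True)` and again with `"snapshots"` and `dedupe=False`. Snapshot IDs would need to be unique across customers for the snapshots part to make sense.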

My description is probably quite bad, but I believe the above would be a way to exploit the content based filenames of chunks in each storage to deduplicate across many untrusted users.

This would also not work for encrypted storage unless all customer storage was encrypted with the same password or key.

Possibly not a huge issue if access to each storage was handled securely via separate users, SSH keys and file permissions.

Your hard-link idea may achieve a pseudo version of your initial post - to keep each snapshot separate and secure from other users and prying eyes, but it seems a bit convoluted imo. You could probably achieve the exact same result by simply locking down the permissions on each snapshot directory.

But then you wouldn’t be able to run a check or prune without root / master access, though each user would probably at least be able to run prune -collect-only.

I too have been thinking about the problem, especially with respect to the new RSA feature. IMHO, the current implementation doesn’t have any worthwhile use-case, as you’re not supposed to mix backups with different keys. The chunks themselves are encrypted with these different keys.

Duplicacy is already set up with the perfect methodology to achieve de-duplicated encryption aka convergent encryption. Whereas the new RSA feature encrypts chunks with different keys and leaves the metadata/snapshots encrypted with the same key, imo this needs to be the opposite way around! (And personally, I’d use another password layer instead of a key-pair that could get lost.) Yes it means pruning and checking becomes more difficult, but also not impossible. Though the ideal goal would be to allow storage maintenance without a ‘master account’, that would need a lot of work.
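For anyone unfamiliar with the term, here is a toy sketch of convergent encryption in Python. It is not how Duplicacy encrypts chunks, and the XOR keystream below is only a standard-library stand-in for a real cipher such as AES, so it must not be used for actual data. All function names are made up for the example. The point is only that the chunk key is derived from the chunk content, so two users who back up the same data produce the same ciphertext and the same chunk ID, while the list of (chunk ID, key) pairs would live in each user's own snapshot, protected by that user's own password.

```python
import hashlib
import hmac


def convergent_key(plaintext: bytes) -> bytes:
    # The key depends only on the content, never on the user.
    return hashlib.sha256(plaintext).digest()


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: HMAC-SHA256 over a counter. Illustration only.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def toy_crypt(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; the same call encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def store_chunk(plaintext: bytes):
    key = convergent_key(plaintext)
    ciphertext = toy_crypt(key, plaintext)
    chunk_id = hashlib.sha256(ciphertext).hexdigest()
    return chunk_id, key, ciphertext


# Two different users backing up the same chunk.
id_a, key_a, ct_a = store_chunk(b"identical file contents")
id_b, key_b, ct_b = store_chunk(b"identical file contents")
```

Here `id_a == id_b` and `ct_a == ct_b`, so the storage keeps one copy, yet a user who never had the plaintext has no key for it. The usual caveat of convergent encryption still applies: anyone who can guess a chunk's content can confirm whether it is in the storage.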

This has been discussed before: Privacy for multiple users backing up to shared repo · Issue #416 · gilbertchen/duplicacy · GitHub


Oh interesting. I keep forgetting there are more technical posts on GitHub. :slight_smile: As I suspected, and as your second-to-last post at that link confirms, we need a complete list of chunk IDs (chunk filenames) for all the snapshots in order to prune and check properly. And to ensure privacy, we don’t want chunk hashes.

Definitely convoluted and, based on Issue #416 on GitHub linked by @gchen, totally unnecessary if/when per-repository fileKey support is implemented.