Recommended S3 bucket settings for Storj? One bucket or multiple?

I see that we can now set up Storj in Duplicacy by selecting the Amazon S3 type in the web-ui. Is it recommended to create a single bucket and use the same Access ID and Key to back up all machines into that single bucket, or to create one bucket for each machine? More importantly, if I back up to multiple buckets, will the deduplication work as effectively as if I were backing up to a single bucket? Storj allows me to set up multiple projects and then multiple buckets within each project.

Example 1:
Project Name > Bucket Name (computer 1)
Project Name > Bucket Name (computer 2)
Project Name > Bucket Name (computer 3)
Project Name > Bucket Name (computer 4)

Example 2:
Project Name 1 > Bucket Name > Folder Name (computer 1)
Project Name 1 > Bucket Name > Folder Name (computer 2)
Project Name 2 > Bucket Name > Folder Name (computer 1)
Project Name 2 > Bucket Name > Folder Name (computer 2)

Logic tells me that for the best dedup I should use a single project with a single bucket, and then use multiple folders within the bucket, named after each computer I’m backing up. But maybe that’s not how Duplicacy works, so I wanted confirmation.

Anybody? I would like to get started on this soon but would like clear direction before moving forward.

It depends.

Are all machines trusted/belong to you/any other security/privacy concerns? Do you need cross-machine deduplication?

Deduplication will then work within each Duplicacy storage target, but not across them.

Note that with Storj you can have multiple files in the same bucket encrypted with different credentials, so you can technically end up with multiple Duplicacy repositories in the same bucket that are completely independent and do not interact with each other. This is Storj-specific, just be aware of it. If you want a single shared repository, use the same passphrase.

If by “folders” you mean “snapshot IDs”, then yes. You can then generate different S3 credentials (but with the same encryption passphrase) per machine.

But this also means that each computer will have the encryption key to your backup. If you want to be able to back up from multiple machines while preserving security, by not allowing any one machine to restore everything, including data from other machines, consider enabling RSA encryption in Duplicacy.
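A rough sketch of what that can look like with the CLI, if I remember the flags right; the key paths, backup ID and storage URL below are placeholders:

```
# Generate an RSA key pair (keep private.pem off the backup machines)
openssl genrsa -aes256 -out private.pem 2048
openssl rsa -in private.pem -pubout -out public.pem

# Initialize the encrypted storage with only the public key;
# backups run without the private key
duplicacy init -e -key public.pem computer-1 s3://us-east-1@gateway.storjshare.io/my-bucket/duplicacy

# Restoring requires the private key
duplicacy restore -r 1 -key private.pem
```

With this setup a stolen laptop can keep backing up, but cannot decrypt and restore the data of the other machines.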

This is highly counterproductive. It is a user forum, where folks volunteer to help each other. If you want expedited support, there is a commercial license that comes with dedicated support from the developer: Duplicacy

Thanks for the reply. All machines in the backup plan belong to me, so no security concerns there. Since we are backing up to S3 storage and paying by the amount of data used, cross-machine deduplication would be preferred to reduce the overall size.

By folders, I actually meant the Directory (see image below). I was asking if the deduplication is affected by defining a different directory on each PC I am backing up. Or would it be more dedup-friendly to leave the Directory path empty and just define a unique Backup ID for each PC that will be backed up by Duplicacy?

It would certainly be easier to view the dashboard to determine how much space each PC is using if I give each PC a different storage name as well. Then the duplicacy dashboard would show exactly how much data I have in each storage. Like below:

[screenshot]

But I would presume that it would be necessary to define a different directory for each, and that this would decrease the deduplication? If the difference would be minimal, I would like to separate them.

I am also unclear if the storage password in Duplicacy should match the Passphrase in Storj (see second screenshot).

My humble apologies for pushing for an answer. I should have been more patient. I just thought maybe it got overlooked.

That is a prefix for the S3 path where Duplicacy will initialize the storage. If it’s different, there will be different storages, and there will be no deduplication across them.

Yes, to have cross-machine deduplication all machines must back up to the same storage (same URL, same bucket, same prefix/directory).
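For illustration, that layout looks something like this with the CLI; the bucket name, region and directory are placeholders, and only the backup ID changes per machine:

```
# Computer 1: encrypted storage, shared bucket and prefix, unique backup ID
duplicacy init -e computer-1 s3://us-east-1@gateway.storjshare.io/my-bucket/duplicacy

# Computer 2: same URL, bucket and prefix, so chunks are shared and deduplicated across machines
duplicacy init -e computer-2 s3://us-east-1@gateway.storjshare.io/my-bucket/duplicacy
```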

Correct.

Nope.

The storage password in Duplicacy protects the encryption keys stored in the config file, which are used to encrypt your data chunks, regardless of where your Duplicacy datastore is stored.

Later you may decide to move your Duplicacy datastore to, e.g., Amazon AWS or a local NAS and continue backing up.

The Storj passphrase is different and is specific to Storj. Anything you store on Storj is end-to-end encrypted; there is no way to save data unencrypted on their network, so you need to provide encryption keys when you send the data. When you use the S3 integration, that encryption passphrase is stored on their S3 gateway and protected by the S3 secret.

In this case your data is effectively encrypted twice: once by Duplicacy, and again by Storj.
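To make the distinction concrete, here is a sketch of how the two secrets show up when running the CLI unattended, assuming I have the environment variable names right (check the docs; values are placeholders):

```
# Duplicacy storage password: protects the keys in Duplicacy's config file
export DUPLICACY_PASSWORD='my-duplicacy-storage-password'

# Storj S3 gateway credentials: the Storj passphrase sits behind these on the gateway
export DUPLICACY_S3_ID='my-gateway-access-key'
export DUPLICACY_S3_SECRET='my-gateway-secret-key'

duplicacy backup -stats
```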

It’s tempting to not enable encryption in Duplicacy to save some resources, but I would keep it, specifically in case you decide to migrate away from Storj in the future. Just keep your options open.

Funny, I just rebuilt my B2 buckets for some of these reasons. I moved from 1 bucket to about 8… each of my shared folders basically gets a bucket. It’s faster to do individual backups, faster to search the buckets, easier to manage AND if there’s corruption someplace, I don’t have to remove the entire bucket and start over, just the bucket that had issues. A bit more work to get it going, but definitely worth it for me.

You don’t need to remove the entire bucket nor start over in either case.

But if your storage provider corrupts data — the solution is not to dance around it; the solution is to switch to another provider, like, yesterday.

So if a backup does become corrupted, what are the correct steps to resolve it? Coming from Duplicati, where backups were regularly failing with corrupt chunks, this concerns me as well. If I’m pointing all machines to the same bucket, it would be disappointing to have to remove a few terabytes of data by deleting an entire bucket when only one machine is at fault.

Duplicati is horribly unstable crap. I’m not surprised it was getting corrupted every time.

If your storage provides consistency guarantees, the backup won’t get corrupted. Duplicacy uses CAS (content-addressable storage), where each chunk is immutable and uniquely identifiable by the ID encoded in its filename. There is no indexing database that could get corrupted either; everything is in the filesystem.
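Roughly the idea, as a hypothetical illustration rather than Duplicacy’s exact hashing or directory scheme: the chunk’s name is derived from its content, so the same content always maps to the same object, and a truncated or altered chunk is detectable by re-hashing it.

```
# Hash the chunk content; the digest becomes the filename
$ sha256sum chunk.bin
<sha256-digest>  chunk.bin

# Stored under a fan-out directory named after the digest, e.g.:
#   chunks/<first-two-hex-chars>/<rest-of-digest>
# Re-hashing on download verifies the chunk came back intact.
```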

If your storage is not reliable, you can enable erasure coding in Duplicacy to mitigate it somewhat, or better yet, change storage.
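For reference, erasure coding is chosen when the storage is created; a sketch assuming 5 data shards plus 2 parity shards per chunk (the ratio, ID and URL are placeholders):

```
# Each chunk gets 2 parity shards on top of 5 data shards,
# so a partially damaged chunk can still be reconstructed
duplicacy init -e -erasure-coding 5:2 computer-1 s3://us-east-1@gateway.storjshare.io/my-bucket/duplicacy
```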

The source of inconsistencies with Duplicacy is when the storage lies about what was stored (e.g. keeps a truncated file) or when Duplicacy is interrupted while deleting data, leaving behind partial ghost snapshots.

There are ways to repair the corrupted chunks and there are ways to purge the affected snapshots if desired.
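As a rough starting point with the CLI (the exact recovery steps depend on what is actually broken, so check the how-to guides before deleting anything; the ID and revision below are placeholders):

```
# Verify that every chunk referenced by the snapshots exists and downloads intact
duplicacy check -chunks

# If a specific revision references bad chunks, remove just that revision,
# leaving the rest of the backup untouched
duplicacy prune -id computer-1 -r 42
```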

Either way, there is never a need to start over or lose the entire backup. The only exception is the config file. If the config file is corrupted, the backup is dead, as it contains the encryption keys. This file is written once and only read afterwards. So, if your storage is reliable, you are good.

Thanks for the confirmation. I’m hoping Storj will be reliable. I never had a problem with Backblaze B2.

Yes, Storj mathematically cannot return bad data: it’s end-to-end encrypted and integrity-checked, so you either get your data back intact or nothing, never corrupted data.

But do read the other threads on Storj here; you might have performance issues if your network equipment is not up to the task.