Best practice for B2 bucket usage

ebenzacar · 29 March 2022 13:00

I’ve been reading around about RSA encryption and the ability to back up several different sources to the same bucket securely without needing to share the decryption key.

Does this suggest, then, that a best Duplicacy practice would be to put as much data as possible into a single B2 bucket, to leverage as much de-dup as possible, assuming that no individual source needs to have the decryption key?

In my use case, I have a myriad of different data that I am looking to back up: data files, media, photos, etc from a bunch of different sources under my control.

I would expect that enabling RSA encryption on each machine backing up to the same B2 bucket would allow me to safely and securely back everything up to the same location, even though my instinct would be to create different buckets for each machine or each type of dataset (and their retention requirements).

It would seem also like an anti-pattern to push as much data as possible into a single bucket.

Thanks for any suggestions/advice.

Eric

towerbr · 29 March 2022 16:28

If you search here on the forum you will find users who prefer to put everything in one - or a few - buckets and others who prefer to separate files by origin (files, photos, databases, etc) in several buckets. I belong to this second group. I prefer more granular management, which even makes it easier to create different maintenance schedules and practices if necessary.

Anyway, remember what I explained here about deduplication: it works better for similar files or versions of the same set of files. So there is no point putting all types of files in the same bucket, as deduplication won’t necessarily work better if the files are of a very different nature.
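
To illustrate the deduplication point, here is a minimal sketch. It assumes fixed-size chunks identified by their SHA-256 hash, which is a simplification for illustration and not a description of how duplicacy actually splits files. `chunk_ids`, `stored_chunks`, and the sample byte strings are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, for illustration only


def chunk_ids(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's SHA-256 digest."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def stored_chunks(*files: bytes) -> int:
    """Count the unique chunks a shared store would hold for these files."""
    unique = set()
    for data in files:
        unique.update(chunk_ids(data))
    return len(unique)


v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCCCEEEE"         # a later version of v1: only the last chunk differs
unrelated = b"WWWWXXXXYYYYZZZZ"  # a file of a completely different nature

print(stored_chunks(v1, v2))         # 5: three chunks are shared
print(stored_chunks(v1, unrelated))  # 8: nothing is shared
```

Two versions of the same file share most of their chunks, while unrelated files share none, so putting them in the same storage saves nothing.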

saspus · 29 March 2022 20:17

A few points.

A counterexample @towerbr referred to is me: I prefer dumping everything into one bucket. In the same vein, I don’t partition drives, don’t create extra volumes, and more often than not don’t even separate data into sub-folders. I personally don’t think the filesystem is a better place for metadata than anything else, including Spotlight and/or other indexing solutions, so that’s what I use. It’s also faster to get to the right data that way.
With B2 specifically, due to a “happy coincidence” in how duplicacy works and what the B2 API allows, you can generate keys with a carefully picked subset of permissions that can back up but not delete or modify (duplicacy files are immutable anyway). This removes the concern that a compromised client machine can destroy the B2 bucket contents. Another option is, of course, cloud-side snapshots, but this is less granular and can still result in loss of data accumulated between snapshots.
For media, photos, and other inherently immutable, incompressible, non-deduplicatable files, duplicacy provides no benefits, only overhead and RAM usage. For this type of data you could simply rclone it to a destination configured without the modify/delete permissions, to prevent new versions of those files (which are always due to corruption) from overwriting your data.
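
As a rough sketch of the restricted-key idea, the snippet below checks a key's capability list for anything that could remove or reconfigure data. The capability names follow B2's application-key naming, but the `backup_only` set is an assumption for illustration; whether it is sufficient for a given duplicacy version is not confirmed here. `destructive_capabilities` is a hypothetical helper, not part of any B2 tool.

```python
# Capabilities that would let a client delete data or change bucket/key settings.
DESTRUCTIVE = {"deleteFiles", "deleteBuckets", "writeBuckets", "deleteKeys", "writeKeys"}


def destructive_capabilities(capabilities):
    """Return the sorted subset of capabilities that could destroy or reconfigure data."""
    return sorted(set(capabilities) & DESTRUCTIVE)


# Assumed minimal set for a backup-only client (an assumption, see above).
backup_only = ["listBuckets", "listFiles", "readFiles", "writeFiles"]
full_access = backup_only + ["deleteFiles"]

print(destructive_capabilities(backup_only))  # []
print(destructive_capabilities(full_access))  # ['deleteFiles']
```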

ebenzacar · 29 March 2022 21:06

Sounds like the “Google” way. Just use indexing and search to find everything. I guess I’m more old school, where I still like the old style way of actually organizing data in files and folders, etc. And whereas Search works great to find a specific match, it doesn’t work as well to show views of different collections or how relationships are organized (ex: repair_bills/car/model/year/etc).

I would still have thought or expected duplicacy to be able to dedup things like music, pictures and video. Even though they are somewhat incompressible, I would have thought that blocks of these file types could still potentially be duplicated in other files. Is that unlikely to happen?

Thanks,

Eric

saspus · 29 March 2022 22:00

You’d be surprised how well it works! At least on macOS with Finder; I’m not sure what the state of affairs is on Windows.

Here is my thinking process:

[quote=“ebenzacar, post:4, topic:6200”]
or how relationships are organized (ex: repair_bills/car/model/year/etc).
[/quote]

You see, this is just one specific way of organizing relationships. Is it special in any way? Why organize by car first and year next, and not by, say, all repairs in a given year? Or by failure type? Or mechanic name?

Singling out one specific permutation of attributes seems wrong, more so when the data from files leaks into the filesystem structures (in the form of a specific hierarchy).

You might fix it by including all that information in the file name instead of in folders. This would avoid singling out one specific way of sorting, as it is not special in any way. Then filtering the folder by, say, “Honda” and “2015” will show you the list of receipts for your 2015 Honda.
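
A minimal sketch of that kind of filtering, assuming a flat folder where each file name carries the attributes. The file names and the `filter_names` helper are made up for illustration.

```python
def filter_names(names, *terms):
    """Return the names that contain every search term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return [n for n in names if all(t in n.lower() for t in lowered)]


receipts = [
    "2015 Honda brake repair.pdf",
    "2015 Toyota oil change.pdf",
    "2017 Honda tire rotation.pdf",
]

# No attribute is privileged: any combination of terms gives a valid "view".
print(filter_names(receipts, "Honda", "2015"))  # ['2015 Honda brake repair.pdf']
print(filter_names(receipts, "2015"))
```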

But what about other information that is not in the title? Mechanic name? Shop address? Eventually, you would want to include all data; at the extreme, the entire contents of the file become a reference to the data itself. Indeed, why only filter by filename (which is also an artificial construct) when you can filter by the whole contents?

Then you can supplement it with Tags (which is something I could not make myself use effectively, probably also too old school :)). I do, however, use Smart Folders in my Mail client: all mail is in Inbox and I have over 50 Smart Folders that present various slices of the inbox. Actually, not unlike Gmail.

Fun fact: B2 and other bucket services don’t have folders. It’s an illusion. They have object names that contain / in the name, and it’s your client (and web) software that simulates the existence of folders.
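
Here is a small sketch of how a client can simulate folders over a flat list of object names by grouping on a prefix and a delimiter. The object names and the `list_folder` helper are hypothetical; this is not B2's actual API.

```python
def list_folder(object_names, prefix="", delimiter="/"):
    """Return (folders, files) that appear directly under prefix in a flat namespace."""
    folders, files = set(), []
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so present it as a "folder".
            folders.add(prefix + head + delimiter)
        else:
            files.append(name)
    return sorted(folders), sorted(files)


# A flat list of object names, as a bucket would store them.
objects = [
    "config",
    "chunks/00/ab12",
    "chunks/00/cd34",
    "chunks/ff/9e01",
    "snapshots/laptop/1",
]

print(list_folder(objects))             # (['chunks/', 'snapshots/'], ['config'])
print(list_folder(objects, "chunks/"))  # (['chunks/00/', 'chunks/ff/'], [])
```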

I’ve replied to your original thread before seeing your reply here, sorry about that.

blablablaman · 10 April 2022 15:15

Hi saspus, you have one filesystem/bucket for everything except media/photos? Is this correct? Then you’d have to maintain 2 separate folders, wouldn’t you? One for duplicacy and another for a different backup tool (say rclone). Is this correct?