Minio vs Local Storage

Is there any advantage of using minio over local storage or local storage over minio?

I have a server that I want to copy to external storage that also has Minio.

There’s literally no difference to me since it’s all local. I was just wondering if one was better behind the scenes for some reason, or if one would run faster than the other. I’m not limited on CPU, memory, or other resources either, besides maybe storage.


If they’re both on the same machine, I think using local storage should be more efficient, simply because there are no extra layers of transport and data processing that your backups have to pass through (e.g. all the processing that the Minio stack has to do).

The only advantage of Minio on localhost I can think of is bit-rot detection, if you can spare more than one drive…
Duplicacy’s check is pretty slow, so another reliability layer can be useful.

But if you plan to use Duplicacy copy, then it’s probably overkill.
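For context, the “slow check” I mean is the storage-side verification; roughly something like this (a sketch, the storage name is hypothetical):

# quick check: only verifies that every chunk referenced by the snapshots exists
duplicacy check -all -storage local
# thorough (and much slower) check: downloads the data and verifies file contents
duplicacy check -files -storage local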

That’s hard to say without actually measuring performance. Minio may act as a caching layer between Duplicacy and the filesystem and improve performance.

I guess it all depends on what “local” means. USB drive? DAS? NAS? The bit-rot detection may also be helpful, but some filesystems do it on their own, e.g. Btrfs or ZFS. How much memory is available to allocate to Minio caches vs. the filesystem cache? What is the storage performance for the Duplicacy workload vs. the Minio backend workload? Etc., etc.

All else being equal though, the fewer moving parts there are, the better from a reliability perspective. And after all, this is a backup tool. It should run at low priority in the background, the slower the better, so as not to rob resources from other tasks, and therefore performance is really irrelevant. I would go with SFTP to a bit-rot-aware server. This would be the most barebones solution, maximally separating responsibilities.

If you only have one form of storage, anything you can do to prevent it from failing will increase the probability of being able to successfully recover data from it. If Minio or a rot-correcting filesystem will do the job, use it. I’m running my local storage as files on an ext4 filesystem, which can rot, but I have secondary cloud storage that is checked and corrected at least once every 90 days.

I debated using Minio because I have multiple machines (some offsite) and ultimately decided that SFTP would work just fine. As a bonus, there was one less hole poked in the firewall and one less service that needed securing.

I love all of the well-thought-out answers. This backup is in addition to my backup to GCP. I was using Bvckup 2 to back up to a couple of external HDDs to have a second backup in case anything ever went wrong with Duplicacy/GCP/encryption codes. These ran out of space and I didn’t want to split it across yet a third drive.

At first I was just going to use Duplicacy with some external hard drives and split the backup, hoping the deduplication would hold me over for a while. Then I decided to resurrect an old NAS I have and RAID 0 a couple of drives together so I don’t have to split into multiple backup sets. Then, since the NAS was too slow (half the speed of GCP or less, comparing to the old initial GCP logs… hard to tell), I bought an old 12-bay server, a second RAID card, some 8088 external connectors, etc., all cheap on eBay, and I’m going to set up a DAS.

If I add it directly, I agree that anything additional might be unnecessary overhead. I think I’m actually going to connect it to a virtual backup server on the same machine though, and since there’s a lot of overhead on Samba/Windows file shares, I will probably use Minio (usually, if you ask whether one file exists, the entire directory listing gets downloaded each time you check a file; this happens transparently, but it happens).

I also like the suggestion that using Minio will provide bit-rot detection. Another reason to use it. Duplicacy check is dreadfully slow in my experience as well… understandably so, but still. I wouldn’t have even considered this without your replies, so THANK YOU!!!

My last consideration is that I have most of my servers, on and offsite, back up to my main server storage, which then gets backed up two more times. If I can skip my main storage, create a second encrypted bucket for Minio, and replicate that to GCP directly from Minio as the site claims, it might save me some storage space and allow me to keep more redundant snapshots.
“In addition, you may configure Minio server to continuously mirror data between Minio and any Amazon S3 compatible server.”
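For reference, that mirroring is done with Minio’s own client (mc), not Duplicacy; a rough sketch, where the alias names, keys and buckets are placeholders and GCP is reached through its S3-compatible endpoint (older mc releases use mc config host add instead of mc alias set):

mc alias set local http://localhost:9000 MINIO_ACCESS_KEY MINIO_SECRET_KEY
mc alias set gcs https://storage.googleapis.com GCS_HMAC_KEY GCS_HMAC_SECRET
# continuously mirror the encrypted bucket to the remote S3-compatible target
mc mirror --watch local/duplicacy gcs/duplicacy-offsite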

Again, thank you all for your thoughtful responses.


For what it’s worth, I’ve installed Minio on the same server that hosts my Duplicacy backups over SFTP, which happens to be an Intel Atom C3538 based machine with 16GB of ECC memory. I started to copy the storage from SFTP to Minio, and that would have taken three days at the rate it was going, 30MB/sec on average.

So I aborted and ran duplicacy benchmark on each storage instead, three times in a row each, recording results from the last run and watching CPU utilization on the server.

Minio: CPU utilization around 15% by Minio

alexmbp:~ alex$ duplicacy benchmark --storage minio
Storage set to minio://us-east-1@tuchka.home.saspus.com:9000/duplicacy
Generating 244.14M byte random data in memory
Writing random data to local disk
Wrote 244.14M bytes in 0.41s: 600.77M/s
Reading the random data from local disk
Read 244.14M bytes in 0.04s: 5772.33M/s
Split 244.14M bytes into 50 chunks without compression/encryption in 1.52s: 160.82M/s
Split 244.14M bytes into 50 chunks with compression but without encryption in 2.04s: 119.65M/s
Split 244.14M bytes into 50 chunks with compression and encryption in 2.07s: 117.73M/s
Generating 64 chunks
Uploaded 256.00M bytes in 8.18s: 31.28M/s
Downloaded 256.00M bytes in 3.13s: 81.91M/s
Deleted 64 temporary files from the storage

SFTP: CPU utilization around 4% combined by two sshd processes

alexmbp:~ alex$ duplicacy benchmark --storage tuchka
Storage set to sftp://alex@tuchka.home.saspus.com//Backups/duplicacy
Generating 244.14M byte random data in memory
Writing random data to local disk
Wrote 244.14M bytes in 0.45s: 544.40M/s
Reading the random data from local disk
Read 244.14M bytes in 0.05s: 4940.23M/s
Split 244.14M bytes into 51 chunks without compression/encryption in 1.51s: 161.56M/s
Split 244.14M bytes into 51 chunks with compression but without encryption in 1.97s: 124.07M/s
Split 244.14M bytes into 51 chunks with compression and encryption in 2.08s: 117.57M/s
Generating 64 chunks
Uploaded 256.00M bytes in 2.95s: 86.77M/s
Downloaded 256.00M bytes in 3.35s: 76.31M/s
Deleted 64 temporary files from the storage

Why writes to Minio are 2.5 times slower I’m not sure. Perhaps there is some tweaking to be done, but for this specific use case SFTP seems to be superior, and the caching and optimization that Minio could have provided did not materialize with the default configuration. So I nuked the whole thing and will continue to use SFTP.


Thank you for your comparison results. I will try this as well when I get mine set up. I won’t be able to work on it for another week or so. I will post my results once I have them. I’ll see if I can try it with different numbers of upload threads too (usually 8) to see if that makes a difference.
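(If I read the benchmark options right, the thread counts can be passed straight to the benchmark command; worth confirming against duplicacy benchmark -h, since I’m quoting the flags from memory:)

duplicacy benchmark -storage minio -upload-threads 8 -download-threads 8
duplicacy benchmark -storage tuchka -upload-threads 8 -download-threads 8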

Are there other benefits from using S3/Minio over SFTP?

As stated here: Backup using ARQ on Minio, it seems there are :slight_smile:

over SFTP like atomic writes of files (faster and less error checking required by Arq),

Checking file size is not that hard and transferring files over SFTP reliably has been polished to death.

checksums of uploaded data (so Arq can verify the NAS received the correct data),

This is not the job of a backup tool. This needs to be ensured by the transport — SFTP in this case. Data is encrypted during transfer; corrupted data will fail to decrypt and will get retransmitted.

and much faster validation of data (comparing checksums instead of downloading data to compare).

Same. If the chunk is uploaded, it must be assumed to stay the same. It is not the job of a backup solution to validate it. The host filesystem must protect it from bit rot. All validation should do is verify that the chunks required to restore files are present.

And yet, adding Minio adds another layer of complexity that can fail. I would trust SFTP, which has existed for decades, much more, and that’s not mentioning the performance impact.

(Disclaimer: I may be biased because I wholeheartedly despise Arq based on 4 months of dealing with it, their support, and their idiotic articles)

TL;DR: The simplest solution is always best.

I didn’t forget about this… it just took me a long time to get to it due to some life changes. As mentioned by saspus, the extra overhead doesn’t seem worth going through Minio after using it for a little while in combination with some other software I had. The additional effort of configuring and maintaining another service properly, etc., wasn’t something I was looking forward to. One of the services I tried to use said Minio didn’t have the ability to store timestamps. The “advantages” of using it aren’t evident out of the box. I never even looked into how to enable bit-rot detection or tell whether it was enabled.

Bit rot in itself may be an overstated feature considering I have my drives in a hardware RAID. After some research, it seems to me that the only reason some software RAID-like systems have bit-rot detection is that the possibility of bit rot is introduced at a software layer. From what I’ve read, hardware controllers will properly detect, fix, and otherwise report bit-rot problems appropriately without additional software.

So, back to performance. What gives the best possible performance? The short answer is I don’t know. Based on my understanding of the available protocols, though, I’d say NFS if you’re on the same network, because it’s the fastest file-sharing protocol I know of, but this is a complete guess. I personally forgot that I wanted to put the drive in my virtual backup server and attached it directly to the machine to be backed up… so I’d say I have the most performance possible at the moment. This will also require the least amount of maintenance and monitoring since I don’t have to make sure the backup server stays up. I’ll continue to let that server be responsible for my other backups, but it all eventually propagates to my main storage anyway.


:+1:

No, this is a common misconception and a completely incorrect statement. The only way to recover from bit rot is with the involvement of higher-level software, e.g. some sort of checksumming filesystem that can tell which of the copies of the data is correct and which is corrupted.

Imagine RAID1. One of the sectors on one of the drives demagnetized and flipped one or more bits. Now disk 1 and disk 2 have different data. Which one is correct? The RAID controller cannot know – there is no parity in the data itself to decide. If one of the drives reported a read error instead – then yes, the other would be correct. But if a drive silently lies, which is exactly what happens as a result of bit rot, the RAID controller has to guess, and it will guess correctly 50% of the time. The other 50% it will corrupt your data, overwriting the last remaining good copy on the other drive with the corrupted one.

As for whether it is overstated… well, there are fifteen billion sectors on an 8TB drive. It’s guaranteed that something will happen to at least a few thousand of them over the few years of a drive’s useful life. Some of them will be detected and reported by the drive firmware, but the remaining few will come back and bite.
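You can sketch the problem with ordinary shell tools (paths made up): mirror a file, flip one byte in one copy, and notice that comparing the two copies only tells you they differ, while a checksum recorded earlier tells you which copy is still good.

cp disk1/file.bin disk2/file.bin            # “RAID1”: two identical copies
sha256sum disk1/file.bin > file.sha256      # the extra knowledge a checksumming layer keeps
printf 'X' | dd of=disk2/file.bin bs=1 seek=1000 count=1 conv=notrunc   # simulate silent bit rot on disk 2
cmp disk1/file.bin disk2/file.bin           # a plain mirror only learns that the copies differ
sha256sum -c file.sha256                    # the recorded hash confirms disk 1 still holds the good copy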

Here is a good video with an explanation and an experiment on the topic:

  1. Part 1: YouTube
  2. Part 2: YouTube

I’m not convinced that network performance matters in a backup tool. The backup tool itself must be performant enough to consume as few resources as possible so it can be throttled down and not impact actual work being done on the machine while still backing up in a reasonable amount of time; but the destination and network performance, at least at the level where we discuss NFS vs SMB vs SFTP, is rather unimportant in most scenarios.

So I totally agree that one must go with the simplest solution with the fewest moving parts and dependencies using tried and proven technologies. Duplicacy? Check. SFTP? Check.


Hmm. I suppose you’re right. The patrol reads and consistency checks of my RAID controllers surely don’t keep a separate hash of all of the data, so they can’t verify the data’s accuracy. At least I have that much of a check, though, and I won’t have a 50/50 guess of which data is correct, because I don’t have parity (striping only). It seems like software wouldn’t have much of a chance to catch this either, though, since the data it has to compare against is what was originally written to the disks, but I suppose it has more of a chance. It seems like a filesystem with checks in place could have a better chance, but ZFS, BTRFS, etc. are not an option for me. I’m using XFS on CentOS.

It would be interesting to have some sort of software to truly validate my data periodically rather than just check for consistency. I still can’t decide whether even that would be worth the overhead such software would entail, but if you know of any, or of any comprehensive information on what my RAID controller scans don’t cover, I’d love to check out the materials.

Not necessarily true. The main reason a hardware RAID controller wouldn’t be able to determine which drive in a mirrored array was correct after bit rot is that there’s no convenient place to store checksums. Software can do that.

SnapRAID? Granted, it’s designed to replace hardware RAID by combining separate physical drives, but there’s no reason you couldn’t combine the two. JBOD, or stripe every two drives for extra performance?
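To give an idea of what that involves, a SnapRAID setup is basically one small config file plus two scheduled commands; a minimal sketch with made-up mount points:

# /etc/snapraid.conf
parity  /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content
data d1 /mnt/disk1/
data d2 /mnt/disk2/

# run periodically (e.g. from cron):
snapraid sync     # compute parity and checksums for newly written files
snapraid scrub    # re-read part of the array and verify against the checksums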

With SnapRAID, or any software RAID or filesystem that supports hashing, the data can be intercepted and hashed as it goes in. The RAM in this case would be the computer’s RAM, so you might need ECC RAM, but that’s beside the point. These software filesystem/RAID options would have a better chance since they know the data before it gets to disk, but I was thinking of additional software to check for bit rot. That wouldn’t stand much of a chance unless the data actually changes on disk over time after having been written and then read back out by the software once, so it wouldn’t catch any errors it missed initially. It could possibly catch something, but I don’t even know of software that does this.

Minio could detect bit rot in the same way because it receives the data before it is written to disk and can hash it before writing. It would also have less of a chance of catching any more immediate corruption if files are added outside of the software.

Regardless, you’re right. Any level of checking would in fact possibly catch something. I’m almost more worried about how much data actually makes it in one piece when migrating it from server to server. I believe rsync can hash-compare it after it’s written. I just don’t know of any software that sits on top of the OS and filesystem and simply scans your hard drive and checks later whether the data is still the same. I don’t know how it could tell that the data didn’t simply change.
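For the server-to-server migration worry, one option is re-running rsync in checksum mode after the copy; with these standard flags it re-reads both sides and lists anything whose contents no longer match (paths are placeholders):

rsync -a /old-server/photos/ /new-server/photos/
# verification pass: -c compares checksums, -n is a dry run, -i lists mismatches (empty output = contents match)
rsync -rcni /old-server/photos/ /new-server/photos/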

I’ll have to look into SnapRAID with JBOD to sit on top of my existing drives, but I likely won’t implement it anytime soon if I have to reformat/re-init the disks.

I understand what you’re saying.

I have an HP MicroServer with ECC memory, though I’m not too bothered whether it’s ECC or not, because statistically the window of opportunity for bit rot in RAM - between when a file is written to disk and when SnapRAID runs its sync - is infinitesimal compared to when it’s doing everything else, including reading that data back later, which is often less important. After the sync, I know my data is reasonably safe (I have off-site backup with Duplicacy and rclone in addition to running periodic SnapRAID scrubs). I can at least repair if a hard drive fails or develops bad sectors/corruption.

But obviously, I’m mindful that copying data from my non-ECC PC to the server risks corrupting data. If I’m super paranoid about certain data, I’ll do manual hashing or even binary comparison (with a tool like Beyond Compare). Often, I’ll extract data directly from archive files, which will tell me if there’s a CRC error. And I can force SnapRAID to immediately scrub newly written data just in case.
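(If I recall the option correctly, that immediate scrub of fresh data is just:)

snapraid sync           # record checksums/parity for the newly copied files
snapraid scrub -p new   # verify only the blocks that have not been scrubbed yet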

Thing is, though, that window of opportunity is so narrow that it reduces the already-low statistical chance of a cosmic ray hitting that data to… statistically even less, almost nothing AFAIC. In all the time I’ve performed manual hashing over the last three decades or so, I don’t think I’ve encountered a checksum mismatch outside of a dodgy download or a known bad memory module (and boy have I had some bad memory modules in the past :slight_smile: ).

Ha! I agree. And actual hardware failures like that almost immediately become noticeable. At least we have a reasonable backup software solution to help us when things go wrong! (haha, wrapping it back to Duplicacy :stuck_out_tongue:). I used to think photographers who shot to two cards, etc. were paranoid, but I’ve had pictures from shoots corrupt all the time (just handfuls at a time most often, so you don’t notice until it’s almost too late). I can’t always say what causes it, but after all of this talk, I feel like Adobe should already have a mechanism in place to recheck the data it has imported when it’s done copying (I assume most of the time it happens during the copy to my desktop PC). Having multiple layers of backups is a life saver.

I have all of my computers back up to my regular data storage with zero obfuscation, so all of the files are plain, with versioning. Then that is backed up (now both locally and off-site).

I once imported my wedding photos to my desktop, copied them to my laptop, copied only the changes (stored in the Lightroom database) back to my desktop since the pictures were already there, deleted them off my laptop, and everything seemed fine. Later, when I tried to export the photos with their changes, several didn’t want to export. The data was somehow corrupted. The only working copy must have been on my laptop, which was deleted. It had been longer than the local data retention of my plain files, so the most local backup of my laptop photos was gone. The most local copy of my desktop photos was equally corrupted. I ended up pulling them off of my off-site Duplicacy backup of my laptop backup, since its retention policy is much longer (because deduplication, why not).

Bottom line… nothing beats backups… But I wonder if I should try something like SnapRAID as a fail-safe with just one JBOD :thinking: I also wonder if this would prevent me from being able to interact with the hardware RAID using the MegaRAID storage software, or if it would still be able to see the drives like normal.

The only software I managed to find that is able to do that is “Corz checksum”, a checksum tool for Windows: BLAKE2, SHA1 or MD5 hash a file, a folder, or a whole drive/volume, with an Explorer right-click.

It is probably the only software that can warn you if the hash of any of your files is different but the timestamps are the same.

I did thorough testing (by corrupting data manually) and it always detected it. I’ve been using it for over six months now and I can only recommend it.


Now that seems to be a simple no-nonsense tool for someone not diving into the software RAID world. Minimal maintenance using a cron job and a bash script. Not to mention you can use it to verify copies. Thanks!
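On the Linux side the same cron-driven idea can be sketched with stock tools (Corz itself is Windows/Explorer oriented); the paths and schedule here are made up:

# build or refresh the baseline after intentional changes
find /mnt/data -type f -print0 | xargs -0 sha256sum > /var/lib/hashes/baseline.sha256

# crontab entry: verify weekly; --quiet prints only files that FAIL the check
0 3 * * 0  sha256sum --quiet -c /var/lib/hashes/baseline.sha256 >> /var/log/hash-verify.log 2>&1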
