Well, you certainly make a strong case, and I hear you. I especially take the point about the integrity of the data.
The only point where I might come at it slightly differently is the value I place on the time it would take to switch to a new cloud provider in the event of failure. You are clearly a 1000x more advanced user than I am, and it’s probably nothing for you to set up new cloud storage if something goes awry with Storj. Whereas I would much prefer, after this period of experimentation and time spent, to set this up and not have to worry about it for a long time.
And in that light, I just think there are lots of reasons to believe that Google is a more reliable long-term provider than Storj. It’s beyond the scope of a technology discussion, but I also think that Storj’s ICO funding model is not irrelevant to gauging how long they will be around. Most stable companies do not unload all of their stock for cash, nor pay their suppliers in stock. I take your point that Storj will eventually have to buy the Storj token; it’s just that I think at that point the end might be near for their viability as a company.
But we agree: nothing is guaranteed about the future (certainly not these days) so we can only make choices in the present. I’m very intrigued by Storj, and also by the GCS autoclass options. Maybe I set up both and see how it goes…
That was never stock to begin with; company value has nothing to do with token value. It was always a utility token.
Agree with the rest, though: it will take a lot more to bring Google down than Storj, just based on size.
So perhaps dual-region Google or AWS is the way to go – lowest probability of needing to change things in the future. But then there is the cost of that risk reduction – paying 3x now for storage to save some time migrating in the future seems like too much money.
There is a subtle distinction between backup and archive. An archive is never expected to be needed, and almost certainly not in its entirety. The pricing reflects that – cheap to get data in, cheap to store, but very expensive to retrieve, with penalties for early deletion.
Google punishes more than Amazon, so given the choice I would go with Amazon: they also provide 100GB of free egress monthly, which should cover the occasional “oops, deleted the wrong file” use case. The problem is that because duplicacy does not support data thawing, you are limited to instant-retrieval classes, and this alone brings the cost to over $4/TB/month plus overhead, and still provides only a single geo location.
At this point storj becomes very attractive.
I think aversion to migrating in the future should not be a consideration: instead, use what’s best for your data today. In the case of Storj – if they jack up prices or disappear – I’d just migrate. Switching is not a big deal regardless of technical ability: if you can run duplicacy, you can migrate your data. Set up another destination, run duplicacy copy, done. This can even be done in the Web UI.
My recommendation would be to try a few targets for a few months and see what works best. It does not need to be the cheapest – I’d argue your time setting things up is not free and must be factored into everything. And if it turns out Amazon hot storage is the most user-friendly – maybe the extra cost is worth it. Especially if you have sub-1TB – cost is irrelevant; it’s all under $10/month anyway.
Somewhat relevant article on long term data preservation:
I have 2.5TB backed up to GCS and my monthly cost is ~$4 USD. I do not perform chunk checks on the uploaded data. I’m using the archive storage class, so if I ever need to restore from GCS it will be very expensive. The initial upload was a bit more expensive because it generated more “operations”. First invoice was for $35 USD.
I have the same 2.5TB backed up to iDrive also, and there the monthly bill is ~$9 USD. I check all uploaded chunks, and the initial upload did not result in any extra cost.
I started using GCS as an extra layer after reading about people not trusting iDrive (which is my main offsite backup target). However, after 2 years of completely trouble-free iDrive usage I think I will cancel the GCS backup.
I would not – specifically because you don’t know it’s trouble-free until you need to restore. I.e., if the data silently rotted, you won’t detect it until you try to restore your files.
This seems too high. What is the breakdown of charges? Is that API cost, early deletion penalties, or…?
For comparison, with another backup tool that supports Glacier Deep Archive, last month I paid $3.56; I have 3.0TB there in total, part of which is kept in hot storage (indexes).
Amazon charges $1/TB/month for storage, Google $1.20 – but that won’t explain an almost 3x difference.
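For the record, here’s the raw storage math at those quoted rates. (A quick sketch treating them as approximate list prices – real bills add per-operation charges, retrieval fees, and early-deletion penalties, which is likely where the difference hides.)

```python
# Storage-only cost sanity check using the archive-tier rates quoted above.
# These are approximations; actual invoices include API operations,
# retrieval fees, and early-deletion penalties.

def monthly_storage_cost(tb, rate_per_tb):
    """Storage-only cost in USD for `tb` terabytes at `rate_per_tb` USD/TB/month."""
    return tb * rate_per_tb

aws_deep_archive = monthly_storage_cost(3.0, 1.00)   # ~$3/month for 3.0 TB
gcs_archive      = monthly_storage_cost(2.5, 1.20)   # ~$3/month for 2.5 TB

print(f"AWS Deep Archive, 3.0 TB: ${aws_deep_archive:.2f}/month")
print(f"GCS Archive,      2.5 TB: ${gcs_archive:.2f}/month")
# Neither rate gets anywhere near $9/month for 2.5 TB, so the extra
# must come from per-operation charges or a hotter storage class.
```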
possibly you’re more familiar than i am and i’m misreading the charges, but afaik Autoclass doesn’t charge separately for retrieval or deletion from the ‘archive’ class. it’s all covered in the ‘management’ charge that you pay for the bucket ($0.0025 per 1000 objects per month). my usage is so low it’s hardly worth my time to work out if it would be cheaper to self-manage the classes.
Obviously, you should never hand-manage app internals. The app I’m using does it automatically — it keeps hot data such as metadata in hot storage and data blobs in archive.
Support for Glacier was requested from duplicacy a long time ago. Duplicacy already uses separate metadata chunks but piles them into the same “folder” as data chunks. Moving them to a separate folder is a first step that would allow backing up to Glacier tiers that automatically move things to archive. Implementing support for thawing at that point would not be difficult either, to support automatic restore.
I guess there is not much demand? I’m not sure. Seems like a no-brainer to me, especially since some of the competition does it.
yes, the management fee is not insignificant in relative terms. the ‘autoclass archive’ storage costs me 50p per month, and the ‘management’ fee is 25p per month. then there’s a few bits here and there for operations, and VAT. in absolute terms not worth worrying about, but if you had a lot of objects (as opposed to data size) it could be worth optimizing.
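Since the management fee scales with object count rather than data size, it’s easy to model when it starts to matter. (A sketch using the $0.0025 per 1,000 objects rate quoted above; the ~4MB average chunk size is an assumption based on duplicacy’s defaults.)

```python
# GCS Autoclass management fee: $0.0025 per 1,000 objects per month
# (rate as quoted in the thread; verify against current pricing).

def autoclass_management_fee(num_objects, rate_per_1000=0.0025):
    """Monthly Autoclass management fee in USD for a bucket with num_objects."""
    return num_objects / 1000 * rate_per_1000

# Duplicacy's default average chunk size is ~4MB, so object count grows
# with data size: 1 TB is very roughly 250,000 chunk objects.
fee = autoclass_management_fee(250_000)
print(f"~${fee:.2f}/month management fee for ~1 TB of 4MB chunks")
```

So for chunk-based backups the fee stays in the tens-of-cents range per TB; it only becomes worth optimizing with millions of small objects.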
The good thing about iDrive is that I can verify the blocks once a month for free (egress is not unlimited, but I think it is free up to 2x your storage pool size). So I know the data has not silently rotted.
iDrive is slightly cheaper than BB, so I don’t understand what makes you say $9 is high for 2.5TB. Yes, it is more expensive than GCS and Glacier and such, but on GCS/Glacier I will pay quite a lot for egress if I want to verify my uploaded chunks.
I’ve now got my duplicacy backup running to GCS. I underestimated the effects of offloading some media files (rclone copy to ODB) and duplicacy’s compression, so I’m actually looking at only about 50GB. I’ve set the backup to run twice a day.
Question: given that I’m trying to balance cost of using GCS with integrity of my data, how often should I run check and prune commands? Once a week? (more or less?). And should I simply never run check -chunks?
Interesting discussion, much of it seems to be centred around storage costs and cloud. Ultimately, I’ve found the ‘best storage backend’ is not cloud, but your own - hosted off-site.
Terramaster, for example, do some very nice and cheap NAS boxes. You can even run your own OS on 'em – Proxmox, TrueNAS, openmediavault, whatever – shuck a couple of cheap external drives for a ZFS mirror, and install Tailscale on there. Set it up in a family member’s or friend’s basement.
Buying vs renting.
When you factor in the long term costs (the lifespan of those drives), cloud is so much more expensive. Plus you have so much more flexibility and control, and you can ramp it up to 4+ bays, one on-site and one off - for 3-2-1 (for both you and friends/family).
Obviously it may depend on how much space you need, but when you get into the TBs range, DIY is a no-brainer IMO.
Buying used old enterprise hardware makes the upfront cost much lower than any commercial prosumer solution. One of my many TrueNAS servers runs on 2012-era hardware (motherboard, processor, and memory were $85 from a local recycler), the 12-bay server case was $70 (with an 80 Plus Titanium power supply), and disks are cheap on the used market from recyclers like go-hard-drive and serverparts. Cost is under $10/TB. Then you get a few used enterprise PCIe SSDs with PLP for a ZFS special device and you get a blazing-fast storage server…
There is indeed an ongoing cost, the most significant of which is electricity (average power consumption with 8 drives is now 110W), but there are ways to offset it.
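The 110W figure makes the running cost easy to estimate. (A back-of-envelope sketch; the $0.15/kWh tariff and the 40TB raw capacity for an 8-drive box are illustrative assumptions – substitute your own numbers.)

```python
# Back-of-envelope DIY NAS running cost. The 110W draw is from the post
# above; the electricity tariff and drive sizes are assumptions.

def monthly_electricity_cost(watts, usd_per_kwh=0.15, hours=24 * 30):
    """Monthly electricity cost in USD for a constant draw of `watts`."""
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh

cost = monthly_electricity_cost(110)   # 79.2 kWh/month -> ~$11.88
per_tb = cost / 40                     # assuming e.g. 8 x 5TB = 40TB raw
print(f"~${cost:.2f}/month electricity, ~${per_tb:.2f}/TB/month at 40TB raw")
```

That lands in the same ballpark as archive-tier cloud storage per TB, which is why the comparison comes down to capacity, control, and how you value your own time.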
check -chunks is intended to verify that each chunk was uploaded correctly. It checks every chunk only once. If you don’t trust the API — you can run it after every backup, but I would not. The API guarantees that a chunk is either uploaded in full or deleted: the size is sent along with the upload request.
Of course, there could be bugs. For example, a few years ago OneDrive would silently keep partial chunks. (But then, using OneDrive for backup is not the best idea to begin with.)
You can run check as many times as you want — it operates on metadata. It should not be expensive, and may even fit into the free allowance. Its goal is mostly to verify that prune did not malfunction.
But here is a thought: let’s say check found an issue. What are you going to do about it? Or what can you do about it? Literally nothing. So why bother paying for extra api calls?
Prune is up to you — if you back up once a day, “never” may be appropriate; if your data changes frequently, pruning more often may save some space.
That’s the problem. You should not need to worry about data rot with providers that guarantee data integrity. It’s not the job of the application to do that; it’s a property of the underlying storage. And the same question: let’s say you discovered one chunk rotted. What are you going to do about it?
It was in a different context, comparing Google Archive with Glacier deep archive.
You can configure duplicacy with erasure coding to alleviate some of the worries about data rot. But I would instead pick providers that don’t allow data to rot.
Understandable, but even then you can start small scale, with old drives or even a Raspberry Pi. Inevitably, with historic snapshots, your needs may only increase over time.
This is completely nonsensical?
Of course you can do something about it! You can fix bad chunks quite trivially – by using one of the other two of your three copies. That’s the point of having multiple copies! You don’t blithely ignore the possibility of a hidden failure, for the same reason you run scrubs on ZFS. You want to know ASAP whether the backup is good or toast, at all times – not at the last moment, when you need to do a restore.
We’ve been through this time and time again: you’re not checking bits of data, you’re testing backups – the whole process from backup to restore, not just bits on storage. Anything can go wrong in between, including a bug in the software or something you haven’t foreseen. Trust, but verify.
A further advantage of hosting your own storage is that you can run all the check -chunks and test restores you want.
This is not black and white, and short of fully restoring every version of every backup every day, you cannot know that your backup hasn’t rotted. And even then, you will only know it was good just before the test – it may have rotted a millisecond after. Obviously, this is not feasible.
You can, and should, selectively restore a few files, just to make sure you have all the pieces and processes in place to accomplish that in catastrophic-failure scenarios; but this won’t help you deal with storage that is allowed to rot.
Fixing a backup that rotted means restoring the full backup revision from your other backup destination, and spending time. It only makes sense to do if restore is cheap and your time is worthless. And there is nothing “trivial” about it; it requires understanding duplicacy internals, and there is no supported way to heal the datastore. (Even then success is not guaranteed, even if you use the same version of duplicacy that created the backup ages ago; and copying between storages comes with its own can of worms.)
You have to draw a line somewhere in what you trust. You cannot trust media, because it always rots. That’s why filesystems exist that fix that issue. That’s why there are companies that keep your data intact in exchange for money. Now you have “media” that cannot rot — self-healing filesystems, erasure coding, and all that jazz. You should probably also trust your operating system, network drivers, storage, and RAM on the machine where you test your backups. I extend that trust to Amazon AWS and duplicacy. Otherwise I would not have used those solutions.
Exactly. But then either you verify everything all the time, which will be ridiculously expensive even with hot storage (for which you would be overpaying too), or you trust the storage and software, do sanity checks by restoring small subsets of data, and save a massive amount of money by not needing egress and by using archival storage tiers.
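To put a number on “ridiculously expensive”: downloading the full dataset pays egress on every byte, every time. (A sketch using ~$0.09/GB, a typical hot-storage egress list price – an assumption; actual rates vary by provider and tier.)

```python
# Cost of full-download verification. The ~$0.09/GB egress rate is an
# illustrative hot-storage list price, not a quote for any one provider.

def full_verify_egress_cost(dataset_gb, usd_per_gb=0.09, times_per_year=12):
    """Annual egress cost in USD of downloading dataset_gb, times_per_year times."""
    return dataset_gb * usd_per_gb * times_per_year

annual = full_verify_egress_cost(3000)   # 3 TB, verified monthly
print(f"Verifying 3 TB monthly via full download: ~${annual:,.0f}/year")
```

Compare that with roughly $36/year to simply store the same 3 TB in a deep-archive tier, and the argument for spot checks plus trusted storage writes itself.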
Well, I scrub my storage, but I don’t run checks or prunes in duplicacy. If I did use prune, I would run checks (without parameters), because I don’t trust duplicacy’s prune. But that’s why I don’t run it… see the pattern here?
I’m largely gonna ignore most of what you said about bitrot, because it’s clear we’re not going to agree on the difference between ‘trust’ and verifying backups. But I want to cover your main premise above, that 1) you can’t do anything about bad chunks (wrong), hence 2) why bother testing (fundamentally wrong)…
Point 2 is perhaps the most important – knowing something has gone wrong. The number 1 cause of data loss is assuming you have a valid backup, then finding out you don’t. I’ve seen it far too many times in my line of work and in my early days of personal computing. It’s why everyone knows the golden rule of backups is to verify them – however which way you can.
As for point 1, repairing bad chunks IS trivially easy – I’ve done it several times with 100% success. Early detection drastically improves matters too.
You simply identify and delete the bad chunks, rename the snapshot directory (or individual revisions) out of the way, and copy from a known-good storage. Absolute worst case, you can start from scratch and avoid re-uploading most chunks by using fresh IDs. All pretty straightforward and, as proof, we’ve had a number of people come to this forum to repair their storages, saving both time and transit fees.
Failing that, let’s assume you can’t repair them. Now you know you have to replace/rebuild/redo your backups - because, at the very least, you found out - instead of blissfully being unaware that anything was wrong. This concept shouldn’t be controversial.
This is disingenuous. Nobody ever said you had to verify everything daily.
My view is that you should test each uploaded chunk at least once, no matter where it’s stored.
If the data then resides on ZFS, you can be sure there’s no bitrot (unless scrub says otherwise). If the data is in the cloud, I’d still test that shit at least yearly. Anywhere else, quarterly or monthly depending on Erasure Coding and other factors.
Even with raw disks, I employ constant monitoring with tools such as Stablebit Scanner and SnapRAID scrubs (completely automated). Yesterday was World Backup Day; I took the opportunity to verify my offsite backups with -chunks (directly on the box) and doing a test restore of my home directory. (I was gonna schedule a complete -hash backup but will do it at the weekend.) I don’t fuck around.
You don’t trust Duplicacy’s prune, but are confident that no bugs exist in Duplicacy (or its interaction with the storage API) that may result in bad data. Well, you have a non-zero chance of bad chunks… but the worst thing is, you won’t know about it.
The whole contention point is about where to draw a line in what to trust. I trust more things, you trust fewer. That’s the only difference there.
My point here is that you cannot know with 100% certainty that something went wrong without periodic full restores. With cold storage this is not feasible, so this whole idea of “100% knowledge” is out the window. Now we are left with degrees of confidence, and with making choices that increase the probability of success – such as using duplicacy and not kopia, using AWS and not iDrive, etc. But you can never know for sure whether your backup has rotted or hasn’t. Nor should you need to. Some degree of assurance is enough. Nothing in this world comes with a 100% guarantee.
You have a skewed perspective. It’s selection bias. It’s like doctors, who meet way more sick people than normal due to their occupation. To them it may appear there is a disproportionate number of sick people compared to what a neutral observer sees.
Dude. It’s trivially easy for you. It’s not trivially easy for me.
Nor is it for an average neurosurgeon who does not want to learn about snapshots and chunking and just wants his or her photos backed up reliably.
Yeah, very straightforward indeed. Look at those sentences from a regular user’s point of view. (And yes, I know what you are talking about, but again, I’m not an average user, due to an occupation related to software, so to speak.)
And that would be great – except it’s impossible; see below and above.
No, it’s not disingenuous, it’s hyperbole, for illustrative purposes. You suggest a full restore periodically, and this rules out cheap cold storage. How much are you willing to pay for the knowledge that a few chunks may have rotted? I make a choice to accept a small chance of that, and I reduce that chance by using a storage provider who I believe knows what they are doing. I’m not prepared to pay hot-storage prices for my backup for the privilege of wasting energy and traffic periodically downloading the full dataset.
It isn’t in isolation. It is in the real world, where risk has a cost, and you should optimize both.
But can you? Maybe it rotted just after the scrub? Maybe it rotted twice within the same record and now you can’t heal? See, even here some trust and acceptance of risk is required.
If doing all that lets you sleep better at night, and negates the skewed perspective (see the doctor reference above), then by all means.
I just scrub arrays periodically.
Nope. I’m confident that prune, which touches history, has a higher chance of messing things up (and it does have known issues with ghost snapshots) than backup, which only adds data. So I don’t use high-risk prune and do use low-risk backup. Again, it’s not black and white, it’s a spectrum. What I am confident of is that everything has bugs.
Right. And I’m fine with this. Because it’s a case of the cure being worse than the ailment: I’m not going to pay to egress everything from Glacier daily/monthly/annually/insert-any-cadence to tickle my distrust of Glacier. I accept some low chance of corruption, and in return I save thousands of dollars in egress from cold storage, or high fees for hot storage.
You draw the line elsewhere. It does not make you wrong nor me wrong.
To summarize: as soon as you accept that you cannot realistically validate every file in the full backup history periodically, it immediately turns into risk management. Different people have different risk tolerances. That’s all it is.
Can you point me to a write-up somewhere – ideally for working specifically with duplicacy, but if not that, then a general how-to – that expands on what you’ve covered quickly in this last sentence?
I realize I have basically NO idea what to do if a duplicacy check reports bad chunks. And while I get the general idea from what you say here, I’m not at all sure about how to carry that out as a specific set of steps. (Specifically, I don’t fully follow “rename the snapshot directory out of the way”.)
This is a guide on how to fix missing chunks. The idea is that you delete the bad chunk (identified from the logs) from the chunks folder on the storage, and then proceed through the guide to attempt to re-create it; or, if that fails, delete the affected snapshots.
What Droolio referred to above is the approach to take if you have another, copy-compatible duplicacy backup elsewhere from which you can source a good version of the bad chunk, instead of attempting a “new” backup into a temporary snapshot in the hope that the missing chunk will be re-created.
A few more points:
Duplicacy’s simple design makes it feasible to fix these issues manually, but users should not be expected to mess with storage internals; the healing feature should be part of the app.
If storage rot is a concern, erasure coding can be enabled to somewhat offset the risk: New Feature: Erasure Coding. (However, at that point I would just pick storage that does provide data-integrity guarantees, instead of paying more for bad storage.)
Trust is largely irrelevant with backup, because you’re supposed to verify, not just trust. Humans are often terrible at evaluating risk – including me, including you – which is why these common rules are so oft talked about: they’re basic wisdom garnered over decades of people losing data.
Any knowledge is better than 0% knowledge.
How did you come to the conclusion that kopia or iDrive wasn’t robust? You tested them, correct? You did a restore from B2 and determined they had a fault, correct? ISTM, had you (or other people) not done any of this testing, you’d be oblivious to this knowledge.
When was the last time you ran check -chunks on your AWS storage? These code paths in Duplicacy aren’t the same as the rest, nor as extensively tested. So what makes you so confident the given library is free of bugs? Particularly as the very nature of the backend dissuades more people from doing enough testing.
No, it’s real world experience. Making good backups requires discipline. It’s not hard, but there’s certain basics people can follow without being forced to evaluate every scenario for themselves.
When we take on new clients, for example, we rarely find their backups were properly tested – if present at all. There’s no disproportionality here; it’s statistically very likely most people are piss-poor at backup when they first get into it. And that lack of testing very commonly leads to bad backups. From experience.
Skill issue? Seriously, Duplicacy is a tool, with instructions on how to use said tool. You clearly have very different ideas about how dumbed down said tool should be, but it does require a minimum degree of knowledge. People don’t jump into cars and drive through streets without taking driving lessons.
Aside from specifics, there’s widely familiar best practices on what to do in general - i.e. 3-2-1 and verify - without getting into the inner workings. They need only follow these recommendations and IF they run into issues, they have 2 choices - start from scratch, or come to the forum and seek out simple instructions to repair.
Either way, they found out their backups were unreliable, and avoided the heartbreak of losing data, because they gave themselves the time to remedy the situation.
Good. I never said cold storage was wise at all, in fact I strongly believe the opposite.
Regardless of whether Duplicacy properly supports cold storage (and it doesn’t), you’re basically painting yourself into a corner just to save a few quid. Cold storage is meant for archival, not backup. Backup = the process of making reliable copies, which includes testing. Without testing, all you have is the indication of a copy, not a backup. This is well known. Use the correct tool for the job.
Verifying backups shouldn’t be controversial.
This scenario is complete nonsense and you know it. I scrub often (monthly + weekly) – precisely because knowing about it sooner allows me to fix it sooner. Every day that passes, every scrub that doesn’t error, multiplies the level of confidence in being able to restore from a backup. 3-2-1 multiplies that exponentially more.
I can agree with this assessment, but when you recommend to others here on this forum not to practice 3-2-1 or verify backups (as you have done), someone has to explain why your level of risk is so much higher than that of those who do.
Remember, this whole discussion boiled down to “don’t bother verifying, because you can’t do anything about it anyway” - which is patently ridiculous on both counts, no matter how much risk anyone is willing to tolerate.