Modern IT practices don’t really contemplate disaster recovery. Even organisations with strict backup procedures seldom test recovery (most never do).
Everything is quickly strapped together because teams are understaffed. Preparing infrastructure so that it can easily be recreated is easily twice the effort of “just” setting it up the usual way.
Actually I think this is hard to implement properly. If you're a small shop, really setting up backups with redundancies, writing the documentation, and testing disaster recovery is so much more work than people anticipate, and it has implications for all areas of the business, not just IT. So it's usually hard to justify to management why you would put in all that work and slow down operations, which leads to everyone postponing it.
Either that bites you sooner or later, or you're lucky and grow; suddenly you're a larger organisation, and there are way too many moving parts to start from scratch. So you make a half-hearted attempt at a backup strategy held together by duct tape and hope, one that kinda-sorta should work in the worst case, write some LLM-assisted documentation that nobody ever reads, and carry on. You're understaffed and overworked anyway, people are engaging in shadow IT, and your actual responsibilities demand attention, so that's the best you can do.
And then you've grown even bigger, you're a reputable company now, and the consultants and auditors and customers with certification requirements come in. That's when you actually have to put in the work, and it's going to be a long, gruesome, exhausting, and expensive project. Provided, of course, that nobody fucks up in the meantime.
Indeed. To me, setting up infrastructure properly and documenting it properly is even more complex than coding.
I can go back to code I wrote months or years ago, and assuming I architected and documented it idiomatically, it takes me only a bit of time to start being able to reason about it effectively.
With infrastructure it is a whole different story. Within weeks of not touching it (which happens if it just works) I start to have trouble retaining a good mental model of it. If I have to dig into it, I'll have to spend a lot of time getting re-acquainted with how it all fits together again.
As much as Cloudformation and Terraform annoy me (thankfully I’ve never been burdened with k8s) there is something magical about having your infrastructure captured in code.
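There is no actual code in this thread, so just to illustrate the point, here is a minimal, hypothetical sketch of infrastructure-as-code using the AWS CDK for Python (which synthesizes CloudFormation); the stack and bucket names are made up for the example:

```python
# Illustrative infrastructure-as-code sketch (AWS CDK v2, Python).
# Resource and stack names are placeholders, not from this thread.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class BackupStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A versioned bucket for backups; versioning gives a basic recovery point.
        s3.Bucket(
            self,
            "BackupBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,  # keep the data even if the stack is deleted
        )

app = App()
BackupStack(app, "dr-backup-stack")
app.synth()  # emits a CloudFormation template that can recreate the bucket anywhere
```

The point is the same whether it's CDK, raw CloudFormation, or Terraform: the template, not the console, is the source of truth, so recreating the environment is a re-deploy rather than an archaeology project.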
Virtualization really helps. We have a lot of weird software which requires hardware dongles, but they're all USB dongles and they're all virtualized; one of the DC racks has a few U worth of nothing but USB socket -> dongle wiring, so that if we spin up a VM it can say "Hey, give me a USB socket with a FooCorp OmniBloat dongle on it" and get one, unless they're all in use.
An interoperability exception might allow this in exigent circumstances when you do have a valid license, but I wouldn’t do it without running it by the software vendor whose license you are using. In a recovery situation you’ll probably need to be on the phone a lot, so I can see how you might think it’s quicker to bypass the license check, but that is one person giving some or all of their attention just to that. Disaster recovery isn’t a one-person job unless that one person was the whole team anyway, so I think this idea needs to be calibrated against realistic expectations.
It really depends on the scenario, but if the application was dockerized and they had an image, it would just be a matter of starting it again somewhere else.
Possibly with the same network settings, if the licensing check was based on those.
But of course it can easily go south, though testing the recovery of a container based off an image and a mounted volume is simple and quickly shows you whether it works or not (see the sketch at the end of this comment).
But of course it may work today and not tomorrow, because the software was not ready for Y2K and according to it we are in the 20th century or something and the license is 156 years ... young. Cannot allow this nonsense to proceed, call us at <defunct number>.
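Picking up the container-recovery point above: a minimal sketch of what such a restore test could look like, here using the Docker SDK for Python. The image tag, volume path, port, and health URL are placeholders, and the MAC address line only matters if the license check really keys on network identity as speculated above:

```python
# Hypothetical restore test: start the backed-up image with its data volume
# on a fresh host and check that the service answers. All names are placeholders.
import time

import docker
import requests

client = docker.from_env()

container = client.containers.run(
    "registry.example.com/legacy-app:backup-2024-01",   # image restored from backup
    detach=True,
    ports={"8080/tcp": 8080},
    volumes={"/restore/app-data": {"bind": "/var/lib/app", "mode": "rw"}},
    mac_address="02:42:ac:11:00:42",  # only if the license check is tied to the MAC
)

try:
    time.sleep(10)  # crude wait for the service to come up
    resp = requests.get("http://localhost:8080/health", timeout=30)
    print("recovery test:", "OK" if resp.ok else f"failed ({resp.status_code})")
finally:
    container.stop()
    container.remove()
```

Run something like this on a schedule against the latest backup image and volume snapshot and you at least know whether the thing starts, long before you need it to.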
"hardware" does not mean "bare metal". It could be a MAC, a serial number or similar things that may be linked to a generic or clonable value in virtualization.
To some extent, yes -- speaking as someone who has developed dockerized apps and managed virtualization systems (ESXi and similar) as well as Docker engines.
If you’re doing it right, the DR process is basically the deployment process, and it gets tested every time you do a deployment. We used Chef and Docker, stored snapshot images, and every deploy basically spun up new infrastructure from scratch; once it had passed the automated tests, the load balancers would switch to the new instance (roughly the flow sketched after this comment). DBs were created from binary snapshots which would then slave off the live DB to catch up (never more than an hour of diff), which also ensured we had a continuously tested DB backup process. The previous instance would get torn down after 8 hours, which was long enough to let any straggling processes finish and to have somewhere to roll back to if needed.
This all got stored in the cloud, but also locally in our office, and also written onto a DVD-R, all automatically, all verified each time.
Our absolute worst case scenario would be less than an hour of downtime, less than an hour of data loss.
Similarly, our dev environments were a watered-down version of the live environment, so if they were somehow lost they could be restored in the same manner - and again, frequently tested, as any merge into the preprod branch would trigger a new dev environment to spin up automatically with that codebase.
It took up-front engineering effort to get in place, but it ended up saving our bacon twice and made our entire pipeline much easier and faster to manage.
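For readers who prefer code to prose, here is a very rough, hypothetical Python sketch of the flow described in this comment; every function is a placeholder standing in for the Chef/Docker/cloud tooling involved, not the commenter's actual setup:

```python
# Rough sketch of "every deploy is a DR drill": build fresh infrastructure,
# restore the DB from backup, test, cut over, keep the old env around to roll back.
# All functions are placeholders for real provisioning/replication/LB tooling.
from typing import Optional

def build_environment(image_tag: str) -> str:
    """Provision fresh hosts and start containers for image_tag; return an env id."""
    return f"env-{image_tag}"

def restore_db_from_snapshot(env: str) -> None:
    """Load the latest binary snapshot, then replicate from the live DB to catch up."""

def run_smoke_tests(env: str) -> bool:
    """Hit the new environment's health and acceptance endpoints."""
    return True

def switch_load_balancer(env: str) -> None:
    """Point the load balancers at the new environment."""

def schedule_teardown(env: str, after_hours: int) -> None:
    """Tear the old environment down later, keeping it as a rollback target until then."""

def deploy(image_tag: str, previous_env: Optional[str]) -> str:
    new_env = build_environment(image_tag)    # fresh infra from scratch, every deploy
    restore_db_from_snapshot(new_env)         # proves the DB backups actually restore
    if not run_smoke_tests(new_env):
        raise RuntimeError("new environment failed tests; live traffic untouched")
    switch_load_balancer(new_env)             # cut over
    if previous_env:
        schedule_teardown(previous_env, after_hours=8)
    return new_env

if __name__ == "__main__":
    deploy("app-2024-01-15", previous_env="env-app-2024-01-08")
```

The useful property is that the recovery path is not a separate, rarely exercised runbook; it is the same code path that runs on every release.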
> Modern IT practices don’t really contemplate disaster recovery. Even organisations with strict backup procedures seldom test recovery (most never do).
I think this is an outdated view. In modern enterprises DR is often one of the most crucial (and difficult) steps in building the whole infra. You select what is critical for you, you allocate the budget, you test it, and you plan the date of the next test.
However, I'd say it's very rare to do DR of everything. It's terribly expensive and problematic. You need to choose what's really important to you based on defined budgets.
That's a choice that companies make. I've certainly worked at places which don't test DR, while at my current job we do annual DR runs, where we'll bring up a complete production-ready environment from scratch to prove that the backups work and that the runbook for doing a restore actually works.
I'm retired now, but the last place I worked estimated it would take months to do a full restore from off-site backups, assuming that the data center and hardware were intact. If the data center was destroyed... longer.
Say what you want about European financial organizations, but they are legally obliged to practice their recovery strategies. So every other month, production clusters with all user data get torn down in one cloud region and set up in another one overnight. This works surprisingly well. I guess they would never do that without the legal requirements.