Modern IT practices don’t really contemplate disaster recovery. Even organisations with strict backup procedures seldom test recovery (most never do).
Everything is quickly strapped together because teams are understaffed. Preparing infrastructure so that it can easily be recreated is easily twice the effort of “just” setting it up the usual way.
Actually I think this is hard to implement properly. If you're a small shop, really setting up backups with redundancies, writing the documentation, and testing disaster recovery is so much more work than people anticipate, and it has implications for all areas of the business, not just IT. So it's usually hard to justify to management why you would put in all that work and slow down operations, which leads to everyone postponing it.
Either that bites you sooner or later, or you're lucky and grow; suddenly you're a larger organisation, and there are way too many moving parts to start from scratch. So you make a half-hearted attempt at a backup strategy held together by duct tape and hope, one that kinda-sorta should work in the worst case, write some LLM-assisted documentation that nobody ever reads, and carry on. You're understaffed and overworked anyway, people are engaging in shadow IT, and your actual responsibilities demand attention, so that's the best you can do.
And then you've grown even bigger, you're a reputable company now, and the consultants and auditors and customers with certification requirements come in. That's when you actually have to put in the work, and it's going to be a long, gruesome, exhausting, and expensive project. Provided, of course, that nobody fucks up in the meantime.
Indeed. To me, setting up infrastructure properly and documenting it properly is even more complex than coding.
I can go back to code I wrote months or years ago, and assuming I architected and documented it idiomatically, it takes me only a bit of time to start being able to reason about it effectively.
With infrastructure it is a whole different story. Within weeks of not touching it (which happens if it just works) I start to have trouble retaining a good mental model of it. If I have to dig into it, I'll have to spend a lot of time getting re-acquainted with how it all fits together again.
As much as Cloudformation and Terraform annoy me (thankfully I’ve never been burdened with k8s) there is something magical about having your infrastructure captured in code.
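There is no actual code in this thread, so just to illustrate the point, here is a minimal, hypothetical sketch of infrastructure-as-code using the AWS CDK for Python (which synthesizes CloudFormation); the stack and bucket names are made up for the example:

```python
# Illustrative infrastructure-as-code sketch (AWS CDK v2, Python).
# Resource and stack names are placeholders, not from this thread.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class BackupStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A versioned bucket for backups; versioning gives a basic recovery point.
        s3.Bucket(
            self,
            "BackupBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,  # keep the data even if the stack is deleted
        )

app = App()
BackupStack(app, "dr-backup-stack")
app.synth()  # emits a CloudFormation template that can recreate the bucket anywhere
```

The point is the same whether it's CDK, raw CloudFormation, or Terraform: the template, not the console, is the source of truth, so recreating the environment is a re-deploy rather than an archaeology project.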
Virtualization really helps. We have a lot of weird software which requires hardware dongles, but they're all USB dongles and they're all virtualized; one of the DC racks has a few U worth of nothing but USB socket -> dongle wiring, so that if we spin up a VM it can say "Hey, give me a USB socket with a FooCorp OmniBloat dongle on it" and get one, unless they're all in use.
An interoperability exception might allow this in exigent circumstances when you do have a valid license, but I wouldn’t do it without running it by the software vendor whose license you are using. In a recovery situation you’ll probably need to be on the phone a lot, so I can see how you might think it’s quicker to bypass the license check, but that is one person giving some or all of their attention just to that. Disaster recovery isn’t a one-person job unless that one person was the whole team anyway, so I think this idea needs to be calibrated against realistic expectations.
It really depends on the scenario, but if the application was dockerized and they had an image, it would just be a matter of starting it again somewhere else.
Possibly with the same network settings, if the licensing check was based on those.
But of course it can easily go south, though testing the recovery of a container based off an image and a mounted volume is simple and quickly shows you whether it works or not (see the sketch at the end of this comment).
But of course it may work today and not tomorrow, because the software was not ready for Y2K and according to it we are in the 20th century or something and the license is 156 years ... young. Cannot allow this nonsense to proceed, call us at <defunct number>.
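Picking up the container-recovery point above: a minimal sketch of what such a restore test could look like, here using the Docker SDK for Python. The image tag, volume path, port, and health URL are placeholders, and the MAC address line only matters if the license check really keys on network identity as speculated above:

```python
# Hypothetical restore test: start the backed-up image with its data volume
# on a fresh host and check that the service answers. All names are placeholders.
import time

import docker
import requests

client = docker.from_env()

container = client.containers.run(
    "registry.example.com/legacy-app:backup-2024-01",   # image restored from backup
    detach=True,
    ports={"8080/tcp": 8080},
    volumes={"/restore/app-data": {"bind": "/var/lib/app", "mode": "rw"}},
    mac_address="02:42:ac:11:00:42",  # only if the license check is tied to the MAC
)

try:
    time.sleep(10)  # crude wait for the service to come up
    resp = requests.get("http://localhost:8080/health", timeout=30)
    print("recovery test:", "OK" if resp.ok else f"failed ({resp.status_code})")
finally:
    container.stop()
    container.remove()
```

Run something like this on a schedule against the latest backup image and volume snapshot and you at least know whether the thing starts, long before you need it to.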
"hardware" does not mean "bare metal". It could be a MAC, a serial number or similar things that may be linked to a generic or clonable value in virtualization.
To some extent, yes -- speaking as someone who has developed dockerized apps and managed virtualization systems (ESXi and similar) as well as Docker engines.
If you’re doing it right, the DR process is basically the deployment process, and it gets tested every time you do a deployment. We used Chef and Docker, stored snapshot images, and every deploy basically spun up new infrastructure from scratch; once it had passed the automated tests, the load balancers would switch to the new instance (roughly the flow sketched after this comment). DBs were created from binary snapshots which would then slave off the live DB to catch up (never more than an hour of diff), which also ensured we had a continuously tested DB backup process. The previous instance would get torn down after 8 hours, which was long enough to let any straggling processes finish and to have somewhere to roll back to if needed.
This all got stored in the cloud, but also locally in our office, and also written onto a DVD-R, all automatically, all verified each time.
Our absolute worst case scenario would be less than an hour of downtime, less than an hour of data loss.
Similarly, our dev environments were a watered-down version of the live environment, so if they were somehow lost they could be restored in the same manner - and again, frequently tested, as any merge into the preprod branch would trigger a new dev environment to spin up automatically with that codebase.
It took up-front engineering effort to get in place, but it ended up saving our bacon twice and made our entire pipeline much easier and faster to manage.
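For readers who prefer code to prose, here is a very rough, hypothetical Python sketch of the flow described in this comment; every function is a placeholder standing in for the Chef/Docker/cloud tooling involved, not the commenter's actual setup:

```python
# Rough sketch of "every deploy is a DR drill": build fresh infrastructure,
# restore the DB from backup, test, cut over, keep the old env around to roll back.
# All functions are placeholders for real provisioning/replication/LB tooling.
from typing import Optional

def build_environment(image_tag: str) -> str:
    """Provision fresh hosts and start containers for image_tag; return an env id."""
    return f"env-{image_tag}"

def restore_db_from_snapshot(env: str) -> None:
    """Load the latest binary snapshot, then replicate from the live DB to catch up."""

def run_smoke_tests(env: str) -> bool:
    """Hit the new environment's health and acceptance endpoints."""
    return True

def switch_load_balancer(env: str) -> None:
    """Point the load balancers at the new environment."""

def schedule_teardown(env: str, after_hours: int) -> None:
    """Tear the old environment down later, keeping it as a rollback target until then."""

def deploy(image_tag: str, previous_env: Optional[str]) -> str:
    new_env = build_environment(image_tag)    # fresh infra from scratch, every deploy
    restore_db_from_snapshot(new_env)         # proves the DB backups actually restore
    if not run_smoke_tests(new_env):
        raise RuntimeError("new environment failed tests; live traffic untouched")
    switch_load_balancer(new_env)             # cut over
    if previous_env:
        schedule_teardown(previous_env, after_hours=8)
    return new_env

if __name__ == "__main__":
    deploy("app-2024-01-15", previous_env="env-app-2024-01-08")
```

The useful property is that the recovery path is not a separate, rarely exercised runbook; it is the same code path that runs on every release.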
> Modern IT practices don’t really contemplate disaster recovery. Even organisations with strict backup procedures seldom test recovery (most never do).
I think this is an outdated view. In modern enterprises DR is often one of the most crucial (and difficult) steps in building the whole infra. You select what is critical for you, you allocate the budget, you test it, and you plan the date of the next test.
However, I'd say it's very rare to do DR of everything. It's terribly expensive and problematic. You need to choose what's really important to you based on defined budgets.
That's a choice that companies make. I've certainly worked at places which don't test DR, while at my current job we do annual DR runs, where we'll bring up a complete production-ready environment from scratch to prove that the backups work and that the runbook for doing a restore actually works.
I'm retired now, but the last place I worked estimated it would take months to do a full restore from off-site backups, assuming that the data center and hardware were intact. If the data center was destroyed... longer.
Say what you want about European financial organizations, but they are legally obliged to practice their recovery strategies. So every other month, production clusters with all user data get torn down in one cloud region and set up in another one overnight. This works surprisingly well. I guess they would never do that without the legal requirements.