
In that example, the banks wouldn't compensate people if their payment system was down and couldn't process orders. That's really what happened here: loss of sales on the clients' end. If it were securities, then they would have as-of processing to fix it.

But the same argument about lack of testing can be made for the companies using the system. Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill? Honestly, there's blame all around, because almost nobody is doing it right, and even the ones who are still have failures.
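
To make that concrete, here's a rough sketch of the kind of dependency-outage test I mean, in Python. Everything in it (the pricing endpoint, the cached fallback) is a made-up placeholder, not anything from this thread; the point is just that you deliberately point at a dead dependency and assert the system degrades instead of falling over.

    import json
    import urllib.error
    import urllib.request

    DEPENDENCY_URL = "http://127.0.0.1:9/prices"  # nothing listens here, so the call fails fast

    def load_cached_prices():
        # Stand-in for reading a last-known-good snapshot from disk.
        return {"widget": 9.99}

    def fetch_prices(url, timeout=2):
        """Call the pricing dependency; fall back to cached data if it is unreachable."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp), "live"
        except (urllib.error.URLError, OSError, TimeoutError):
            return load_cached_prices(), "degraded"

    def test_survives_dependency_outage():
        prices, mode = fetch_prices(DEPENDENCY_URL)
        assert mode == "degraded"   # the outage was detected...
        assert prices               # ...and we still returned something usable

    if __name__ == "__main__":
        test_survives_dependency_outage()
        print("dependency-outage drill passed")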




> Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill?

Yes, but this forced them to run the worst DR routine short of Microsoft going rogue. The scale of the testing problem is orders of magnitude larger on the vendor's side: people trusted them to be minimally competent and they just weren't.


Not going to lie, you sound like you've never made any mistakes ever. The problematic culture exists on both sides. Anyone who works in infrastructure, SCCM packaging, or data centers knows you roll out updates to an isolated set of machines first to test them. Trust no one. If they really relied on the vendor that much, they should have had penalties in their contracts. If not, that would itself be an example of not being minimally competent; any enterprise sourcing team would look into this.
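
For the "isolated set of machines first" part, here's a minimal sketch of a canary gate in Python, assuming hypothetical deploy_to/is_healthy helpers rather than any real SCCM or MDM API: ship to a small ring, let it soak, check health, and only then touch the rest of the fleet.

    import time

    # Hypothetical fleet split: a small canary ring plus everything else.
    CANARY_RING = ["host-01", "host-02"]
    PRODUCTION_RING = [f"host-{i:02d}" for i in range(3, 50)]

    def deploy_to(hosts, package):
        # Placeholder for whatever actually ships the update (SCCM, MDM, ssh...).
        print(f"deploying {package} to {len(hosts)} hosts")

    def is_healthy(host):
        # Placeholder health probe: did the machine come back up and check in?
        return True

    def staged_rollout(package, soak_seconds=3600):
        deploy_to(CANARY_RING, package)
        time.sleep(soak_seconds)  # let the canaries run before judging them
        if not all(is_healthy(h) for h in CANARY_RING):
            raise RuntimeError(f"halting rollout: canary failed for {package}")
        deploy_to(PRODUCTION_RING, package)

    if __name__ == "__main__":
        staged_rollout("sensor-update-1.2.3", soak_seconds=0)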


Nobody does much in ops without making mistakes. I've done all of the roles you mentioned and learned from mistakes and oversights, which is why I know that you should start by looking at what you assume is available in each scenario. For example, I've twice seen prolonged outages in a data center due to failures in the power distribution equipment during a failover test: the physical plant guys had checked the UPS/battery systems but hadn't thought about what would happen if the distribution gear itself failed. They then learned that spare parts were out of stock in Southern California and the manufacturer had to put someone on a plane from Colorado, which meant we had to roll the failover to another data center. All of us had technically known that redundant hardware didn't mean two parts couldn't fail simultaneously, but we had incorrectly assumed the odds of a correlated failure were much lower, or that the vendor a couple of miles away would be able to fix a failed unit.

I mention that last example because that's what happened to a lot of people here. They had DR plans that assumed they had their management infrastructure, or could quickly bring it back online, but then CrowdStrike took out things like the servers holding their BitLocker recovery keys and other critical infrastructure. One of the under-appreciated outcomes of the general push to secure everything is that a lot of systems are now less robust, because they depend on a few security-critical components with no easy path to recovery if those fail. Full disk encryption is great for preventing data loss, but it also means key management is mission critical in a way senior management probably didn't fully appreciate when setting funding plans.
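
On the key-management point, here's a hedged sketch in Python of what out-of-band escrow could look like. The fetch_recovery_keys source and the plain CSV snapshot are purely illustrative (in practice you'd pull from AD/Entra or your MDM and encrypt the export); the idea is only that recovery material should land somewhere that doesn't depend on the same endpoints and management servers it's meant to recover.

    import csv
    import pathlib
    from datetime import datetime, timezone

    def fetch_recovery_keys():
        # Placeholder for whatever system of record actually holds the keys
        # (AD/Entra ID, an MDM, a key-management service). Entirely hypothetical here.
        return [
            {"hostname": "laptop-001", "key_id": "AB12", "recovery_key": "111111-222222-333333"},
        ]

    def escrow_offline(records, out_dir):
        """Write a dated snapshot of recovery keys to storage that does not depend on
        the endpoints, the management servers, or the agent that protects them."""
        out = pathlib.Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = out / f"recovery-keys-{stamp}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["hostname", "key_id", "recovery_key"])
            writer.writeheader()
            writer.writerows(records)
        return path

    if __name__ == "__main__":
        print("escrowed to", escrow_offline(fetch_recovery_keys(), out_dir="./offline-escrow"))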





