Nobody does much in ops without making mistakes. I've done all of the roles you mentioned and learned from mistakes and oversights, which is why I know you should start by looking at what you assume is available in each scenario. For example, I've twice seen prolonged outages in a data center caused by failures in the power distribution equipment during a failover test. The physical plant guys had checked the UPS/battery systems but hadn't thought about what would happen if that component itself failed, and then learned that spare parts were out of stock in Southern California and the manufacturer had to put someone on a plane from Colorado, which meant we had to roll the failover to another data center. All of us had technically known that redundant hardware didn't protect us if both of the n=2 parts failed at the same time, but we had incorrectly assumed the odds of a correlated failure were much lower, or that the vendor a couple of miles away would be able to fix a failed unit.

I mention that last part because that's what happened to a lot of people here. They had DR plans that assumed their management infrastructure would be available or could quickly be brought back online, but then they had things like CrowdStrike taking out the very servers holding their BitLocker recovery keys and other critical infrastructure. One of the under-appreciated outcomes of the general push to secure things is that a lot of systems are now less robust because they depend on a few security-critical components with no easy path to recovery if those fail. Full disk encryption is great from a data-loss-prevention perspective, but it also means key management is mission critical in a way senior management probably didn't fully appreciate when setting funding plans.
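
To make the key-management point concrete, here's a minimal sketch (Python; the offline store path and file naming are hypothetical, and it has to run elevated) of escrowing BitLocker recovery information somewhere that doesn't depend on the same servers an endpoint agent could take out. It just shells out to the standard manage-bde CLI and writes the protector listing to an offline location:

    # Sketch: periodically escrow BitLocker recovery protectors to an offline
    # store that does not depend on the AD/management servers the DR plan assumes.
    # OFFLINE_STORE is a hypothetical path, e.g. encrypted removable media kept
    # offsite; manage-bde is the standard Windows CLI for querying protectors
    # and needs to be run from an elevated prompt.
    import datetime
    import pathlib
    import subprocess

    OFFLINE_STORE = pathlib.Path(r"E:\dr-escrow")  # hypothetical offline location

    def dump_recovery_protectors(volume: str = "C:") -> str:
        """Return the protector listing (including recovery passwords) for a volume."""
        result = subprocess.run(
            ["manage-bde", "-protectors", "-get", volume],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    def escrow(volume: str = "C:") -> pathlib.Path:
        """Write the current protector listing to the offline store, timestamped."""
        OFFLINE_STORE.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        out_file = OFFLINE_STORE / f"bitlocker-{volume.rstrip(':')}-{stamp}.txt"
        out_file.write_text(dump_recovery_protectors(volume))
        return out_file

    if __name__ == "__main__":
        print(f"Escrowed recovery info to {escrow('C:')}")

The specifics matter less than the property: the recovery material has to be readable when the management plane itself is down, which is exactly the scenario the CrowdStrike-affected shops found they hadn't funded.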
