
In that example, the banks wouldn't compensate people if their payment system was down and couldn't process orders. That's really what happened here: loss of sales on the clients' end. If it were securities, then they would have as-of processing to fix it.

But the same argument about lack of testing can be made for the companies using the system. Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill? Honestly, there's blame all around, because almost nobody is doing it right, and even the ones who are still have failures.
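
To make that concrete, here's a rough sketch of the kind of dependency-outage test I mean, in Python. Everything in it (the pricing endpoint, the cached fallback) is a made-up placeholder, not anything from this thread; the point is just that you deliberately point at a dead dependency and assert the system degrades instead of falling over.

    import json
    import urllib.error
    import urllib.request

    DEPENDENCY_URL = "http://127.0.0.1:9/prices"  # nothing listens here, so the call fails fast

    def load_cached_prices():
        # Stand-in for reading a last-known-good snapshot from disk.
        return {"widget": 9.99}

    def fetch_prices(url, timeout=2):
        """Call the pricing dependency; fall back to cached data if it is unreachable."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp), "live"
        except (urllib.error.URLError, OSError, TimeoutError):
            return load_cached_prices(), "degraded"

    def test_survives_dependency_outage():
        prices, mode = fetch_prices(DEPENDENCY_URL)
        assert mode == "degraded"   # the outage was detected...
        assert prices               # ...and we still returned something usable

    if __name__ == "__main__":
        test_survives_dependency_outage()
        print("dependency-outage drill passed")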




> Do they have tests for what happens when pieces of their systems or infrastructure go down? Did they ever do a disaster recovery drill?

Yes, but this forced them to run the worst DR routine short of Microsoft going rogue. The scale of the testing problem is orders of magnitude larger on the vendor's side: people trusted them to be minimally competent and they just weren't.


Not going to lie, you sound like you've never made any mistakes ever. The problematic culture exists on both sides. Anyone who works in infrastructure, SCCM packaging, or data centers knows you roll out updates to an isolated set of machines first to test them. Trust no one. If they really relied on the vendor that much, they should have had penalties in their contracts. If not, that would itself be an example of not being minimally competent; any enterprise sourcing team would look into this.
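
For the "isolated set of machines first" part, here's a minimal sketch of a canary gate in Python, assuming hypothetical deploy_to/is_healthy helpers rather than any real SCCM or MDM API: ship to a small ring, let it soak, check health, and only then touch the rest of the fleet.

    import time

    # Hypothetical fleet split: a small canary ring plus everything else.
    CANARY_RING = ["host-01", "host-02"]
    PRODUCTION_RING = [f"host-{i:02d}" for i in range(3, 50)]

    def deploy_to(hosts, package):
        # Placeholder for whatever actually ships the update (SCCM, MDM, ssh...).
        print(f"deploying {package} to {len(hosts)} hosts")

    def is_healthy(host):
        # Placeholder health probe: did the machine come back up and check in?
        return True

    def staged_rollout(package, soak_seconds=3600):
        deploy_to(CANARY_RING, package)
        time.sleep(soak_seconds)  # let the canaries run before judging them
        if not all(is_healthy(h) for h in CANARY_RING):
            raise RuntimeError(f"halting rollout: canary failed for {package}")
        deploy_to(PRODUCTION_RING, package)

    if __name__ == "__main__":
        staged_rollout("sensor-update-1.2.3", soak_seconds=0)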


Nobody does much in ops without making mistakes. I've done all of the roles you mentioned and learned from mistakes and oversights, which is why I know that you should start by looking at what you assume is available in each scenario. For example, I've twice seen prolonged outages in a data center due to failures in the power distribution equipment during a failover test: the physical plant guys had checked the UPS/battery systems but hadn't thought about what would happen if the distribution gear itself failed. They then learned that spare parts were out of stock in Southern California and the manufacturer had to put someone on a plane from Colorado, which meant we had to roll the failover to another data center. All of us had technically known that redundant hardware didn't mean two parts couldn't fail simultaneously, but we had incorrectly assumed the odds of a correlated failure were much lower, or that the vendor a couple of miles away would be able to fix a failed unit.

I mention that last example because that's what happened to a lot of people here. They had DR plans that assumed they had their management infrastructure, or could quickly bring it back online, but then CrowdStrike took out things like the servers holding their BitLocker recovery keys and other critical infrastructure. One of the under-appreciated outcomes of the general push to secure everything is that a lot of systems are now less robust, because they depend on a few security-critical components with no easy path to recovery if those fail. Full disk encryption is great for preventing data loss, but it also means key management is mission critical in a way senior management probably didn't fully appreciate when setting funding plans.
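
On the key-management point, here's a hedged sketch in Python of what out-of-band escrow could look like. The fetch_recovery_keys source and the plain CSV snapshot are purely illustrative (in practice you'd pull from AD/Entra or your MDM and encrypt the export); the idea is only that recovery material should land somewhere that doesn't depend on the same endpoints and management servers it's meant to recover.

    import csv
    import pathlib
    from datetime import datetime, timezone

    def fetch_recovery_keys():
        # Placeholder for whatever system of record actually holds the keys
        # (AD/Entra ID, an MDM, a key-management service). Entirely hypothetical here.
        return [
            {"hostname": "laptop-001", "key_id": "AB12", "recovery_key": "111111-222222-333333"},
        ]

    def escrow_offline(records, out_dir):
        """Write a dated snapshot of recovery keys to storage that does not depend on
        the endpoints, the management servers, or the agent that protects them."""
        out = pathlib.Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = out / f"recovery-keys-{stamp}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["hostname", "key_id", "recovery_key"])
            writer.writeheader()
            writer.writerows(records)
        return path

    if __name__ == "__main__":
        print("escrowed to", escrow_offline(fetch_recovery_keys(), out_dir="./offline-escrow"))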





