
It really depends on the size of the system and the definition of fault tolerance. If I have a website calling out to 10 APIs and one API failure takes down the site, that is not fault tolerance. If that one API failure gets caught and the rest operate as normal, that is fault tolerance, but 10% of the system is down. If you go to almost any site and open the dev console, you'll see errors coming from parts of the system while the page keeps working; that is fault tolerance. Any twin-engine airplane is fault tolerant...until both engines fail. I would say the solar system is fault tolerant, the universe even more so if you consider it a system.
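To make the 10-API case concrete, here is a minimal sketch of catching one failing call so the other nine still render. The names (`fetch_all`, `flaky`, `api_0` through `api_9`) are made up for illustration, and the "APIs" are plain local functions standing in for network calls, so treat it as a toy rather than how any real site does it.

```python
from typing import Callable, Dict, Tuple


def fetch_all(
    sources: Dict[str, Callable[[], str]]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Call every source; record failures instead of letting one abort the rest."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for name, fetch in sources.items():
        try:
            results[name] = fetch()
        except Exception as exc:  # isolate the fault to this one source
            failures[name] = repr(exc)
    return results, failures


def flaky() -> str:
    # Hypothetical stand-in for the one API that is down.
    raise TimeoutError("upstream timed out")


# Nine healthy stand-in APIs plus one failing one.
sources: Dict[str, Callable[[], str]] = {
    f"api_{i}": (lambda i=i: f"payload {i}") for i in range(9)
}
sources["api_9"] = flaky

results, failures = fetch_all(sources)
print(f"{len(results)} of {len(sources)} sources OK; degraded: {sorted(failures)}")
```

Without the `try`/`except`, the first exception would propagate and nothing would render, which is the "one API failure takes down the site" case.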

tl;dr: there are levels to fault tolerance, and I understand what you are saying. I am not sure if you are advocating for getting rid of fault handling, but generally you can mitigate the big scary monsters, and what is left is the true edge cases. There really is no stopping one of those from time to time, given we live in a world where anything can happen at any time.

This instance really seems like a human-related error around deployment standards...and humans will always make mistakes.




Well, you usually put a load balancer in front of multiple instances of your service to handle individual server failures. In the basic no-LB case, your single server fails, you restart it, and you move on (a local failure). In the load balancer case, the LB introduces its own global risks. The load balancer itself can fail, in which case you restart it. But it can also have a bug and stop handling sticky sessions while your servers are relying on them. Now you have a much harder to track brown-out that affects every one of your users for a longer time. It's hard to diagnose, it might leave you with hard-to-fix data and transaction issues, and restarting the whole thing might not be enough.

So yeah, there is no fault tolerance if the timeframe is large enough; there are just fewer events, with much higher costs. It's a tradeoff.

The cynic in me thinks that the one advantage of these complex CYA systems is that when they fail catastrophically, like CrowdStrike did, we can all "outsource" the blame to them.



