As mentioned by DennisP (but I can't reply to his post for some reason), one of the design goals of Minix is to have drivers restarted seamlessly so the user can continue uninterrupted.
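For anyone who hasn't looked at it: Minix 3 runs drivers as user-space processes and has a "reincarnation server" that watches them and respawns one when it dies. Here's a toy sketch of that supervision loop in plain C — not actual Minix code, and "./disk_driver" is a made-up stand-in binary:

    /*
     * Toy sketch of a reincarnation-server-style supervisor, NOT actual
     * Minix 3 code: launch one user-space driver process and respawn it
     * whenever it dies.  "./disk_driver" is a hypothetical example binary.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return 1;
            }
            if (pid == 0) {
                /* Child: become the (hypothetical) user-space disk driver. */
                execl("./disk_driver", "disk_driver", (char *)NULL);
                perror("execl");
                _exit(127);
            }

            int status;
            if (waitpid(pid, &status, 0) < 0) {
                perror("waitpid");
                return 1;
            }
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                break;  /* clean shutdown requested: stop supervising */

            /* Crash or error exit: bring the driver back.  Whether the user
             * notices depends on whether in-flight requests can be retried. */
            fprintf(stderr, "driver died (status 0x%x), restarting\n", status);
            sleep(1);
        }
        return 0;
    }

Whether that restart actually looks seamless comes down to whether the layers above the driver can reissue whatever was in flight.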
The notion that drivers can just seamlessly restart is as much a fairy tale as the bug-free monolithic kernel. What does your filesystem do when the disk driver crashes? What does your app do? You're fucked all the way up the stack. Complex operations are going to smear their state across a variety of modules. Net result: you effectively have one big module anyway.
I guess that magic pixie dust must be a secret ingredient in HP's NonStop* architecture (which runs air traffic control, stock exchanges, etc.)? I suggest actually taking a look at Minix 3 and other fault-tolerant operating systems. Disk drivers infecting filesystems is a disease of the monolithic PC world.
* I have a friend who was an engineer for Tandem (now HP) in the '90s. They tested their servers in a demonstration for the government/defense department by taking them to a shooting range and running a benchmark while firing at them indiscriminately with automatic weaponry. The story goes that transaction processing declined precipitously as chips, blades, and motherboards were shattered: it went from millions, to thousands, to just a few dozen transactions per second, with no data loss, when a bullet clipped the serial jack they were using to log the benchmark. They got a very large order afterwards from the government/military.
I don't know if it actually happened (a Google search doesn't turn up anything), but having been shown by him the redundancy built into every level of their architecture, and having heard the stories about real failures in exchanges, air traffic control, and other critical never-turn-off deployments they run, I believe it could have. Reliable computing is possible.
Whatever magic pixie dust is in Minix, I'm pretty sure it's not going to suddenly make redundant CPUs sprout in my laptop. You're talking about something else entirely. I could just as easily say that if half of Google's data centers were nuked, they could still serve searches, just slower, and thereby claim to have proven Linux is utterly reliable.
Anyway, if you like anecdotes, I saw with my very own eyes the network cable between two OpenBSD firewalls chopped with an axe to no detrimental effect. So there. Monolithic kernels are superior to motherfucking axes.
The less-destructive version of this demonstration, when I first encountered one in the early '80s, was for someone to walk up to the machine, open a cabinet, and randomly pull out a (coffee-table-book-sized) card. No magic smoke, no screams of anguish, no sudden chatter from the console printing messages of lament from the operating system.
I managed Tandem NonStops and also Stratus FX machines.
Multiple redundant hardware paths, mirrored RAM, etc.
God they were awful. The conservatism of the design meant that although the hardware was fine, redundant, and reliable, the software was crap: user-hostile and buggy.
They would have been far better off building reliable clusters rather than making a single machine internally redundant.
And expensive. Something around a million dollars for a 75 MHz machine (Stratus) in 1997.
I agree with tedunangst: it's really all or nothing. I cannot think of any apps that achieve high stability through systematic fault recovery. Fault recovery is nice in itself, but it is never a good strategy for stability. Good code quality is.