
Honest question: is the title of this article sarcastic?



No, it's not. Maybe you don't know MINIX 3 very well ;)


I don't know it at all. Minix to me has a reputation as the punch line of a joke about how to design kernels for the real world. (I don't claim that it's true, just that that's what I associate it with.)

Can you sell me on why I should be interested?


That was 20 years ago; things change in that amount of time. It was also a lot of propaganda and not a lot of evidence.

But hey, it's not like programmers need to update their Internal Information Hashmap. Just put something in there once and leave it alone since that thing is delicate and updating it can sometimes crash your mind.


Most youngsters (20s and under) will read this as a random snide comment. Unfortunately it (still) is a key observation about how our knowledge-update heuristics lag behind progress in tech.


Zed, you realize that I asked because I wanted to change my IIH, right?


"Ten years ago, most computer users were young people or professionals with lots of technical expertise. When things went wrong – which they often did – they knew how to fix things. Nowadays, the average user is far less sophisticated, perhaps a 12-year-old girl or a grandfather. Most of them know about as much about fixing computer problems as the average computer nerd knows about repairing his car. What they want more than anything else is a computer that works all the time, with no glitches and no failures. Many users automatically compare their computer to their television set. Both are full of magical electronics and have big screens. Most users have an implicit model of a television set: (1) you buy the set; (2) you plug it in; (3) it works perfectly without any failures of any kind for the next 10 years. They expect that from the computer, and when they do not get it, they get frustrated. When computer experts tell them: "If God had wanted computers to work all the time, He wouldn't have invented ‘Reset’ buttons" they are not impressed.

For lack of a better definition of dependability, let us adopt this one: A device is said to be dependable if 99% of the users never experience any failures during the entire period they own the device. By this definition, virtually no computers are dependable, whereas most TVs, iPods, digital cameras, camcorders, etc. are. Techies are willing to forgive a computer that crashes once or twice a year; ordinary users are not. Home users aren't the only ones annoyed by the poor dependability of computers. Even in highly technical settings, the low dependability of computers is a problem. Companies like Google and Amazon, with hundreds of thousands of servers, experience many failures every day. They have learned to live with this, but they would really prefer systems that just worked all the time. Unfortunately, current software fails them.

The basic problem is that software contains bugs, and the more software there is, the more bugs there are. Various studies have shown that the number of bugs per thousand lines of code (KLoC) varies from 1 to 10 in large production systems. A really well-written piece of software might have 2 bugs per KLoC over time, but not fewer. An operating system with, say, 4 million lines of code is thus likely to have at least 8000 bugs. Not all are fatal, but some will be. A study at Stanford University showed that device drivers – which make up 70% of the code base of a typical operating system – have bug rates 3x to 7x higher than the rest of the system. Device drivers have higher bug rates because (1) they are more complicated and (2) they are inspected less. While many people study the scheduler, few look at printer drivers.

The Solution: Smaller Kernels

The solution to this problem is to move code out of the kernel, where it can do maximal damage, and put it into user-space processes, where bugs cannot cause system crashes. This is how Minix 3 is designed."

From http://www.linux-magazine.com/Issues/2009/99/Minix-3
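
To make the article's arithmetic concrete, here's a rough back-of-the-envelope version of its estimate. The 2 bugs/KLoC, 4 MLoC, 70%-driver, and 3x figures are the article's; the toy C program is just mine:

    /* Back-of-the-envelope bug estimate using the article's figures.
     * Illustrative only. */
    #include <stdio.h>

    int main(void) {
        double kloc = 4000.0;            /* ~4 million lines of code */
        double bugs_per_kloc = 2.0;      /* "really well-written" rate */
        double base_bugs = kloc * bugs_per_kloc;

        double driver_share = 0.70;      /* drivers: ~70% of the code base */
        double driver_multiplier = 3.0;  /* low end of the 3x-7x range */
        double weighted_bugs =
            kloc * (1.0 - driver_share) * bugs_per_kloc +
            kloc * driver_share * bugs_per_kloc * driver_multiplier;

        printf("flat estimate:            %.0f bugs\n", base_bugs);     /* 8000 */
        printf("driver-weighted estimate: %.0f bugs\n", weighted_bugs); /* 19200 */
        return 0;
    }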


I really do hope that users demand more reliability from their computers (and various computing devices). However, I believe that since the birth of the PC we've been training users to tolerate a much higher rate of failure, and a massive backlash is unlikely.

People have varying tolerance levels depending on what they're using. We have an insanely low tolerance level for jet failure (the safety checks and expense that go into air travel are extremely high) due to the public nature of the failures. We have a higher tolerance level for car failures, even though cars claim the lives of far more people every year. We have an extremely high tolerance level for personal computer failure.

I'd like to be wrong. Contrary to your statement, I find myself, as a techy, to be far more critical of computer failure than the average user. I will discontinue use of poorly written software much quicker than my non-techy family or friends.


The high tolerance for PC failure is practical and logical. Failure doesn't generally cost a whole lot compared to cars and jet planes, and the upside to being tolerant of failure is a greatly accelerated pace of development.

It's just another classic risk/reward tradeoff. End users tolerate more risk from computers in exchange for the benefits.


> the average user is far less sophisticated, perhaps a 12-year-old girl

Argh. The author just had to specify that the unsophisticated 12-year-old is a girl. Because, hey, a 12-year-old boy might be a larval hacker, right?

> or a grandfather

Old people are another category of people who are hopelessly "unlike" the presumed Linux Magazine reader. They certainly aren't interested in microkernels, but let's make sure they feel suitably old and marginalized if they ever try to change that.


A microkernel doesn't do much to solve this problem. "A device is said to be dependable if 99% of the users never experience any failures". Users don't care that the kernel didn't crash: if the driver crashes, the user still experiences a device failure, since a device without a functional driver is not functional.


As mentioned by DennisP (but I can't reply to his post for some reason), one of the design goals of Minix is to have drivers seamlessly restarted so the user can continue uninterrupted.


The notion that drivers can just seamlessly restart is as much a fairy tale as the bug free monolithic kernel. What does your filesystem do when the disk driver crashes? What does your app do? You're fucked all the way up the stack. Complex operations are going to smear their state across a variety of modules. Net result: you only have one big module.


I guess that magic pixie dust must be a secret ingredient in HP's NonStop* architecture (runs air traffic control, stock exchanges, etc.)? I suggest actually taking a look at Minix 3, and other fault tolerant operating systems. Disk drivers infecting filesystems is a disease of the monolithic PC world.

* I have a friend who was an engineer for Tandem (now HP) in the 90's. They tested their servers in a demonstration for the government/defense department by taking them to a shooting range and running a benchmark test while firing indiscriminately with automatic weaponry. The story goes that the transaction processing declined precipitously as chips, blades, and motherboards were shattered. It went from millions, to thousands, to just a few dozen transactions per second with no data loss when a bullet clipped the serial jack they were using to log the benchmark. They got a very large order afterwards from the government/military.

I don't know if it actually happened (a Google search doesn't show anything), but having been shown by him the redundancy built into all levels of their architecture, and heard the stories about real failures in exchanges, air traffic control, and other critical never-turn-off deployments they do, I believe it could have. Reliable computing is possible.


Whatever magic pixie dust is in minix, I'm pretty sure it's not going to suddenly make redundant CPUs sprout up in my laptop. You're talking about something else entirely. I could just as easily say that if half of Google's data centers were nuked, they could still serve searches, just slower, and therefore prove linux is utterly reliable.

Anyway, if you like anecdotes, I saw with my very own eyes the network cable between two OpenBSD firewalls chopped with an axe to no detrimental effect. So there. Monolithic kernels are superior to motherfucking axes.


The less-destructive version of this demonstration when I first encountered one in the early 80s was for someone to walk up to the machine, open a cabinet, and randomly pull out a (coffee table book sized) card. No magic smoke, no screams of anguish, no sudden chatter from the console printing messages of lament from the operating system.


I managed Tandem NonStops and also Stratus FX machines. Multiple redundant hardware paths, mirrored RAM, etc.

God they were awful. The conservatism of the design meant that although the hardware was fine and redundant and reliable, the software was crap: user-hostile and buggy.

They would have been far better off making reliable clusters rather than making a machine internally redundant.

And expensive. Something around a million dollars for a 75 MHz machine (Stratus) in 1997.


I agree with tedunangst; it's really a game of all or nothing. I cannot think of any apps that achieve high stability through systematic fault recovery. Fault recovery is nice in itself, but it is never a good strategy for stability. Good code quality is.


Minix3 monitors the drivers and restarts them if they crash.

http://www.minix3.org/other/reliability.html
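
The component that does this in Minix 3 is the reincarnation server: a small trusted process that watches the drivers and respawns them when they die. This isn't Minix code, just a minimal POSIX-style sketch of the general pattern (the driver path is invented):

    /* Minimal watchdog sketch: keep a user-space "driver" process alive.
     * Not the Minix 3 reincarnation server, just the general pattern;
     * the driver path below is invented. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static pid_t start_driver(const char *path) {
        pid_t pid = fork();
        if (pid == 0) {
            execl(path, path, (char *)NULL);  /* driver runs as its own process */
            _exit(127);                       /* exec failed */
        }
        return pid;
    }

    int main(void) {
        const char *driver = "/service/disk_driver";  /* hypothetical binary */
        pid_t pid = start_driver(driver);

        for (;;) {
            int status;
            if (waitpid(pid, &status, 0) == pid) {    /* block until it dies */
                fprintf(stderr, "driver exited (status %d), restarting\n", status);
                pid = start_driver(driver);           /* respawn; system keeps going */
            }
        }
    }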


If your goal is whole-system reliability, there's way more low-hanging fruit than the kernel. In the last five years I've had two kernel panics; it's so rare that I remember both times it happened. But hardware failures (at least one of some kind a year) and application crashes (once a week or more) happen all the time. Hardware is a complex beast, but many application crashes could be significantly reduced by other low-hanging fruit (say, better crash reporting for developers).


But if your goal is absolute, near-100% reliability, you will eventually have to do something about the kernel.

Also, if you look at the design of Minix 3, they do address many of the concerns you mention. There's an infrastructure for checkpointing applications, and a "reincarnation" server that acts as a configurable watchdog service for the entire software stack, from device drivers to web servers.

The real goal of the microkernel architecture is to make these watchdog services as reliable as possible (there are only a few thousand lines of heavily audited code running beneath them). That, combined with user-space device drivers (so faulty hardware or driver code doesn't bring down the whole system), would address most of your concerns.

No surprise, that's the path they are headed down. I even see that this release includes a "block device fault injection driver" for simulating hardware failures.
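
Fault injection at that layer can be as simple as wrapping the real block reads and failing a small fraction of them. A toy sketch of the idea (names invented, not the actual Minix driver):

    /* Toy block-device fault injection: wrap the real read path and fail
     * a configurable fraction of requests with EIO, so the layers above
     * (filesystem, recovery service) can be tested against "hardware"
     * failures. Names invented; not the Minix 3 driver. */
    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    static double fail_rate = 0.01;   /* simulate failure on ~1% of reads */

    /* Stand-in for the real driver's read path. */
    static ssize_t real_block_read(int fd, void *buf, size_t n, off_t off) {
        return pread(fd, buf, n, off);
    }

    /* Injecting wrapper: callers occasionally see I/O errors. */
    ssize_t faulty_block_read(int fd, void *buf, size_t n, off_t off) {
        if ((double)rand() / RAND_MAX < fail_rate) {
            errno = EIO;
            return -1;                /* injected "hardware" failure */
        }
        return real_block_read(fd, buf, n, off);
    }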


Is Minix competing with things like VxWorks and QNX? I did a cursory Google/Wikipedia search for Minix uses and the results were all about education.


Disclaimer: I am not an official member of the Minix team. However, given my understanding, it certainly looks like Minix is going to compete with QNX, VxWorks, and similar OSes used in embedded systems. Visit http://wiki.minix3.org/en/MinixGoals and http://wiki.minix3.org/en/MinixRoadmap for details.



