Was going to make a pun on the title "... because uninterruptible sleep is a bitch", but it doesn't talk about that.
Going back to the topic, there are great points there. I remember discovering "tc qdisc" and playing with it. Really nice tool.
But another thing to learn, perhaps, is to try to avoid the gray zone by going to either the "black zone" = dead, or the "white zone" = working fine. That is, if a node/process/VM/disk starts showing signs of failure above a threshold, something else should kill/disable it or restart it.
Think of it as trying to go to stable known states: "Machine is up, running, serving data, etc", "Machine is taken offline". If you can, try to avoid in-between "gray states" like "Some processes are working, some are not", "swap is full and running out of memory, oomkiller is going to town, some services kinda work" and so on. There are just too many degrees of freedom and it is hard to test against all of them. Obviously some things, like network issues, cannot be fixed with a simple restart, so those have to be tested.
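To make that concrete, here is a rough Python sketch of the idea, not anything from the article. The names (NodeWatchdog, max_failures) and the threshold of 3 are made up for illustration, and the "offline" step only flips a flag where a real system would kill, fence, or restart the node.

```python
from dataclasses import dataclass


@dataclass
class NodeWatchdog:
    """Collapse a node into one of two known states: "up" or "offline"."""
    max_failures: int = 3   # hypothetical threshold, pick your own
    failures: int = 0       # consecutive failed health checks
    state: str = "up"

    def record(self, healthy: bool) -> str:
        """Feed in one health-check result and return the resulting state."""
        if self.state == "offline":
            return self.state           # stays offline until explicitly restarted
        self.failures = 0 if healthy else self.failures + 1
        if self.failures >= self.max_failures:
            self.state = "offline"      # real code would kill/disable/restart here
        return self.state

    def restart(self) -> str:
        """Bring the node back into the known-good state with a clean slate."""
        self.failures = 0
        self.state = "up"
        return self.state
```

The point is that the node never sits in a half-working state: a few bad checks in a row and it is out, and it only comes back through an explicit restart.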
This is a design value in Erlang - fail the process, let the supervisor restart it, rather than handling a lot of specific edge case failures. I haven't done much Erlang programming for a while (~decade), but it was one of the things I really appreciated about it.
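For anyone who hasn't seen the pattern, a very loose imitation in Python might look like the sketch below. This is not Erlang/OTP and not how real supervisors are implemented. The function name supervise and the max_restarts limit are invented for the example.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); if it raises, start it again from scratch.

    Returns (result, number_of_restarts). After max_restarts failures
    the last exception is re-raised so a higher level can decide.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise           # give up instead of handling the edge case here
            restarts += 1       # no recovery logic, just a fresh start
```

The worker itself contains no error handling for weird edge cases. It either finishes or dies, and the restart policy lives in one place.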
I totally thought this was going to talk about non-interruptible process states. Like the dreaded D. D is for "your reboot will fail, hope you have iLO".
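If you want to see which processes are currently stuck in D, a small Python sketch like this should work on Linux. It assumes the usual /proc/<pid>/stat layout of "pid (comm) state ...", and the helper names parse_stat and d_state_processes are just mine.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    left = line.index("(")
    right = line.rindex(")")    # comm may itself contain spaces or ')'
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (D)."""
    found = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue            # skip non-process entries like /proc/meminfo
        try:
            with open(os.path.join(proc, name, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue            # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

A process showing up briefly in this list is normal during I/O. The trouble is when the same pid is still there minutes later.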
I've dreamed of patching the kernel and writing two utilities - twim (terminate without mercy) and uwep (unmount with extreme prejudice) that simply remove a process along with all threads, or destroy a mountpoint and drop all associated resources (all filehandles become closed, etc.). Lack of time has mostly stopped me from attempting it, and I'm quite sure it won't be at all trivial.
Yeah. Not sure if the root cause was ever determined, but at my previous job we had issues with Xen guests shutting down where the blkback device would go into D state and never quit. This would prevent the VM from starting because the LV was busy. LVM commands would freeze. And of course the system would end up needing a hard reboot because the LVM teardown scripts would not complete on shutdown due to the busy device. Good times :|
10 years ago I had Linux sound drivers that wouldn't respond to kill -9. If I unplugged or plugged in while sound was playing, I'd need to reboot if I wanted sound again.