Sometimes Kill -9 Isn't Enough (bravenewgeek.com)
119 points by tylertreat on Nov 13, 2014 | 18 comments



I was going to make a pun on the title ("... because uninterruptible sleep is a bitch"), but the article doesn't actually talk about that.

Going back to the topic, there are great points there. I remember discovering "tc qdisc" and playing with it. Really nice tool.
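For anyone who hasn't played with it, a minimal netem sketch (eth0 and the numbers here are placeholders, adjust for your setup):

    # add 200ms delay with 40ms jitter and 1% packet loss
    tc qdisc add dev eth0 root netem delay 200ms 40ms loss 1%

    # inspect, then remove when done
    tc qdisc show dev eth0
    tc qdisc del dev eth0 root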

But another thing to learn, perhaps, is to try to avoid the gray zone by going to either the "black zone" = dead, or the "white zone" = working fine. That is, if a node/process/VM/disk starts showing signs of failure above a threshold, something else should kill/disable or restart it.

Think of it as trying to go to stable known states: "Machine is up, running, serving data, etc.", "Machine is taken offline". If you can, try to avoid the in-between "gray states" -- "Some processes are working, some are not", "swap is full, memory is running out, the OOM killer is going to town, some services kinda work", and so on. There are just too many degrees of freedom and it is hard to test against all of them. Obviously some things like network issues cannot be fixed with a simple restart, so those have to be tested. A rough watchdog sketch of the kill-above-a-threshold idea is below (the endpoint, pidfile, and thresholds are all made up for illustration):

    #!/bin/sh
    # naive watchdog: 3 consecutive failed health checks => kill and restart
    fails=0
    while sleep 10; do
        if curl -fsS --max-time 5 http://localhost:8080/health >/dev/null; then
            fails=0
        else
            fails=$((fails + 1))
        fi
        if [ "$fails" -ge 3 ]; then
            kill -9 "$(cat /var/run/myservice.pid)"  # out of the gray zone
            service myservice start                  # back to a known state
            fails=0
        fi
    done
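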


This is a design value in Erlang - fail the process, let the supervisor restart it, rather than handling a lot of specific edge case failures. I haven't done much Erlang programming for a while (~decade), but it was one of the things I really appreciated about it.


I totally thought this was going to talk about non-interruptible process states. Like the dreaded D. D is for "your reboot will fail, hope you have iLO".


Oh man, I hate that.

I've dreamed of patching the kernel and writing two utilities - twim (terminate without mercy) and uwep (unmount with extreme prejudice) that simply remove a process along with all threads, or destroy a mountpoint and drop all associated resources (all filehandles become closed, etc.). Lack of time has mostly stopped me from attempting it, and I'm quite sure it won't be at all trivial.
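In the meantime, the closest existing approximations I know of (none of which help with true D-state processes, which is exactly why twim would need kernel support):

    fuser -km /mnt/stuck   # SIGKILL every process using the mountpoint
    umount -l /mnt/stuck   # lazy unmount: detach now, clean up references later
    umount -f /mnt/stuck   # force unmount (mostly useful for unreachable NFS)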


Yeah... Not sure if the root cause was ever determined, but at my previous job we had issues with Xen guests shutting down where the blkback device would go into D state and never quit. This would prevent the VM from restarting because the LV was busy. lvm commands would freeze. And of course the system would end up needing a hard reboot because the lvm teardown scripts would not complete on shutdown due to the busy device. Good times :|


10 years ago I'd have Linux sound drivers that wouldn't respond to kill -9. If I unplugged or plugged in a device while sound was playing, I'd need to reboot if I wanted sound again.


"Comcast" is pretty hilarious. https://github.com/tylertreat/Comcast


Note that OS X has an Apple-provided Network Link Conditioner to configure bandwidth/delay/drop. Even better, it's built into iOS devices set up for development.


If you'd like to simulate network crappiness on OS X, you can use the Network Link Conditioner from Apple themselves: http://nshipster.com/network-link-conditioner/

I was very impressed with its feature-set (for what it is). On our team, we use it to see how our iOS app will react to severe network problems (via testing in the simulator, mostly, though it's also available on iOS devices themselves as explained in the above article).


This is the "I don't know how my network works, so let's throw a wrench into the works and see what happens, fix it, rinse, repeat" form of network and systems engineering. It's certainly useful at various points in tuning performance, but it doesn't replace actually designing your system to resist these problems to begin with.

Even if you introduce these network performance issues, the results are meaningless if you don't have instrumentation ready to capture metrics on the results throughout the network/systems. Everyone wants to write about what happened when they partitioned their network. But notice how nobody writes about the netflows, the taps, the service monitors, the interface stats, the app performance stats, the query run times, host connection state stats, miscellaneous network error stats, transaction benchmark stats, and hundreds of other data sources that are required to analyze the resulting network congestion.

To me it's much more vital that I can correlate events to track down an issue in real-time. You will never be able to identify all possible failure types by making random things fail, but you can improve the process by which you identify a random problem and fix it quickly.



You have to be careful using iptables DROP rules on the OUTPUT chain, as this manifests itself (at least on our systems) as failed send() socket calls (which are often retried by the application), rather than true packet loss. Netem tends to work as expected.
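For comparison, roughly (eth0, the port, and the percentages are placeholders):

    # iptables DROP on OUTPUT: locally generated packets that get dropped
    # tend to surface as send() errors (e.g. EPERM) to the application
    iptables -A OUTPUT -p tcp --dport 80 -m statistic --mode random --probability 0.2 -j DROP

    # netem: packets leave the stack and silently vanish, like real wire loss
    tc qdisc add dev eth0 root netem loss 20%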


This focuses mostly on simulating unreliable networking. Is there a tool, perhaps some LD_PRELOAD wrapper, that can simulate unreliable everything? I'm talking memory errors, disks going away, fake high I/O load, etc.?

I once wrote a library for Python that injected itself into the main modules (os, sys, etc.) and generated random failures all over the place. It worked very well for writing reliable applications, but it only worked for pure Python code. I don't own the code, so I can't open source it, unfortunately.



I recognise those commands ...

http://stackoverflow.com/questions/614795/simulate-delayed-a...

I am still trying to work out how not to knobble my DB connection when trying to simulate client errors on a single dev machine.


You can specify rules for a single port, I believe, using ipfw or equivalent.
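On Linux, a tc-based sketch of that idea (port 8080 standing in for the app under test; the DB port simply never matches the filter, so that connection is untouched):

    # prio qdisc as root, netem attached to band 3
    tc qdisc add dev eth0 root handle 1: prio
    tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms loss 5%

    # only traffic to port 8080 is steered into the netem band
    tc filter add dev eth0 protocol ip parent 1:0 u32 match ip dport 8080 0xffff flowid 1:3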


Brings back horrible memories of writing tc scripts to simulate VSAT and rural DSL back in the bad old days. We bundled them up on a Soekris box and called it the "DSLow" (as in DSL-oh) box.
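The basic recipe for that kind of link emulation, per the netem docs, chains a netem delay with tbf rate shaping (the numbers here are illustrative, not real VSAT parameters):

    # ~300ms one-way delay, then shape to 512kbit
    tc qdisc add dev eth0 root handle 1:0 netem delay 300ms
    tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 512kbit buffer 1600 limit 3000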


I find this article offensive ;)



