
Having worked mostly in cowboy organizations, I just don't see how you can fix tricky network issues without access to the machines. And once you need access sometimes, you might as well use it always, because it's certainly handy.

Application issues usually aren't so hard to replicate, but even there, sometimes it's easiest to figure out how to replicate the issue in a test environment by debugging on production. Once the issue is understood, then you can validate the changes in a test environment and push to production.

All that to say, if you don't have access to an equivalent of tcpdump, netstat, top, etc., it's going to be a lot harder to debug your system. And if you have a system that needs the performance of a unikernel, you're going to experience tricky problems. If you just want a simple system, it's better IMHO to just run a minimal daemon system: sshd, getty (if you have a console), ntpd, syslogd, crond, a local caching DNS server (I like unbound), a monitoring program, and your application. A serviceable unikernel is going to need an equivalent to most of those plus the debugging tools, so maybe not as simple as one might hope.
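For what it's worth, that list is small enough to stand up with a handful of service enables. Unit names vary by distro and assume the packages are already installed; "node_exporter" and "myapp" below are just stand-ins for whatever monitoring program and application you run:

    # RHEL/Fedora-style unit names; adjust for your distro (ssh vs sshd,
    # chronyd vs ntpd, rsyslog vs syslog, cron vs crond, and so on)
    systemctl enable --now sshd ntpd rsyslog crond unbound node_exporter myapp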



> Having worked mostly in cowboy organizations, I just don't see how you can fix tricky network issues without access to the machines. And once you need access sometimes, you might as well use it always, because it's certainly handy.

The big, abstract idea is that you don't worry about individual instances (be they VMs or container instances or unikernel apps or etc), but you worry about the process for stamping out those instances. If you have a process for reproducibly creating machines, then you can test that process in lower environments (stamp out some instances, validate them, and destroy them) before promoting that process to production.

When you have production issues, your first line of defense should be comprehensive logging and monitoring, but failing that you try to reproduce issues in lower environments, and failing that you can deploy a version of your app with monitoring tools baked in (e.g., instead of a scratch base image, you deploy a version with a full ubuntu base image).

If you're running into issues so often that you feel the need to deploy your debug tools to production all the time, then you're not appropriately investing in your logging/monitoring (ideally every time you need a debug tool in production, you should go back and add equivalent instrumentation so you don't need that debug tool any more). This is the theory, anyway. Practice involves a lot more nuance and judgement.
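Concretely, the "tools baked in" variant can just be a second image built from the same source. The image names and the Dockerfile.debug file here are hypothetical, just to sketch the idea:

    # Normal deploy: minimal image, e.g. FROM scratch with just the binary
    docker build -t myapp:1.4.2 .
    # Debug deploy: same app on a full ubuntu base with tcpdump/strace/etc.
    # installed, promoted to production only while chasing the issue
    docker build -f Dockerfile.debug -t myapp:1.4.2-debug .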


Yeah, my test environment is never going to match the diversity of the real world. I just don't have the equipment or imagination to ad-hoc create the garbage networks the real world has.
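You can fake some of it with netem (interface name and numbers below are made up), but that only covers the failure modes you already thought of:

    # Add delay with jitter, loss, and reordering to outgoing traffic on eth0
    tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1% reorder 25% 50%
    # Remove it again
    tc qdisc del dev eth0 root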

Running a new, clean instance doesn't help most networking problems that are an interaction between (usually) kernel bugs in different systems.

How do you figure out your kernel is sending way too many packets out when it gets an ICMP MTU exceeded with the MTU set to the current size, without getting lucky with tcpdump? That was a FreeBSD kernel bug/oversight (since fixed) interacting with a Linux routing bug (already fixed when I found it, but not deployed) where large receive offload resulted in a packet too large to forward even though the on-the-wire packets were properly sized.
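The "getting lucky" part is basically sitting on a capture like this and noticing the advertised MTU already matches what you're sending (interface name is whatever the box uses):

    # Watch for ICMP "fragmentation needed" (type 3, code 4); -v prints the
    # next-hop MTU the router is advertising
    tcpdump -ni eth0 -v 'icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4'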

Or that time when syncookies were broken and a very short connection could get restarted but not properly synchronized, so both sides send challenge ACKs for every packet they receive until one side times out (up to line rate if the other party has good connectivity, or is localhost). That one needed luck with tcpdump too.
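Spotting it is less about a clever filter and more about noticing the rate; a flood of ACK-only segments ping-ponging between one host/port pair (addresses below are placeholders):

    # ACK-only segments to/from the suspect peer; in a challenge-ACK storm
    # this scrolls at close to line rate instead of the occasional pure ACK
    tcpdump -ni eth0 'host 192.0.2.10 and port 443 and tcp[tcpflags] == tcp-ack'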

Or when IP fragment processing had a near-infinite loop if you had the right timing, resulting in that RX queue getting hung for hours. Dropping into the kernel debugger made it very quick to debug.

It's quite hard to diagnose MTU issues in general without tcpdump, too. No sane person is going to log all packets for a machine that's pushing 2-20 Gbps, or the retention is going to be too short to be useful.
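Even a bounded ring-buffer capture doesn't really change that: at 20 Gbps you're writing on the order of 2.5 GB/s, so a ring like the one below (roughly 100 GB on disk) holds well under a minute of full-size packets, and a short snaplen only buys a constant factor (sizes and paths are arbitrary):

    # Rotate through 100 files of ~1000 MB each, overwriting the oldest
    tcpdump -ni eth0 -C 1000 -W 100 -w /var/tmp/ring.pcap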

Tcpdump could maybe be handled by port mirroring on a switch, but a recurring, unknown event that triggered a partial network outage took seconds to diagnose with the tools on production, and would have taken an unknowable amount of time otherwise. Upstream fixed it by accident, and few people would have experienced it because it was unlikely to occur with default settings.

I fix a lot of problems with tcpdump, so all of my problems look tcpdump-shaped.


Just as an aside, both of these use-cases, the MTU being the wrong size (GCP, which I mentioned in the article) and adding syn cookies, are direct problems that we've had to diagnose and fix in Nanos. The latter because the TCP/IP stack we chose didn't have syn cookies to begin with.

We were able to use tcpdump in that situation as well, to test and verify the problem was solved, and it didn't require tcpdump to be in the guest.

All of these problems I wouldn't expect an average user to fix regardless though.


> All of these problems I wouldn't expect an average user to fix regardless though.

I wouldn't expect an average user to run a unikernel.

I'd expect those running unikernels in production to be people who really need the performance, and those who really need the performance are going to be running up against weird issues all the time, and need to have at least a couple people on staff who can solve weird issues. When you have that kind of scale, one in a million events are frequent.

I also don't necessarily understand the desire to run a unikernel in a hypervisor; might it be better to run a regular kernel on bare metal, or the unikernel on bare metal? (Yeah, it's harder to develop and probably debug the unikernel that way, but performance!)

Maybe isolation is a bigger deal to some people than me.



