
> There is just no substitute for understanding how everything works.

Great line, and one I strongly agree with. I love the tooling we have available today (Terraform, Ansible, etc.), but the experience I gained by keeping bare-metal servers alive and happy with nothing more than a shell has undoubtedly made me a much better admin.



Agree 100%. I'm an SRE team of one and I'm looking for an additional team member. I've interviewed about a dozen people so far and one thing I've noticed is that a lot of young engineers do not know the basics. They really can't tell me how to log into a Linux box and troubleshoot. I think that's a shame. You really lose a lot of insight if you don't understand how the underlying pieces work.


This is an interesting side effect of having awesome tooling. When a service is taking too much memory and you can see directly on a Grafana dashboard that it keeps getting OOM-killed because of too many goroutines, you can avoid having to go into a box to debug at all.

When I used to be on call, I remember very distinctly an incident where there were a ton of container restarts on a set of nodes (this was on K8s, so it was not immediately obvious). The operations team mentioned that the logs (K8s logs here) contained a lot of errors along the lines of "Discovery failing, restarting".

I asked whether we could get a tcpdump to see what was happening, and the answer I got was "There is no tcpdump in our container." Meaning we didn't have the binary in the actual container image.

For anyone with any knowledge of Kubernetes, you know you can simply SSH to the machine running the pod, find the port the pod is running on, and run tcpdump there. However, the fact that nobody tied all of this together to get a tcpdump caused the issue to persist for an extra three hours, when the root cause was a flapping, misconfigured NIC.
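The node-level capture described above looks roughly like this. The pod name, namespace, and SSH access to nodes are all assumptions; your cluster's specifics will differ.

```shell
# Hypothetical pod and namespace; substitute your own.
POD=myservice-7d4b9c-abcde
NS=prod

# Find the node the pod is scheduled on, and the pod's IP.
NODE=$(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.spec.nodeName}')
POD_IP=$(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.podIP}')

# SSH to the node (tcpdump lives there even when the container image
# lacks it) and capture the pod's traffic by filtering on its IP.
ssh "$NODE" sudo tcpdump -ni any host "$POD_IP" -w /tmp/pod.pcap
```

On recent clusters, `kubectl debug` with an ephemeral container that includes tcpdump is an alternative that avoids SSHing to the node at all.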

Without understanding how your systems actually work under the hood, you will be running them with your hands tied. It's not good enough to understand one abstraction; you need to be prepared at any time to peel back the layers and get your hands dirty in a different one.


Even worse, I joined a new company in a senior technical role, and it's seen as a negative that I prefer ssh/strace/tcpdump to debug problems.
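For what it's worth, the workflow being dismissed here is usually just a couple of commands. The PID and port below are hypothetical placeholders.

```shell
# Attach to a running process and trace only its network-related
# system calls, following forked children and threads.
sudo strace -f -e trace=network -p 1234

# Capture the same service's traffic on the port it serves.
sudo tcpdump -ni any tcp port 8080
```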


It arguably is.

A good SRE needs to understand systems, as in the automation of n computers. Focusing on and preferring single-system tools that require manual action points to immaturity in dealing with complex distributed systems.

However, it's a common problem and most of the folks buying complicated distributed tracing systems don't have particular skills in using them either, so your skills are valuable, even if there could be better ways to do it.

Similarly, if you focus on hiring SREs who know shell commands well, you might lose more pertinent skills, such as knowing what terraform is actually doing, general programming ability, CI/CD, and an understanding of cloud APIs.

Horses for courses; the more we know the better. Look for both sets of skills in your teams and cross train as much as possible.


Arguably those who demonstrate understanding of low level fundamentals have the drive to understand as many layers as possible.

Capturing AWS API calls through sslproxy not only implies you know what terraform is doing; it also gives you a higher probability of solving difficult problems.
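One sketch of that technique, using mitmproxy as the intercepting proxy (standing in for sslproxy). The port and certificate path are illustrative; terraform's embedded AWS SDK should honor the standard proxy and CA-bundle environment variables, but verify against your setup.

```shell
# Start a TLS-intercepting proxy that dumps every request it sees.
mitmdump --listen-port 8080 &

# Route terraform's HTTPS traffic through the proxy, and trust the
# proxy's CA so TLS interception succeeds (path is mitmproxy's default).
export HTTPS_PROXY=http://127.0.0.1:8080
export AWS_CA_BUNDLE=~/.mitmproxy/mitmproxy-ca-cert.pem

# Every AWS API call terraform makes now shows up in mitmdump's output.
terraform plan
```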

All code boils down to an execution layer and having inspection ability at that layer will always be valuable.



