Hugops to the people working on this for the last 31+ hours.
Running incidents of this significance is hard and draining and requires a lot of effort; something going on for this long must be very difficult for all involved.
Running software in an airgapped environment is difficult, but the hardest parts are the install, packaging and shipping updates. I have used https://zarf.dev/ to do this for a government client, and it was an amazing experience. I highly recommend it. K8s seems heavy, but if you want to run datastores with backups (k8s operators) or highly customised environments, and automate all of that instead of writing loads of bash and custom code, it shines.
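To give a rough idea of what that looks like: a Zarf package is a YAML definition listing the charts and images to bundle, which Zarf builds into a single archive you carry across the airgap and deploy on the other side. A minimal sketch (the package name and versions are made up, so treat it as illustrative rather than copy-paste):

    # zarf.yaml - declares what gets bundled into the offline package
    kind: ZarfPackageConfig
    metadata:
      name: example-app            # hypothetical package name
    components:
      - name: example-app
        required: true
        charts:
          - name: podinfo
            url: https://stefanprodan.github.io/podinfo
            version: 6.5.4         # illustrative chart version
            namespace: podinfo
        images:
          # images are pulled at build time and shipped inside the archive
          - ghcr.io/stefanprodan/podinfo:6.5.4

    # on the connected side
    zarf package create .
    # on the airgapped side (exact filename depends on name/arch/version)
    zarf package deploy zarf-package-example-app-amd64.tar.zst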
If you carry on reading, it's quite obvious they misconfigured a service and routed production traffic to that instead of the correct service, and that the system used to do this was built in 2018 and is considered legacy (probably because it makes it easy to deploy bad configs). Given that, I wouldn't say the summary is "inscrutable corporatese", whatever that is.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Joe Shmo committed the wrong config file to production. Innocent mistake. Sally caught it in 30 seconds. We were back up inside 2 minutes. Sent Joe to the margarita shop to recover his shattered nerves. Kid deserves a raise. Etc.
I have run ELK, Grafana + Prom, Grafana + Thanos/Cortex, New Relic and all of the more traditional monitoring/observability products. In the last few years I have been running full observability stacks via either the Grafana LGTM stack or Datadog, at reasonable scale and complexity. Ultimately you want one tool that can alert you off a metric, present you some traces, and drill down into logs, all the way down the stack.
I have found Datadog to be, hands down, the best developer experience from the get-go; the way it glues its mostly decent products together is unparalleled compared to the alternatives (Grafana Cloud/LGTM). I usually say that if you're a small to medium scale business it just makes sense, IF you understand the product and configure it correctly, which is reasonably easy.
The seamless integration between tracing, logging and metrics in the platform, which you can then easily combine with alerts, is great. However, it's easy to misconfigure it and spend a lot of money on seemingly nothing. If you do not implement tracing and structured logs (at the right volume and level) with trace/span ids etc. all the way through your services, it's hard to see the value and it seems expensive. It requires good knowledge and configuration of the product to make it pay off.
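For what it's worth, the log/trace correlation part doesn't have to be complicated; the main thing is getting the trace and span ids into every structured log line so the platform can join them up. A rough Python sketch of the idea using the OpenTelemetry SDK (service and field names are just for illustration, nothing Datadog-specific):

    # Sketch: attach the current trace/span ids to structured log output so a
    # log line can be joined back to its trace in the observability backend.
    import json
    import logging

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer("checkout-service")   # hypothetical service name

    log = logging.getLogger("checkout-service")
    logging.basicConfig(level=logging.INFO)

    def log_with_trace(message, **fields):
        ctx = trace.get_current_span().get_span_context()
        fields.update(
            message=message,
            trace_id=format(ctx.trace_id, "032x"),  # 128-bit id, hex encoded
            span_id=format(ctx.span_id, "016x"),
        )
        log.info(json.dumps(fields))

    with tracer.start_as_current_span("charge-card"):
        log_with_trace("payment authorised", amount_cents=4200)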
The rest of the product features are generally good; for example, their security suite is a good entry level to cloud security monitoring and SIEM too.
However, when you get to a certain scale, the cost of APM and infrastructure hosts in Datadog can become somewhat prohibitive. Datadog's custom metrics pricing is also expensive, its query language does not quite match the power of PromQL, and you start to find yourself needing that power to debug issues. At that point the self-hosted LGTM stack starts to make sense; however, it involves a lot more education for end users, both in integration (a little less now that OTel is popular) and in querying/building dashboards etc., plus running it yourself. The Grafana Cloud platform is more attractive on that front, though.
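As a concrete example of where PromQL tends to win: a per-service error-rate ratio is a couple of lines (metric and label names here are hypothetical):

    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m]))

That kind of ad-hoc ratio is exactly the sort of query that is harder to express in Datadog's metric query language.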
My experience mirrors yours wrt Datadog. It's incredible value at low scale, you get a full robust system with great devex for pennies. Once you hit that tipping point though, you are locked in pretty hardcore. Datadog snakes its way far into your codebase, with all the custom tracing and stuff like that. Migrating off of it is a very expensive endeavor, which is probably one of the reasons why they are such a money printing operation.
I think "medium scale" is probably more appropriate. For a $3M-$5M revenue SaaS you're still paying $50k+/year. That's not nothing for a small owner or a PE-backed SaaS company that is focused on profits/EBITDA.
Yeah, the secret sauce of the dd libs was/is addictive for sure! I think it's perhaps better now that you can just use OTel for custom traces and the OTel contrib libs for auto-instrumentation, and send that to the dd agent? I have not yet tried it because I suspected labels and other things might be named differently than in the DD auto-instrumentation/contrib packages, but I don't think the gap is as big now?
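From what I can tell it should just look like plain OTLP export pointed at the agent, something along these lines (untested; assumes a local Datadog Agent with OTLP ingest enabled on the default gRPC port, and the service name is made up):

    # Sketch: standard OTel SDK setup exporting over OTLP to a local agent.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(
        resource=Resource.create({"service.name": "checkout-service"})
    )
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
        )
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("charge-card"):
        pass  # application work happens here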
One thing that always surprises me is that people haven't made more of a fuss about Docker for Mac. By default on install it shares the whole hard disk (unless that's changed), meaning without sudo you can get privileged access to the whole filesystem. I scope it down to my user folder, but the defaults are dangerous.
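It's easy to check what a container can see under the default shared paths; any container can bind-mount them and read whatever your user account can read, no sudo involved, e.g.:

    docker run --rm -v /Users:/host alpine ls /host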
Do you also think running any regular software is "dangerous"? Because that gets to access your disk as well. Docker is not for security isolation, it is for distributing apps so they'll even run on your Mac.
While your point about Docker’s primary purpose is valid, containerization is commonly used for security isolation as well. With proper configuration, it can be very useful towards this end.
Can you suggest any preferred alternative methods of isolation that offer similar efficacy and ease of use for quickly running complete software systems made by an unknown/untrusted actor?
> With proper configuration, it can be very useful towards this end.
It can. I think it's fair to assume that the standard developer setup to let them be productive is not this proper configuration.
> Can you suggest any preferred alternative methods of isolation that offer similar efficacy and ease of use for quickly running complete software systems made by an unknown/untrusted actor?
No. It's a hard problem! If it was easily solved we wouldn't be seeing all this development surrounding e.g. WebAssembly
Docker has had security and isolation features since it was competing with LXC on who glued cgroups and namespaces together better, and it has been discussed in those terms the whole time.
While I agree that Docker as written isn't good at security, your post has big "they're holding the iPhone wrong!" vibes, and seemingly ignores the historic reasons that people would think it provides security.
> your post has big “they’re holding the iPhone wrong!” vibes
More like "it just isn't meant to be used for that". At least not in the default configuration, and that's fine!
> seemingly ignores the historic reasons that people would think it provides security
I've been using Docker since it was announced. People have always been very clear that Docker is not a security boundary, at least not with its default configuration.
I think your point is valid. Docker was indeed all about developer productivity in the beginning, and it's up to the infrastructure operator to lock it down.
The Kubernetes provider and kubectl work, but it's not the nicest way of making changes. It's slow, quite clunky, and not particularly intuitive. If you're just getting started and you know Terraform, it's OK though. It is useful for bootstrapping GitOps tools like Argo or FluxCD.
Helm diff will show you a diff similar to Terraform's. Running Helmfile in CD isn't a bad move; it's really simple, and it's a pattern that is easy to grok for any engineer. I think this is still a valid approach in a simple setup; it's what some people call "CD OPS". It's a push model instead of pull, and there are downsides, but it's not the end of the world.
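For anyone who hasn't seen it, the whole setup is small: a helmfile.yaml pins repositories, releases and versions, and the CD job just runs a diff and an apply. A sketch (chart version and values file are illustrative):

    # helmfile.yaml - pins the releases this cluster should be running
    repositories:
      - name: grafana
        url: https://grafana.github.io/helm-charts

    releases:
      - name: loki
        namespace: monitoring
        chart: grafana/loki
        version: 6.6.2              # illustrative version pin
        values:
          - values/loki.yaml        # hypothetical values file

    # in CD:
    helmfile diff    # shows what would change, similar to a terraform plan
    helmfile apply   # upgrades only the releases whose diff is non-empty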
Ultimately, at scale, I think GitOps tooling like Flux and ArgoCD offers some of the nicest patterns, especially Flux's support for OCI artifacts as a source of truth. However, you will then venture into the realm of Kustomize and much more complex tooling and concepts, which is not always worth doing.
Rotating out nodes during an upgrade is slow and potentially disruptive; however, your systems should be built to handle this, and it is a good way of forcing that.
Deploy both and see which patterns you prefer, and what fits into your organisation better.
I have used both, but find Argo can be unnecessarily complex, and it focuses solely on Git as the source of truth for your k8s resources. The image updater can even write back to Git to reflect version numbers etc., which is arguably an anti-pattern (Git is not a database). However, the UI is excellent and very powerful, and if you're just getting started in the GitOps space, it's very intuitive.
I feel like the Weaveworks team (who created Flux) have encountered the problem of using Git as a source of truth at scale. They let you specify other sources such as S3 buckets and OCI artifacts, which gives you a lot more power to build custom, powerful workflows.
This means that you define your k8s resources (Kustomize definitions describing the k8s resources, plus the Flux resources) in Git, but build, lint and test them in a CI/CD pipeline and publish them as an OCI artifact. Then you can just tag that artifact with the cluster name or environment and treat your k8s resources like you would code. You can observe this with the Flux UI too.
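Concretely, the pipeline pushes the rendered and tested manifests as an OCI artifact and the cluster just follows a tag, something along these lines (registry, names and tags are made up, and API versions may differ slightly between Flux releases):

    # CI side: publish the manifests as an OCI artifact
    flux push artifact oci://ghcr.io/example-org/app-manifests:production \
      --path=./manifests \
      --source="$(git config --get remote.origin.url)" \
      --revision="$(git rev-parse HEAD)"

    # Cluster side: Flux follows the tag and applies whatever it finds
    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: OCIRepository
    metadata:
      name: app-manifests
      namespace: flux-system
    spec:
      interval: 1m
      url: oci://ghcr.io/example-org/app-manifests
      ref:
        tag: production    # retag the artifact to promote between environments
    ---
    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: app
      namespace: flux-system
    spec:
      interval: 10m
      sourceRef:
        kind: OCIRepository
        name: app-manifests
      path: ./
      prune: true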
I think people get too hung up on the Git part of GitOps. All infrastructure should be defined in a version control system and follow a sane CI process, but the source your cluster pulls that state from to enforce it can be anything that is a faithful reflection of the versioned code in SCM.
Absolutely, and Argo really falls short when you have more complex patterns, like monorepos and promotion between different envs. Then you have to resort to Argo Events and Workflows anyway and script your way through it.
It is mainly priced per host, at $15 or $23 per month, with the first 5-10 containers free and then $0.002 per hour (~$1.5 per month) per container. The insight and stats you get are quite granular and valuable, however. For large-scale deployments you can exclude certain containers etc.
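(The ~$1.5 figure is just $0.002/hour x ~730 hours in a month ≈ $1.46.)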