Hugops to the people working on this for the last 31+ hours.
Running incidents of this significance is hard and draining and requires a lot of effort; something going on for this long must be very difficult for all involved.
Running software in an airgapped environment is difficult, but the hardest parts are the install, packaging and shipping updates. I have used https://zarf.dev/ to do this for a government client, and it was an amazing experience. I highly recommend it. K8s seems heavy, but if you want to run datastores with backups (k8s operators) or highly customised environments, and automate all of that instead of writing loads of bash and custom code, it shines.
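To give a rough idea of what that looks like: a Zarf package is a YAML definition listing the charts and images to bundle, which Zarf builds into a single archive you carry across the airgap and deploy on the other side. A minimal sketch (the package name and versions are made up, so treat it as illustrative rather than copy-paste):

    # zarf.yaml - declares what gets bundled into the offline package
    kind: ZarfPackageConfig
    metadata:
      name: example-app            # hypothetical package name
    components:
      - name: example-app
        required: true
        charts:
          - name: podinfo
            url: https://stefanprodan.github.io/podinfo
            version: 6.5.4         # illustrative chart version
            namespace: podinfo
        images:
          # images are pulled at build time and shipped inside the archive
          - ghcr.io/stefanprodan/podinfo:6.5.4

    # on the connected side
    zarf package create .
    # on the airgapped side (exact filename depends on name/arch/version)
    zarf package deploy zarf-package-example-app-amd64.tar.zst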
If you carry on reading, it's quite obvious they misconfigured a service and routed production traffic to that instead of the correct service, and that the system used to do this was built in 2018 and is considered legacy (probably because it makes it easy to deploy bad configs). Given that, I wouldn't say the summary is "inscrutable corporatese", whatever that is.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Joe Shmo committed the wrong config file to production. Innocent mistake. Sally caught it in 30 seconds. We were back up inside 2 minutes. Sent Joe to the margarita shop to recover his shattered nerves. Kid deserves a raise. Etc.
I have run ELK, Grafana + Prom, Grafana + Thanos/Cortex, New Relic and all of the more traditional monitoring/observability products. In the last few years I have been running full observability stacks via either the Grafana LGTM stack or Datadog, at reasonable scale and complexity. Ultimately you want one tool that can alert you off a metric, present you some traces, and drill down into logs, all the way down the stack.
I have found Datadog to be, hands down, the best developer experience from the get-go; the way it glues its mostly decent products together is unparalleled compared to the alternatives (Grafana Cloud/LGTM). I usually say that if you're a small to medium scale business it just makes sense, IF you understand the product and configure it correctly, which is reasonably easy.
The seamless integration between tracing, logging and metrics in the platform, which you can then easily combine with alerts, is great. However, it's easy to misconfigure it and spend a lot of money on seemingly nothing. If you do not implement tracing and structured logs (at the right volume and level) with trace/span ids etc. all the way through your services, it's hard to see the value and it seems expensive. It requires good knowledge and configuration of the product to make it pay off.
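For what it's worth, the log/trace correlation part doesn't have to be complicated; the main thing is getting the trace and span ids into every structured log line so the platform can join them up. A rough Python sketch of the idea using the OpenTelemetry SDK (service and field names are just for illustration, nothing Datadog-specific):

    # Sketch: attach the current trace/span ids to structured log output so a
    # log line can be joined back to its trace in the observability backend.
    import json
    import logging

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider

    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer("checkout-service")   # hypothetical service name

    log = logging.getLogger("checkout-service")
    logging.basicConfig(level=logging.INFO)

    def log_with_trace(message, **fields):
        ctx = trace.get_current_span().get_span_context()
        fields.update(
            message=message,
            trace_id=format(ctx.trace_id, "032x"),  # 128-bit id, hex encoded
            span_id=format(ctx.span_id, "016x"),
        )
        log.info(json.dumps(fields))

    with tracer.start_as_current_span("charge-card"):
        log_with_trace("payment authorised", amount_cents=4200)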
The rest of the product features are generally good; for example, their security suite is a good entry level to cloud security monitoring and SIEM too.
However, when you get to a certain scale, the cost of APM and infrastructure hosts in Datadog can become somewhat prohibitive. Datadog's custom metrics pricing is also expensive, its query language does not quite match the power of PromQL, and you start to find yourself needing that power to debug issues. At that point the self-hosted LGTM stack starts to make sense; however, it involves a lot more education for end users, both in integration (a little less now that OTel is popular) and in querying/building dashboards etc., plus running it yourself. The Grafana Cloud platform is more attractive on that front, though.
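As a concrete example of where PromQL tends to win: a per-service error-rate ratio is a couple of lines (metric and label names here are hypothetical):

    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m]))

That kind of ad-hoc ratio is exactly the sort of query that is harder to express in Datadog's metric query language.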
My experience mirrors yours wrt Datadog. It's incredible value at low scale, you get a full robust system with great devex for pennies. Once you hit that tipping point though, you are locked in pretty hardcore. Datadog snakes its way far into your codebase, with all the custom tracing and stuff like that. Migrating off of it is a very expensive endeavor, which is probably one of the reasons why they are such a money printing operation.
I think "medium scale" is probably more appropriate. For a $3M-$5M revenue SaaS you're still paying $50k+/year. That's not nothing for a small owner or a PE-backed SaaS company that is focused on profits/EBITDA.
Yeah, the secret sauce of the dd libs was/is addictive for sure! I think it's perhaps better now that you can just use OTel for custom traces and the OTel contrib libs for auto-instrumentation, and send that to the dd agent? I have not yet tried it because I suspected labels and other things might be named differently than in the DD auto-instrumentation/contrib packages, but I don't think the gap is as big now?
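From what I can tell it should just look like plain OTLP export pointed at the agent, something along these lines (untested; assumes a local Datadog Agent with OTLP ingest enabled on the default gRPC port, and the service name is made up):

    # Sketch: standard OTel SDK setup exporting over OTLP to a local agent.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(
        resource=Resource.create({"service.name": "checkout-service"})
    )
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
        )
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("charge-card"):
        pass  # application work happens here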
One thing that always surprises me is that people haven't made more of a fuss about Docker for Mac. By default on install it shares the whole hard disk (unless that's changed), meaning without sudo you can get privileged access to the whole filesystem. I scope it down to my user folder, but the defaults are dangerous.
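It's easy to check what a container can see under the default shared paths; any container can bind-mount them and read whatever your user account can read, no sudo involved, e.g.:

    docker run --rm -v /Users:/host alpine ls /host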
Do you also think running any regular software is "dangerous"? Because that gets to access your disk as well. Docker is not for security isolation, it is for distributing apps so they'll even run on your Mac.
While your point about Docker’s primary purpose is valid, containerization is commonly used for security isolation as well. With proper configuration, it can be very useful towards this end.
Can you suggest any preferred alternative methods of isolation that offer similar efficacy and ease of use for quickly running complete software systems made by an unknown/untrusted actor?
> With proper configuration, it can be very useful towards this end.
It can. I think it's fair to assume that the standard developer setup to let them be productive is not this proper configuration.
> Can you suggest any preferred alternative methods of isolation that offer similar efficacy and ease of use for quickly running complete software systems made by an unknown/untrusted actor?
No. It's a hard problem! If it was easily solved we wouldn't be seeing all this development surrounding e.g. WebAssembly
Docker has had security and isolation features since it was competing with LXC on who glued cgroups and namespaces together better, and it has been discussed in those terms the whole time.
While I agree that Docker as written isn't good at security, your post has big "they're holding the iPhone wrong!" vibes, and seemingly ignores the historic reasons that people would think it provides security.
> your post has big “they’re holding the iPhone wrong!” vibes
More like "it just isn't meant to be used for that". At least not in the default configuration, and that's fine!
> seemingly ignores the historic reasons that people would think it provides security
I've been using Docker since it was announced. People have always been very clear that Docker is not a security boundary, at least not with its default configuration.
I think your point is valid. Docker was indeed all about developer productivity in the beginning, and it's up to the infrastructure operator to lock it down.
The Kubernetes provider and kubectl work, but it's not the nicest way of making changes. It's slow, quite clunky, and not particularly intuitive. If you're just getting started and you know Terraform, it's OK though. It is useful for bootstrapping GitOps tools like Argo or FluxCD.
Helm diff will show you a diff similar to Terraform's. Running Helmfile in CD isn't a bad move; it's really simple, and it's a pattern that is easy to grok for any engineer. I think this is still a valid approach in a simple setup; it's what some people call "CD OPS". It's a push model instead of pull, and there are downsides, but it's not the end of the world.
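For anyone who hasn't seen it, the whole setup is small: a helmfile.yaml pins repositories, releases and versions, and the CD job just runs a diff and an apply. A sketch (chart version and values file are illustrative):

    # helmfile.yaml - pins the releases this cluster should be running
    repositories:
      - name: grafana
        url: https://grafana.github.io/helm-charts

    releases:
      - name: loki
        namespace: monitoring
        chart: grafana/loki
        version: 6.6.2              # illustrative version pin
        values:
          - values/loki.yaml        # hypothetical values file

    # in CD:
    helmfile diff    # shows what would change, similar to a terraform plan
    helmfile apply   # upgrades only the releases whose diff is non-empty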
Ultimately, at scale, I think GitOps tooling like Flux and ArgoCD offers some of the nicest patterns, especially Flux's support for OCI artifacts as a source of truth. However, you will then venture into the realm of Kustomize and much more complex tooling and concepts, which is not always worth doing.
Rotating out nodes during an upgrade is slow and potentially disruptive; however, your systems should be built to handle this, and it is a good way of forcing that.
Deploy both and see which patterns you prefer, and what fits into your organisation better.
I have used both, but find Argo can be unnecessarily complex, and it focuses solely on Git as the source of truth for your k8s resources. The image updater can even write back to Git to reflect version numbers etc., which is arguably an anti-pattern (Git is not a database). However, the UI is excellent and very powerful, and if you're just getting started in the GitOps space, it's very intuitive.
I feel like the Weaveworks team (who created Flux) have encountered the problem of using Git as a source of truth at scale. They let you specify other sources such as S3 buckets and OCI artifacts, which gives you a lot more power to build custom, powerful workflows.
This means that you define your k8s resources (Kustomize definitions describing the k8s resources, plus the Flux resources) in Git, but build, lint and test them in a CI/CD pipeline and publish them as an OCI artifact. Then you can just tag that artifact with the cluster name or environment and treat your k8s resources like you would code. You can observe this with the Flux UI too.
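Concretely, the pipeline pushes the rendered and tested manifests as an OCI artifact and the cluster just follows a tag, something along these lines (registry, names and tags are made up, and API versions may differ slightly between Flux releases):

    # CI side: publish the manifests as an OCI artifact
    flux push artifact oci://ghcr.io/example-org/app-manifests:production \
      --path=./manifests \
      --source="$(git config --get remote.origin.url)" \
      --revision="$(git rev-parse HEAD)"

    # Cluster side: Flux follows the tag and applies whatever it finds
    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: OCIRepository
    metadata:
      name: app-manifests
      namespace: flux-system
    spec:
      interval: 1m
      url: oci://ghcr.io/example-org/app-manifests
      ref:
        tag: production    # retag the artifact to promote between environments
    ---
    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: app
      namespace: flux-system
    spec:
      interval: 10m
      sourceRef:
        kind: OCIRepository
        name: app-manifests
      path: ./
      prune: true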
I think people get too hung up on the Git part of GitOps. All infrastructure should be defined in a version control system and follow a sane CI process, but the source your cluster pulls that state from to enforce it can be anything that is a faithful reflection of the versioned code in SCM.
Absolutely, and Argo really falls short when you have more complex patterns, like monorepos and promotion between different envs. Then you have to resort to Argo Events and Workflows anyway and script your way through it.
It is mainly priced per host, at $15 or $23 per month, with the first 5-10 containers free and then $0.002 per hour (~$1.5 per month) per container. The insight and stats you get are quite granular and valuable, however. For large-scale deployments you can exclude certain containers etc.
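(The ~$1.5 figure is just $0.002/hour x ~730 hours in a month ≈ $1.46.)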