stefan_bobev's comments

stefan_bobev · 2025-12-22T23:31:51 1766446311

I am slowly waking up to the realization that we (software engineers) are laughably bad at security. I used to think that it was only NPM (I have worked a lot in this ecosystem over the years), but I have found this to be essentially everywhere: NPM is a poster child for this because of executable scripts on install, but every package manager essentially boils down to "Install this thing by name, no security checks". Every ecosystem I touch now (apart from gamedev, but only because I roll everything myself there by choice) has this - e.g. Cargo has a lot of "tools" that you install globally so that you get some capability (like flamegraphs, asm output, test runners etc.) - this is the same vulnerability, manifesting slightly differently. Like others have pointed out, it is common to just pull random Docker images via Helm charts. It is also common to get random "utility" tools during builds in CI/CD pipelines, just by curl-ing random URLs of various "release archives". You don't even have to look too hard - this is surface level in pretty much every company, almost every industry (I have my doubts about the security theatre in some, but I have no first hand experience, so cannot say).

The issue I have is that I don't really have a good idea for a solution to this problem - on one hand, I don't expect everyone to roll an entire modern stack by hand every time. Killing collaborative software development seems like literally throwing the baby out with the bath water. On the other hand, I feel like nothing I touch is "secure" in any real sense - the tick boxes are there, and they are all checked, but I don't think a single one of them really protects me against anything - most of the time, the monster is already inside the house.

Muromec · 2025-12-23T00:01:52 1766448112

>The issue I have is that I don't really have a good idea for a solution to this problem - on one hand, I don't expect everyone to roll the entire modern stacks by hand every time. Killing collaborative software development seems like literally throwing the baby out with the bath water.

Is NPM really collaborative? People just throw stuff out there and you can pick it up. It's the lowest common denominator of collaboration.

The thing that NPM is missing is trust and trust doesn't scale to 1000x dependencies.

nicoburns · 2025-12-23T00:41:56 1766450516

IMO the solution is auditing. We should be auditing every single version of every single dependency before we use it. Not necessarily personally, but we could have a review system like Ebay/Uber/AirBnB and require N trusted reviews.
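
A minimal sketch of what the "require N trusted reviews" gate could look like. This assumes a hypothetical review record and a locally maintained set of trusted reviewer names; `Review`, `is_allowed`, and the `required` threshold are illustrative names, not part of any existing package manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical record of one reviewer's verdict on one exact version."""
    reviewer: str
    package: str
    version: str
    approved: bool


def is_allowed(package, version, reviews, trusted, required=2):
    """Allow a dependency only if enough distinct trusted reviewers approved
    this exact package version. Reviews of other versions do not count."""
    approvers = {
        r.reviewer
        for r in reviews
        if r.package == package
        and r.version == version
        and r.approved
        and r.reviewer in trusted
    }
    return len(approvers) >= required
```

The check is deliberately per-version: an approval for one release says nothing about the next one, which is the point of auditing every single version.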

ryandrake · 2025-12-23T02:01:18 1766455278

This is the way. But people read it, nod their heads, and then go back to yolo'ing dependencies into their project without reading them. Culture change is needed.

nicoburns · 2025-12-23T16:51:57 1766508717

> Culture change is needed.

Yes, but IMO a tooling change is needed first. There just isn't good infrastructure for doing this.

Jarwain · 2025-12-23T00:44:14 1766450654

Something that I keep thinking about is spec driven design.

If, for code, there is a parallel "state" document with the intent behind each line of code, each function

And in conjunction that state document is connected to a "higher layer of abstraction" document (recursively up as needed) to tie in higher layers of intent

Such a thing would make it easier to surface weird behavior imo, alongside general "spec driven design" perks. More human readable = more eyes, and potential for automated LLM analysis too.

I'm not sure it'd be _Perfect_, but I think it'd be loads better than what we've got now

Cyph0n · 2025-12-22T23:41:39 1766446899

I think the solution is a build system that requires version pinning - options include Nix, Bazel, and Buck.
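
As a rough illustration of the pinning idea (not how Nix, Bazel, or Buck implement it), a fetched artifact can be checked against a digest that was recorded in advance. The `PINNED` table, the `verify_artifact` function, and the artifact name are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical lock data: artifact name -> expected SHA-256 hex digest.
# In a real setup this would live in a checked-in lock file.
PINNED = {
    "tool-1.2.3.tar.gz": hashlib.sha256(b"known good bytes").hexdigest(),
}


def verify_artifact(name: str, data: bytes) -> bool:
    """Return True only if `data` matches the digest pinned for `name`."""
    expected = PINNED.get(name)
    if expected is None:
        return False  # unpinned artifacts are rejected outright
    actual = hashlib.sha256(data).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected)
```

A check like this only guarantees that the same bytes are used on every build; whether those bytes are trustworthy is a separate question.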

stefan_bobev · 2025-12-23T00:00:50 1766448050

I am a big fan of Bazel and have explored Nix (although, regrettably not used it in anger quite yet) - both seem like good steps in the right direction and something I would love to see more usage/evolution of. However, it is important to recognize that these tools have a steep learning curve and require deep knowledge in more than one aspect in order to be used effectively/at all.

Speed of development and development experience are not metrics to be minimized/discarded lightly. If you were to start a company/product/project tomorrow, a lot of the things you want to be doing in the beginning are not related to these tools. You probably, most of the time, want to be exploring your solution space. Creating a development and CI/CD environment that can fully take advantage of these tools' capabilities (like hermeticity and reproducibility) is not straightforward - in most cases setting up, scaling and maintaining these requires a whole team with knowledge that most developers won't have. You don't want to gatekeep the writing of new software behind such requirements. But I do agree that the default should be closer to this than what we have today. How we get there - now that is the million dollar question.

trollbridge · 2025-12-23T05:25:54 1766467554

Back in the days of Makefiles and autoconf, we tended to require specific versions and would document that in the readme.

LtWorf · 2025-12-23T09:11:03 1766481063

Unless you audit the version you're pinning, what's the difference?

stefan_bobev · 2025-10-23T19:27:25 1761247645

I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all time favourite bits about distributed systems comes from the (legendary) talk at GDC - I Shot You First[1] - where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times, all throughout my career from making games, to livestream and VoD services to now fintech. Always account for the flow of time when doing a distributed operation - time's arrow always marches forward, your systems might not.

But the stale read didn't scare me nearly as much as this quote:

> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues

Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it, maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.

[1] https://youtu.be/h47zZrqjgLc?t=1587

tptacek · 2025-10-23T19:30:58 1761247858

It shouldn't scare you. It should spark recognition. This meta-failure-mode exists in every complex technological system. You should be, like, "ah, of course, that makes sense now". Latent failures are fractally prevalent and have combinatoric potential to cause catastrophic failures. Yes, this is a runbook they need to have, but we should all understand there are an unbounded number of other runbooks they'll need and won't have, too!

lazystar · 2025-10-23T19:47:20 1761248840

The thing that scares me is that AI will never be able to diagnose an issue that it has never seen before. If there are no runbooks, there is no pattern recognition. This is something I've been shouting about for 2 years now; hopefully this issue makes AWS leadership understand that current gen AI can never replace human engineering.

tptacek · 2025-10-23T19:50:22 1761249022

I'm much less confident in that assertion. I'm not bullish on AI systems independently taking over operations from humans, but catastrophic outages are combinations of less-catastrophic outages which are themselves combinations of latent failures, and when the latent failures are easy to characterize (as is the case here!), LLMs actually do really interesting stuff working out the combinatorics.

I wouldn't want to, like, make a company out of it (I assume the foundational model companies will eat all these businesses) but you could probably do some really interesting stuff with an agent that consumes telemetry and failure model information and uses it to surface hypos about what to look at or what interventions to consider.

All of this is beside my original point, though: I'm saying, you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!

janalsncm · 2025-10-24T06:30:22 1761287422

AI is a lot more than just LLMs. Running through a rat's nest of interdependent systems like the one AWS has is exactly what symbolic AI was good at.

Aeolun · 2025-10-24T04:03:44 1761278624

I think millions of systems have failed due to missing DNS records though.

gtowey · 2025-10-24T05:16:49 1761283009

It's shocking to me too, but not very surprising. It's probably a combination of factors that could cause a failure of planning and I've seen it play out the same way at lots of companies.

I bet the original engineers planned for this cold start situation and designed the system to be resilient to it. But over time those engineers left, and new people took over -- those who didn't fully understand and appreciate the complexity, and probably didn't care that much about all the edge cases. Then, pushed by management to pursue goals that are antithetical to reliability, such as cost optimization and other things, the new failure case was introduced by lots of suboptimal changes. The result is as we see it -- a catastrophic failure which caught everyone by surprise.

It's the kind of thing that happens over and over again when the accountants are in charge.

throwdbaaway · 2025-10-24T06:22:40 1761286960

> But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure.

I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.

A couple of red flags though:

1. Apparent lack of load-shedding support by this DWFM, such that a server reboot had to be performed. Need to learn from https://aws.amazon.com/builders-library/using-load-shedding-...

2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
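
For readers unfamiliar with the load-shedding idea in point 1, here is a generic sketch, assuming a simple cap on in-flight requests. It is not a description of how DWFM or any AWS service works; `LoadShedder`, `handle`, and the `"rejected"` return value are hypothetical names.

```python
import threading


class LoadShedder:
    """Caps concurrent work; excess requests are rejected instead of queued."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot, or return False immediately if the cap is reached."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give a slot back once the work is finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work):
    """Run `work` if capacity allows; otherwise fail fast with 'rejected'."""
    if not shedder.try_acquire():
        return "rejected"
    try:
        return work()
    finally:
        shedder.release()
```

Failing fast keeps the accepted requests within what the server can finish, so a backlog cannot grow until nothing completes in time.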

stefan_bobev · on May 31, 2021

One of the things I think goes a little underappreciated about Unreal is the fact that everyone gets the source code. Interested in how Nanite or Lumen work? The source is right there! With all the comments (or lack of), with all the debug statements and branches that can be used to diagnose the behaviour of the system.

Often while developing, I dive into the source of the engine to understand how exactly some low level system works. I also blatantly copy all the complex UI widgets available in the editor when I want to extend them/make custom ones for my games (I hate UI programming). This is invaluable for teaching the next generation of engine developers imo.

gentleman11 · on May 31, 2021

True, but does it expose you to liability if you later do engine dev work elsewhere? Actually curious how that works

zarzavat · on May 31, 2021

Only if they have patented the algorithms and I seriously doubt Epic sees a future for itself as a patent troll.

stefan_bobev · on May 4, 2021

All of this reminds me of one of the best GDC talks ever given: https://www.youtube.com/watch?v=E8Lhqri8tZk

unixhero · on May 4, 2021

This is like the Defcon of awesome geekiness conferences.

I've seen many awesome presentations from GDC. There should be an awesomelist collecting these.

stefan_bobev · on July 13, 2020

A camera like this makes me question something - why was it installed in the first place? You can't distinguish things like license plates on cars or faces (I doubt ML helps here either). So what is this for? The view is beautiful, but I fail to see the purpose of it.

Symbiote · on July 13, 2020

Advertising. It's linked from the hotel's website: https://www.royalhotelsanremo.com/en/webcam-sanremo

_abox · on July 13, 2020

Ah and in this case it actually makes sense for it to have no password. Unlike many of the elevator and office cams I'm seeing ;)

stefan_bobev · on Oct 21, 2019

This surprises me: there is an IX right on the border between Turkey and Bulgaria [1]. All other IXs are located in the capital. Is there a reason an IX would be located there?

[1]: https://www.internetexchangemap.com/#/internet-exchange/balc...