I am slowly waking up to the realization that we (software engineers) are laughably bad at security. I used to think that it was only NPM (I have worked a lot in this ecosystem over the years), but I have found this to be essentially everywhere: NPM is a poster child for this because of executable scripts on install, but every package manager essentially boils down to "Install this thing by name, no security checks". Every ecosystem I touch now (apart from gamedev, but only because I roll everything myself there by choice) has this - e.g. Cargo has a lot of "tools" that you install globally so that you get some capability (like flamegraphs, asm output, test runners etc.) - this is the same vulnerability, manifesting slightly differently. As others have pointed out, it is common to just pull random Docker images via Helm charts. It is also common to get random "utility" tools during builds in CI/CD pipelines, just by curl-ing random URLs of various "release archives". You don't even have to look too hard - this is surface level in pretty much every company, almost every industry (I have my doubts about the security theatre in some, but I have no first hand experience, so cannot say).
The issue I have is that I don't really have a good idea for a solution to this problem - on one hand, I don't expect everyone to roll the entire modern stacks by hand every time. Killing collaborative software development seems like literally throwing the baby out with the bath water. On the other hand, I feel like nothing I touch is "secure" in any real sense - the tick boxes are there, and they are all checked, but I don't think a single one of them really protects me against anything - most of the time, the monster is already inside the house.
>The issue I have is that I don't really have a good idea for a solution to this problem - on one hand, I don't expect everyone to roll the entire modern stacks by hand every time. Killing collaborative software development seems like literally throwing the baby out with the bath water.
Is NPM really collaborative? People just throw stuff out there and you can pick it up. It's the lowest common denominator of collaboration.
The thing that NPM is missing is trust and trust doesn't scale to 1000x dependencies.
IMO the solution is auditing. We should be auditing every single version of every single dependency before we use it. Not necessarily personally, but we could have a review system like Ebay/Uber/AirBnB and require N trusted reviews.
This is the way. But people read it, nod their heads, and then go back to yolo'ing dependencies into their project without reading them. Culture change is needed.
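To make the "require N trusted reviews" idea concrete, here is a minimal sketch of what such a gate could look like. Everything in it is hypothetical: the `Review` record, the `is_allowed` function and the reviewer names are made up for illustration, and no existing registry is claimed to work this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """A hypothetical audit record for one exact version of one package."""
    reviewer: str
    package: str
    version: str
    approved: bool


def is_allowed(package, version, reviews, trusted_reviewers, required=2):
    """Allow a dependency only if enough distinct trusted reviewers approved
    this exact version. Reviews of other versions do not count."""
    approvers = {
        r.reviewer
        for r in reviews
        if r.package == package
        and r.version == version
        and r.approved
        and r.reviewer in trusted_reviewers
    }
    return len(approvers) >= required


# Made-up data for illustration only.
trusted = {"alice", "bob", "carol"}
reviews = [
    Review("alice", "left-pad", "1.3.0", True),
    Review("bob", "left-pad", "1.3.0", True),
    Review("mallory", "left-pad", "1.3.1", True),  # not a trusted reviewer
    Review("alice", "left-pad", "1.3.1", True),
]

ok_old = is_allowed("left-pad", "1.3.0", reviews, trusted)
ok_new = is_allowed("left-pad", "1.3.1", reviews, trusted)
```

The point of keying on the exact version is that a new release starts with zero approvals, so a compromised update cannot inherit the trust earned by the previous one.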
Something that I keep thinking about is spec driven design.
Suppose that, for code, there is a parallel "state" document with the intent behind each line of code and each function, and that this state document is in turn connected to a "higher layer of abstraction" document (recursively up as needed) to tie in higher layers of intent.
Such a thing would make it easier to surface weird behavior imo, alongside general "spec driven design" perks. More human readable = more eyes, and potential for automated LLM analysis too.
I'm not sure it'd be _Perfect_, but I think it'd be loads better than what we've got now
I am a big fan of Bazel and have explored Nix (although, regrettably not used it in anger quite yet) - both seem like good steps in the right direction and something I would love to see more usage/evolution of. However, it is important to recognize that these tools have a steep learning curve and require deep knowledge in more than one aspect in order to be used effectively/at all.
Speed of development and development experience are not metrics to be minimized/discarded lightly. If you were to start a company/product/project tomorrow, a lot of the things you want to be doing in the beginning are not related to these tools. You probably, most of the time, want to be exploring your solution space. Creating a development and CI/CD environment that can fully take advantage of these tools' capabilities (like hermeticity and reproducibility) is not straightforward - in most cases setting up, scaling and maintaining them requires a whole team with knowledge that most developers won't have. You don't want to gatekeep the writing of new software behind such requirements. But I do agree that the default should be closer to this than what we have today. How we get there - now that is the million dollar question.
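A small piece of that default can be had without the full tooling. Bazel's `http_archive` takes a `sha256` attribute and Nix requires a hash for fetched sources; the same pinning can be approximated by hand for the "curl a release archive in CI" case. The sketch below is only that, a sketch: `verify_sha256` and `use_if_pinned` are names I made up, and the payload stands in for bytes you have already downloaded.

```python
import hashlib
import hmac


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True only if the SHA-256 of data matches the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_hex.lower())


def use_if_pinned(data: bytes, expected_hex: str) -> bytes:
    """Hand the artifact back only when it matches the digest we pinned."""
    if not verify_sha256(data, expected_hex):
        raise ValueError("digest mismatch; refusing to use artifact")
    return data


# Stand-in for a downloaded release archive (illustrative bytes only).
payload = b"release-archive-bytes"
# In practice this digest is committed to the repo when the tool is reviewed.
pinned = hashlib.sha256(payload).hexdigest()

artifact = use_if_pinned(payload, pinned)
```

This does not tell you the archive is safe, only that it is the same archive someone looked at when the digest was pinned. It is a cheap step towards reproducibility, not a replacement for it.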
I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all time favourite bits about distributed systems comes from the (legendary) talk at GDC - I Shot You First[1] - where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times, all throughout my career from making games, to livestream and VoD services to now fintech. Always account for the flow of time when doing a distributed operation - time's arrow always marches forward, your systems might not.
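The tilted-arrow picture boils down to this: a message sent earlier can arrive later. A minimal sketch of guarding against that, assuming a hypothetical lease record with a monotonically increasing version (`LeaseRecord` and the host names are made up, and this is not a description of how DWFM or any AWS service works):

```python
from dataclasses import dataclass


@dataclass
class LeaseRecord:
    """Hypothetical lease state guarded by a monotonically increasing version."""
    owner: str = ""
    version: int = 0

    def apply(self, owner: str, version: int) -> bool:
        """Apply an update only if it is newer than the state already held."""
        if version <= self.version:
            return False  # stale: produced before the state we already have
        self.owner = owner
        self.version = version
        return True


lease = LeaseRecord()
# The update produced second (version 2) happens to arrive first.
applied_new = lease.apply("host-b", 2)
# The update produced first (version 1) arrives late and is rejected.
applied_old = lease.apply("host-a", 1)
```

Without the version check, the late arrival would overwrite the newer state, which is exactly the kind of thing the "Where is the lag?" question is meant to surface.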
But the stale read didn't scare me nearly as much as this quote:
> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues
Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it, maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.
It shouldn't scare you. It should spark recognition. This meta-failure-mode exists in every complex technological system. You should be, like, "ah, of course, that makes sense now". Latent failures are fractally prevalent and have combinatoric potential to cause catastrophic failures. Yes, this is a runbook they need to have, but we should all understand there are an unbounded number of other runbooks they'll need and won't have, too!
The thing that scares me is that AI will never be able to diagnose an issue that it has never seen before. If there are no runbooks, there is no pattern recognition. This is something I've been shouting about for 2 years now; hopefully this issue makes AWS leadership understand that current gen AI can never replace human engineering.
I'm much less confident in that assertion. I'm not bullish on AI systems independently taking over operations from humans, but catastrophic outages are combinations of less-catastrophic outages which are themselves combinations of latent failures, and when the latent failures are easy to characterize (as is the case here!), LLMs actually do really interesting stuff working out the combinatorics.
I wouldn't want to, like, make a company out of it (I assume the foundational model companies will eat all these businesses) but you could probably do some really interesting stuff with an agent that consumes telemetry and failure model information and uses it to surface hypos about what to look at or what interventions to consider.
All of this is beside my original point, though: I'm saying, you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!
It's shocking to me too, but not very surprising. It's probably a combination of factors that could cause a failure of planning and I've seen it play out the same way at lots of companies.
I bet the original engineers planned for this cold start situation and designed the system to be resilient to it. But over time those engineers left, and new people took over -- those who didn't fully understand and appreciate the complexity, and probably didn't care that much about all the edge cases. Then, pushed by management to pursue goals that are antithetical to reliability, such as cost optimization and other things, they introduced the new failure case through lots of suboptimal changes. The result is as we see it -- a catastrophic failure which caught everyone by surprise.
It's the kind of thing that happens over and over again when the accountants are in charge.
> But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have recovery procedure.
I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.
2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
One of the things I think goes a little underappreciated about Unreal is the fact that everyone gets the source code. Interested in how Nanite or Lumen work? The source is right there! With all the comments (or lack of), with all the debug statements and branches that can be used to diagnose the behaviour of the system.
Often while developing, I dive into the source of the engine to understand how exactly some low level system works. I also blatantly copy all the complex UI widgets available in the editor when I want to extend them/make custom ones for my games (I hate UI programming). This is invaluable for teaching the next generation of engine developers imo.
A camera like this makes me question something - why was it installed in the first place? You can't distinguish things like license plates on cars or faces (I doubt ML helps here either). So what is this for? The view is beautiful, but I fail to see the purpose of it.
This surprises me: there is an IX right on the border between Turkey and Bulgaria [1]. All other IXs are located in the capital. Is there a reason an IX would be located there?