The article wasn't about the outage happening; it was about the amount of time it took to even discover what the problem was. Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
> Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
It's only logical presupposing a lot of other conditions, each of which is worthy of healthy skepticism. And even then, it's only a hypothesis. You need evidence to go from "this could have contributed to the problem" to "this caused the problem."
What little is given in the article seems to cut strongly against this hypothesis. For example, it links to multiple past findings, going back to 2017, that Amazon's notification times need improvement. If something has been a problem for nearly a decade, it's hard to imagine it is the result of any recent personnel changes.
TFA does not establish how many AWS workers have left or been laid off, let alone how many of those were genuinely undesirable losses of highly skilled individuals. Even if we take it on faith that a large number of such individuals were lost, it is another bridge further to claim that no redundancy in that skillset remained, or that the resulting vacancies have been left unfilled since.
No evidence is given that a more experienced team working on the problem would have identified and resolved it faster. The article even states something to the opposite effect:
> AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, just because they've already hit similar issues years ago and ironed out the kinks in their resilience story.
Indeed, the article doesn't even provide evidence that the response was unreasonably slow: there is no comparison to similar outages, either from AWS in the past, before the hypothesized brain drain, or from competitors. Note that the author has no idea what the problem actually was, or what AWS had to do to diagnose the issue.
It's the most plausible, fact-based guess, beating other competing theories.
Understaffing and absences would clearly lead to delayed incident response, but a responsible cloud provider would avoid such obvious negligence and breach of contract by keeping a supposedly adequate number of people on duty.
An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place.
AWS engineers being formerly competent but currently stupid, absent any organizational issues, might be explained by brain damage: "RTO" might have caused collective chronic poisoning, e.g. lead in the drinking water, but I doubt Amazon is that cheap.
> An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place
You seem to be misunderstanding the nature of the issue.
The DNS records for DynamoDB's API disappeared. The endpoint resolves to a dynamic pool of IPs that constantly changes.
A ton of AWS services that use DynamoDB could no longer do so. Hardcoding IPs wasn't an option. Nor could clients do anything on their side.
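To make that concrete, here's a rough sketch in plain Python (nothing AWS-internal, just the public regional endpoint) of what every client is implicitly doing; when the records vanished, the lookup step itself failed:

```python
# Rough sketch, not AWS internals: resolve the public DynamoDB endpoint a few
# times. The answer is a rotating pool of addresses, not a fixed IP, which is
# why "just hardcode the IPs" was never an option.
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

for attempt in range(3):
    try:
        infos = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
        print(f"attempt {attempt + 1}: {ips}")
    except socket.gaierror as err:
        # During the outage this branch is effectively what everyone hit:
        # the name stopped resolving, so calls failed before any connection.
        print(f"attempt {attempt + 1}: resolution failed ({err})")
    time.sleep(1)
```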
> a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses)
Did you consider that DNS might’ve been a symptom? If the DynamoDB DNS records use a health-check, switching DNS servers will not resolve the issue and might make it worse by directing an unusually high volume of traffic at static IPs without autoscaling or fault recovery.
The article describes evidence for a concrete, straightforward organizational decay pattern that can explain a large part of this miserable failure. What's "self-serving" about such a theory?
My personal "guess" is that failing to retain knowledge and talent is only one of many components of a well-rounded crisis of bad management and bad company culture that has been eroding Amazon on more fronts than AWS reliability.
What's your theory? Conspiracy within Amazon? Formidable hostile hackers? Epic bad luck? Something even more movie-plot-like? Do you care about making sense of events in general?
We witnessed someone repeatedly shoot themselves in the foot a few months ago. It is indeed a guess that this is the cause of their current foot pain, but it is a rather safe one.
Twice I've had to deal with outages where the root cause took a long time to find because there were several distinct root causes interacting in ways that made it difficult or impossible to reproduce the problem in isolation, or even to reason about it until we started figuring out that there were multiple unrelated causes at play. All the other outages I've dealt with were the sort where experienced engineers and institutional knowledge were sufficient to quickly find the cause and fix it.
Which is to say: it's entirely possible that the inferences drawn by TFA are just wrong. It's also possible that TFA is wrong about this outage but still right to express concern with how Amazon manages talent.
It's based on the time between the announcements about finding the cause, which I find to be thin evidence. There are far too many alternative explanations. It's not even that I find the idea implausible; I just don't think the article's doom-saying confidence level is warranted.
This is what I always do. Rather than go directly from the card reader or camera into Photos or Lightroom, I copy the files onto an SSD, and then bring them in from the SSD. The entire process goes faster.
I also want to point out that I've seen similar corruption in the past, only in Lightroom. The culprit ended up being hardware, not software. Specifically, the card reader's USB cable. I've actually had two of these cables fail on different readers. On the most recent one, I replaced it with a nicer Micro B to USB C cable, and haven't had an issue.
I haven't had actual corruption, but I have had imports take an excessively long time or fail to complete in Lightroom because of bad USB cables or (I think) a bad USB jack.
Generally I'm frustrated with the state of USB. Bad cables are all over the place, and I'm inclined to throw a cable out if I have the slightest problem with it. My take is that the import process with Lightroom is fast and reliable if I am using good readers and good cables; it is fine importing photos from my Sony a7iv off a CFexpress card, but my Sony a7ii has always been problematic and benefits greatly from taking the memory card out and putting it in a dedicated reader. Sometimes I use the second slot in the a7iv instead.
I use Lightroom, but always with this workflow (copy files from memory card to disk, then use LR to do the import / move / build previews).
If nothing else, it lets you get your card back much more quickly, as a file-system copy runs at ~1500 MB/s, which makes a difference when importing 50-100 GB of photos.
I also don't delete the images off the memory card until they've been backed up from the disk to some additional medium.
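For anyone curious, the copy step itself is nothing special; here's a rough sketch in Python with made-up mount paths (plain rsync or a Finder/Explorer copy works just as well):

```python
# Rough sketch of the "copy to disk first, import later" step.
# The mount paths are hypothetical and will differ per OS and reader.
import shutil
from pathlib import Path

CARD = Path("/Volumes/SD_CARD/DCIM")         # hypothetical card mount point
STAGING = Path("/Volumes/FastSSD/incoming")  # hypothetical SSD staging folder

for src in sorted(p for p in CARD.rglob("*") if p.is_file()):
    dst = STAGING / src.relative_to(CARD)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Skip files already copied so re-running never touches the card twice.
    if dst.exists() and dst.stat().st_size == src.stat().st_size:
        continue
    shutil.copy2(src, dst)  # copy2 keeps timestamps, which Lightroom sorts by

# Point Lightroom at STAGING afterwards; the card goes back in the camera as
# soon as the copy finishes, and nothing gets deleted until STAGING is backed up.
```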
I haven't tried super fast memory cards, but with what I have, importing by copying from the card maxes out the card (~100 MB/s). Bonus points for the preview generation starting while the copy is ongoing.
I see extreme irony in being lectured about reality by a person who believes I'm shilling LLMs. It doesn't make any sense; are you sure you're not an LLM?
That defeats the whole point of this issue. These uses are fair use; they shouldn't have to license anything. You can't teach music without playing it. YouTube is just allowing rights holders to make claims without any evidence or any punishment for being wrong.
Fair use is the problem. It's too ambiguous, and as a result lawyers can play the games they're playing. My solution is dirt simple, keeps everybody happy, and quits wasting time pretending we're living in 1998.
Have to agree. I've tried multiple times to replace my Verizon FiOS router with different EdgeRouters, and none of them have been able to match the gigabit speeds I get with the Verizon router. I'm not even using wifi; I just want a simple router with a firewall and port forwarding that can compare to my $12/mo one from Verizon. I troubleshot each for a week, tweaking hardware acceleration and other knobs, but they couldn't keep up. I think people don't compare and test, and just assume it's just as good, but it isn't.
Weird. I got sustained symmetric gigabit speed out of an EdgeRouter Lite when it was loaded with a basic firewall and some port forwarding. At the time I purchased it, the thing cost about ten months of your ISP-provided equipment rental.
Maybe the later EdgeRouters are total trash, but the ERL could (and did) totally handle what you're describing.
Eh, part of assessing the vulnerability is how deep it goes. Showing that there were no gates or roadblocks to accessing all the data is a valid thing to research; otherwise they can later say "oh, we had rate limiting in place" or "we had network vulnerability scanners which would've prevented a wholesale leak".
Hey there--pentester, security researcher, and bug bounty hunter here.
"Demonstrating impact" is common practice. The presence (or non-presence) of rate limiting controls, such as those alluded to by the commenter above, can play into the risk assigned to a vulnerability, and may be difficult to ascertain without actually attempting an otherwise theoretical attack. This also has the effect of indicating whether the target has adequate detection capabilities, which is important information.
Demonstrating impact is also sometimes simply necessary to convey urgency to leadership; hand-waving is common. Alternatively, some organizations may silently patch without any responsible disclosure, as was the case in this article. Having hard proof that the attack was 1) viable and 2) not detected is critical information in the event that you must disclose to the public.
As an aside, from your history:
> My one gripe with HN is that people say incorrect things with complete confidence pretty regularly and you can only detect it if you know the subject matter.
Welcome to being part of the problem. Remember the feeling.
Also a security professional, pentester, bug bounty hunter, and holder of a multitude of other irrelevant self-imposed titles here.
You demonstrate impact with small amounts of enumeration. If you had any real experience in bug bounty contracts you would know two things:
Almost all contracts ask you not to enumerate the entire data set, since 2 or 3 records is enough (again, that's how security controls work), and no one is interested in hearing about rate limiting on public bounties. In pentesting, sure, but that's not what we're talking about.
Source: two decades in the security industry at large, in all kinds of positions.
And a note for future reference: if you think I'm out of line for my snark, then don't give what you can't take.
Edit: Oh, and as someone who has been on both sides of the fence, enumerating an entire data set against scope is in the top ten reasons people get booted from programs. To anyone else seeing this chain: don't do it. YOU DO NOT NEED TO DO IT TO PROVE IMPACT. Respect people's privacy.