I think most of the outrage is not because "it happened" but because AWS is saying things like "S3 was unaffected" when the anecdotal experience of many in this thread suggests the opposite.
That and the apparent policy that a VP must sign off on changing status pages, which is... backwards to say the least.
> a VP must sign off on changing status pages, which is... backwards to say the least.
I think most people's experience with "VPs" makes them not realize what AWS VPs do.
VPs here are not sitting in an executive lounge wining and dining customers, chomping on cigars and telling minions to "Call me when the data center is back up and running again!"
They are on the tech call, working with the engineers, evaluating the problem, gathering the customer impact, and attempting to balance communicating too early with being precise.
Is there room for improvement? Yes. I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
But the reason we don't has nothing to do with needing VP approval to put that message up. The VPs are there in the trenches most of the time.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I gotta say, the implication that you can't register an outage until you know why it happened is pretty damning. The status page is where we look to see if services are affected; if that information can't be shared there until you understand the cause, that's very broken.
The AWS status page has become kind of a joke to customers.
I was encouraged to see the announcement in OP say that there is "a new version of our Service Health Dashboard" coming. I hope it can provide actual capabilities to display, well, service health.
From how people talk about it, it sounds like updates to the Service Health Dashboard are currently a purely manual process, with automated monitoring not updating the dashboard in any way at all. I find that a surprising implementation for an organization of Amazon's competence and power. That alarms me more than the question of who has the power to manually update it; I agree that I don't have enough knowledge of AWS internal org structures to have an opinion on whether it's the "right" people or not.
I suspect AWS must have internal service health pages that are automatically updated by monitoring in some way, that is, pages that actually work to display service health. If the public-facing system has no inputs but manual human entry, that seems like a business decision rather than a technical challenge. But that's just how it looks from the outside; I may not have full information. We only have what Amazon shares with us, of course.
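Even a crude pipeline from monitoring to the public page would be better than nothing. Here's a minimal sketch of the idea, with the caveat that every endpoint and payload below is hypothetical, not anything AWS actually exposes:

```python
import time
import urllib.request

# Hypothetical endpoints; a real system would use authenticated internal APIs.
HEALTH_CHECK_URL = "https://internal.example.com/s3/health"
STATUS_PAGE_API = "https://status.example.com/api/components/s3"

def probe_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def publish_status(ok: bool) -> None:
    """Push 'operational' or 'degraded' to the public status component."""
    body = b'{"status": "operational"}' if ok else b'{"status": "degraded"}'
    req = urllib.request.Request(
        STATUS_PAGE_API,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5.0)

if __name__ == "__main__":
    while True:
        publish_status(probe_health(HEALTH_CHECK_URL))
        time.sleep(60)  # re-evaluate once a minute
```

Obviously the real thing needs flap suppression, human override, and so on, but "green until a human types otherwise" is the opposite failure mode.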
Can you please help me understand why you, and everyone else, are so passionate about the status page?
I get that it not being updated is an annoyance, but I cannot figure out why it is the single most discussed thing about this whole event. I mean, entire services were out for almost an entire day, and if you read HN threads it would seem that nobody even cares about lost revenue/productivity, downtime, etc. The vast majority of comments in all of the outage threads are screaming about how the SHD lied.
In my entire career of consulting across many companies and many different technology platforms, never once have I seen or heard of anyone even looking at a status page outside of HN. I'm not exaggerating. Even over the last 5 years when I've been doing cloud consulting, nobody I've worked with has cared at all about the cloud provider's status pages. The only time I see it brought up is on HN, and when it gets brought up on HN it's discussed with more fervor than most other topics, even the outage itself.
In my real life (non-HN) experience, when an outage happens, teams ask each other "hey, you seeing problems with this service?" "yea, I am too, heard maybe it's an outage" "weird, guess I'll try again later" and go get a coffee. In particularly bad situations, they might check the news or ask me if I'm aware of any outage. Either way, we just... go on with our lives? I've never needed, nor have I ever seen people need, a status page to inform them that things aren't working correctly, but if you read HN you would get the impression that entire companies of developers are completely paralyzed unless the status page flips from green to red. Why? I would even go as far as to say that if you need a third party's SHD to tell you if things aren't working right, then you're probably doing something wrong.
Seriously, what gives? Is all this just because people love hating on Amazon and the SHD is an easy target? Because that's what it seems like.
A status page gives you confidence that the problem indeed lies with Amazon and not your own software. I don't think it's very reasonable to notice issues, ask other teams if they are also having issues, and then just shrug it off and get a cup of coffee without more investigation. Just because it looks like the problem is with AWS, you can't be sure until you investigate further, especially if the status page says it's all working fine.
I think it goes without saying that having an outage is bad, but having an outage which is not confirmed by the service provider is even worse. People complain about that a lot because it's the least they could do.
I care about status pages because, when something breaks upstream, I need to know whether it's an issue I need to report, whether there are additional problems related to the outage I need to look out for, or whether there are workarounds I can deploy. If I find out anything that might help me narrow down the ETA for a fix, that's bonus fries.
I don't gripe about it on HN, but it is generally a disappointment to me when I stumble upon something that looks like a significant outage and the company gives no indication that they've seen it and are working on it (or waiting for something upstream of them, as sometimes happens).
It is extremely common for customers to care about being informed accurately about downtime, and not just for AWS. I think your experience of not caring and not knowing anyone who cares may be an outlier.
> Can you please help me understand why you, and everyone else, are so passionate about the status page?
I don't think people are "passionate about the status page." I think people are unhappy with someone they are supposed to trust straight up lying to their face.
aws isn’t a hobby platform. businesses are built on aws and other cloud providers. those businesses’ customers have the expectation of knowing why they are not receiving the full value of their service.
it makes sense that, as part of marketing yourself as viable infrastructure upon which other businesses can operate, you’d provide more granular and refined communication, so information can flow up and down the chain instead of forcing your customers to rca your service themselves in order to communicate to their customers.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I think that's the crux of the matter? AWS seems to now have a reputation for ignoring issues that are easily observable by customers, and by the time any update shows up, it's way too late. Whether VPs make this decision or not is irrelevant. If this becomes a known pattern (and I think it has), then the system is broken.
disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
> disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
I'd like to share my experience here. This outage definitely impacted my company. We make heavy use of autoscaling, we use AWS CodeArtifact for Python packages, and we recently adopted AWS Single Sign-On and EC2 Instance Connect.
So, you can guess what happened:
- No one could access the AWS Console.
- No one could access services authenticated with SAML.
- Very few CI/CD, training or data pipelines ran successfully.
- No one could install Python packages.
- No one could access their development VMs.
As you might imagine, we didn't do a whole lot that day.
With that said, this experience is unlikely to change our cloud strategy very much. In an ideal world, outages wouldn't happen, but the reason we use AWS and the cloud in general is so that, when they do happen, we aren't stuck holding the bag.
As others have said, these giant, complex systems are hard, and AWS resolved it in only a few hours! Far better to sit idle for a day than to spend a few days scrambling, a VP breathing down my neck, discovering that we have no disaster recovery mechanism, that we never practiced this, that hardware lead time is 3-5 weeks, that someone introduced a cyclical bootstrapping process, and and and...
Instead, I just took the morning off, trusted the situation would resolve itself, and it did. Can't complain. =P
I might be more unhappy if we had customer SLAs that were now broken, but if that was a concern, we probably should have invested in multi-region or even multi-cloud already. These things happen.
Saying "S3 is down" can mean anything. Our S3 buckets that served static web content stayed up no problem. The API was down though. But for the purposes of whether my organization cares I'm gonna say it was "up".
> We are currently experiencing some problems related to FOO service and are investigating.
A generic, utterly meaningless message, which is still a hell of a lot more than usually gets approved, and approved far too late.
It is also still better than "all green here, nothing to see" which has people looking at their own code, because they _expect_ that they will be the problem, not AWS.
Most of what they actually said via the manual human-language status updates was "Service X is seeing elevated error rates".
While there are still decisions to be made about how you monitor errors and what sorts of elevated rates merit an alert, I would bet that AWS has internal-facing systems that can display service health in this way based on automated monitoring of error rates (as well as other things). Because they know it means something.
They apparently choose to make their public-facing service health page only show alerts via a manual process that often results in an update only several hours after lots of customers have noticed problems. This seems like a choice.
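For what it's worth, "elevated error rates" doesn't require anything sophisticated to detect automatically. A toy sketch of the idea, comparing a rolling error rate against a long-term baseline (the window size and thresholds are invented for illustration):

```python
from collections import deque

class ErrorRateMonitor:
    """Flag 'elevated error rates' when the rolling error rate exceeds
    a multiple of the expected steady-state baseline."""

    def __init__(self, window: int = 1000, baseline: float = 0.001,
                 multiplier: float = 10.0):
        self.results = deque(maxlen=window)  # outcomes of recent requests
        self.baseline = baseline             # expected steady-state error rate
        self.multiplier = multiplier         # how far above baseline counts as "elevated"

    def record(self, success: bool) -> None:
        self.results.append(success)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_elevated(self) -> bool:
        return self.error_rate() > self.baseline * self.multiplier

# Usage: record each request outcome, alert when is_elevated() flips to True.
monitor = ErrorRateMonitor()
for ok in [True] * 950 + [False] * 50:      # simulated burst of failures
    monitor.record(ok)
print(monitor.error_rate(), monitor.is_elevated())   # 0.05 True
```

The hard part is tuning thresholds per service and region, not the mechanism itself, which is exactly why I assume something like this already exists internally.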
What's the point of a status page? To me, the point of it is, when I encounter a problem (perhaps noticed because of my own automated monitoring), one of the first thing I want to do is distinguish between a problem that's out of my control on the platform, and a problem that is under my control and I can fix.
A status page that does not support me in doing that is not fulfilling its purpose. The AWS status page fails to help customers do that by regularly showing all green, with no alerts, hours after widespread problems occurred.
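Concretely, the first thing I want to automate during an incident is "has the provider acknowledged anything yet?" A rough sketch that polls one of the per-service RSS feeds the dashboard links to (the exact feed URL here is an assumption; substitute whatever the SHD actually publishes for your service and region):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed per-service feed URL; check the dashboard for the real one.
FEED_URL = "https://status.aws.amazon.com/rss/s3-us-east-1.rss"

def recent_status_items(url: str, limit: int = 5):
    """Return the most recent (title, pubDate) entries from a status RSS feed."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        tree = ET.parse(resp)
    items = tree.getroot().findall("./channel/item")
    return [(item.findtext("title"), item.findtext("pubDate")) for item in items[:limit]]

if __name__ == "__main__":
    for title, when in recent_status_items(FEED_URL):
        print(when, "-", title)
```

If that feed stays empty while my own monitoring is on fire, I'm stuck doing the "is it us or them?" investigation the status page was supposed to short-circuit.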
It doesn’t matter what the VPs are doing, that misses the point. Every minute you know there is a problem and you haven’t at least put up a “degraded” status, you’re lying to your customers.
It was on the top of HN for an hour before anything changed, and then it was still downplayed, which is insane.
I don't think the matter is whether or not VPs are involved, but the fact that human sign-off is required at all. Ideally the dashboard would accurately show what's working or not, regardless of whether the engineers know what's going on.
There's definitely miscommunication around this. I know I've miscommunicated impact, or had my communication misinterpreted across the 2 or 3 people it had to pass through before hitting the status page.
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down. But it's actually not. What's the best answer to make everyone happy here? I can't think of one.
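To make the dependency concrete: any client that assumes a role before touching a bucket fails at the STS call and never reaches S3 at all, even though the storage layer is healthy. A sketch of that call ordering with boto3 (the role ARN and bucket name are made up):

```python
import boto3

# Hypothetical role and bucket; the point is the call ordering, not the names.
ROLE_ARN = "arn:aws:iam::123456789012:role/data-reader"
BUCKET = "example-bucket"

# Step 1: credentials come from STS. If STS is down, this call fails and we
# never reach S3, even though the S3 storage layer may be perfectly healthy.
sts = boto3.client("sts")
creds = sts.assume_role(RoleArn=ROLE_ARN, RoleSessionName="outage-demo")["Credentials"]

# Step 2: only now can we talk to S3, using the temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket=BUCKET).get("KeyCount", 0))
```

From the customer's side of that script, "S3 is down" and "STS is down" are indistinguishable, which is why the internal distinction doesn't land.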
"S3 is unavailable because X, Y, and Z services are unavailable."
A graph of dependencies between services is surely known to AWS; if not, they ought to create one post-haste.
Trying to externalize Amazon's internal AWS politicking over which service is down is unproductive to the customers who check the dashboard and see that their service ought to be up, but... well, it isn't?
Because those same customers have to explain to their clients and bosses why their systems are malfunctioning, yet it "shows green" on a dashboard somewhere that almost never shows red.
(And I can levy this complaint against Azure too, by the way.)
Yes, I can envision a (simplified) AWS X-Ray dashboard showing the relationships between the systems and the performance of each one. Then we could see at a glance what was going on. Almost anything is better than that wall of text, tiny status images, and RSS feeds.
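Even a toy dependency map gets you most of the way to "S3 is unavailable because STS is unavailable." A rough sketch (the service names, dependencies, and statuses below are invented for illustration):

```python
# Hypothetical dependency map: each service lists the services it needs.
DEPENDS_ON = {
    "s3-api": ["sts", "dynamodb-internal"],
    "console": ["sts"],
    "sts": [],
    "dynamodb-internal": [],
}

# Raw health as reported by each service's own monitoring.
RAW_STATUS = {"s3-api": "up", "console": "up", "sts": "down", "dynamodb-internal": "up"}

def effective_status(service: str, seen=None) -> str:
    """A service is effectively degraded if it, or anything it depends on, is down."""
    seen = seen or set()
    if service in seen:                      # guard against cycles in the map
        return RAW_STATUS[service]
    seen.add(service)
    if RAW_STATUS[service] == "down":
        return "down"
    broken = [d for d in DEPENDS_ON[service] if effective_status(d, seen) != "up"]
    if broken:
        return f"degraded (depends on: {', '.join(broken)})"
    return "up"

for svc in DEPENDS_ON:
    print(svc, "->", effective_status(svc))
# s3-api -> degraded (depends on: sts)
# console -> degraded (depends on: sts)
# sts -> down
# dynamodb-internal -> up
```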
Later on in the process, you could do something like this. When you know what else is impacted and how that looks to your customers. But by then the problem is most likely over or at least on the way to being fixed. And hours may have gone by before you get to that point.
Early in the process, when you’re flying blind because you don’t know what’s going on around you and you look at your own systems and they appear to be fine, you can’t really say anything useful.
These weird edge cases are hard to adjudicate because they’ve never happened before — otherwise fixes would already be in place to prevent them. And nothing quite like them has ever before happened at this scale.
I understand the frustration, but when everything you think you know turns out to be wrong, or at least you are unable to confirm whether it’s right or wrong, what do you do?
Read the RCA: when AWS got to that point, they did actually update the SHD with a banner across the top of the page, but that ended up causing even more problems. There’s a reason why you try to do these sorts of things safely, which may mean using manual methods in some cases. And sometimes even those safe manual methods have their own weird side effects.
Sometimes shit is hard. Sometimes you run into problems like no one else on the planet has ever experienced before, and you have to figure out what the laws of physics are in this new part of the world as you go about trying to fix whatever it was that broke or acted in an unexpected manner.
Disclaimer: my opinions are my own and are not necessarily shared or reflective of my employer.