"Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST."
This to me is really bad. Even as a small company, we keep our support infrastructure separate. For a company of Amazon's size, this is a shitty excuse. If I cannot even reach you as a customer for almost 7 hours, that is just nuts. AWS must do better here.
Also, is it true that the outage/status pages are manually updated? If yes, there is no excuse for the page staying green that long. If you are manually updating it, please update it ASAP.
I know a few tiny ISPs that host their VoIP server and email server outside of their own ASN so that in the event of a catastrophic network event, communication with customers is still possible... Not saying Amazon should do the same, but the general principle isn't rocket science.
We moved our company's support call system to Microsoft Teams when lockdowns were happening, and even that was affected by the AWS outage (along with our SaaS product hosted on AWS).
It turned out our call center supplier had something running on AWS, and it took out our entire phone system. After this situation settles, I'm tempted to ask my supplier to see what they're doing to get around this in the future, but I doubt even they knew that AWS was used further downstream.
AWS now operates a lot like Amazon.com, the marketplace: you can try to escape it, but it's nearly impossible. If you want to ban usage of Amazon's services, you're going to find some service (AWS) or even a Shopify site (FBA warehouse) that uses it.
Wasn't this the Bezos directive early on that created AWS? Anything that was created had to be a service with an API. Not allowed to reinvent the wheel. So AWS depends on AWS.
My favourite is when some company migrates their physical servers to virtual machines, including the AD domain controllers. Then the next step is to use AD LDAP authentication for the VM management software.
When there's a temporary outage and the VMs don't start up as expected, the admins can't log on and troubleshoot the platform because the logon system was running on it... but isn't now.
The loop is closed.
You see this all the time, especially with system-management software. They become dependent on the systems they're managing, and vice-versa.
If you care about availability at all, make sure to have physical servers providing basic services like DNS, NTP, LDAP, RADIUS, etc.
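For what it's worth, this kind of closed loop can be caught on paper before it bites. Below is a rough sketch, not anyone's actual tooling: it treats "what do I need in order to recover X" as a directed graph and runs a depth-first search for a cycle. The service names and the `find_cycle` helper are made up for illustration, and the edges are my own reading of the VM/AD scenario above.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # nodes on the current DFS path, in order

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so the loop is closed
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical recovery dependencies for the virtualized-AD scenario:
# "A": ["B"] means "A cannot be brought back without B".
closed_loop = {
    "vm-management-login": ["ad-ldap"],
    "ad-ldap": ["domain-controller-vm"],
    "domain-controller-vm": ["vm-platform"],
    "vm-platform": ["vm-management-login"],
}

# Same setup, but with a domain controller on physical hardware.
with_physical_dc = {
    "vm-management-login": ["ad-ldap"],
    "ad-ldap": ["domain-controller-physical"],
    "domain-controller-physical": [],
    "vm-platform": ["vm-management-login"],
}

print(find_cycle(closed_loop))
print(find_cycle(with_physical_dc))
```

The first call reports the loop (login needs LDAP, LDAP needs the DC VM, the DC VM needs the platform, the platform needs the login); the second returns `None` because the physical domain controller breaks the chain. The hard part in real life is not the graph search, it's knowing the edges, which is exactly the problem with downstream suppliers mentioned earlier in the thread.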