Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and couldn't be loaded. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they were unable to log in to the status page CMS. Now this.
They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.
Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.
Maybe AWS should host their status checks in Azure and vice versa... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor.
My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three consecutive outages, they would be moved to a position more appropriate for their skills.
That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix it for even manual alerts.
>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call
But,
>Be incapable of querying the status of such things
As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
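To make the "spin up a client" idea concrete, here is a rough sketch of what such a check could look like. It is not how AWS or anyone else actually does it; the function names, thresholds, and status labels are all made up for illustration, and a real probe would exercise an actual API operation rather than a plain HTTP GET.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_get(url: str, timeout: float) -> int:
    """Fetch url and return the HTTP status code (4xx/5xx included)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; we still want the code itself.
        return err.code


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_get,
    timeout: float = 5.0,      # hypothetical threshold
    slow_after: float = 2.0,   # hypothetical threshold
) -> str:
    """Act as a client of the service and classify what we saw.

    Returns "ok", "degraded", or "down" (labels are illustrative).
    """
    start = time.monotonic()
    try:
        code = fetch(url, timeout)
    except Exception:
        # Timeouts, DNS failures, refused connections: the client can't get in.
        return "down"
    elapsed = time.monotonic() - start

    if code >= 500:
        return "down"
    if code >= 400 or elapsed > slow_after:
        return "degraded"
    return "ok"


# Usage (hypothetical endpoint), run from somewhere outside the platform:
#   print(probe("https://service.example.com/health"))
```

The `fetch` parameter is only there so the check can be exercised without a network. The important part is where it runs: the probe and whatever publishes its result have to live on infrastructure that doesn't share a failure domain with the thing being checked.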
Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.