
> The entire time their outage board was solid green

Unless you're talking about some board other than the Service Health Dashboard, this isn't true. They dropped EC2 down to degraded pretty early on. I bemusedly noted in our corporate Slack that every time I refreshed the SHD, another service was listed as degraded. Then they added the giant banner at the top. Their slight delay in updating the SHD at the beginning of the outage is mentioned in the article. It was absolutely not all green for the duration of the outage.




That is not true. It was hours before they started annotating any kind of service issue. Maybe from when you noticed there was a problem it appeared quick, but the board remained green for a large portion of the outage.


No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.


At the large platform company where I work, our policy is that if a customer reports an issue before our internal monitoring catches it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update, and that adds up to 30 minutes end to end, with a healthy buffer at each step.

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.
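
If you wanted that picture from the customer side, a rough sketch like the following would do it: probe the EC2 API yourself and log latency and errors. This assumes boto3 with credentials configured; DescribeInstances, the region, and the 60-second interval are just placeholders.

    # Rough sketch: client-side probing of the EC2 API to log error rate and latency.
    # Assumes boto3 is installed and credentials are configured.
    import time
    import boto3
    from botocore.config import Config
    from botocore.exceptions import ClientError, EndpointConnectionError

    ec2 = boto3.client(
        "ec2",
        region_name="us-east-1",  # placeholder region
        config=Config(retries={"max_attempts": 1}),  # keep retries minimal so failures surface quickly
    )

    while True:
        start = time.monotonic()
        try:
            ec2.describe_instances(MaxResults=5)  # arbitrary cheap read-only probe call
            outcome = "ok"
        except (ClientError, EndpointConnectionError) as exc:
            outcome = f"error: {exc}"
        latency_ms = (time.monotonic() - start) * 1000
        print(f"{time.strftime('%H:%M:%S')} {outcome} {latency_ms:.0f}ms")
        time.sleep(60)

Feed that output into whatever graphing you already have and you get roughly the error-rate and latency picture I'm wondering about, from your own vantage point.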


> our policy is that if a customer reports an issue before our internal monitoring catches it

They discovered it right away; the Service Health Dashboard just wasn't updated. Source: the linked article.


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes you’re right, there’s no reason to question their monitoring or alerting specifically.


We saw the timing as described: the dashboard updates started about an hour after the problem began (which we noticed immediately, since 7:30 AM Pacific is the middle of the day for those of us on Eastern time). I don't know whether there was an issue with browser caching or something similar, but once the updates started, everyone here had no trouble seeing them, and my RSS feed monitor picked them up around that time as well.
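
The RSS monitor is nothing fancy, by the way; something along these lines works with just the standard library. The per-service feed URL below is from memory and may have changed, so treat it as a placeholder for whichever SHD feed you actually watch.

    # Sketch of polling a status-page RSS feed and printing new items, stdlib only.
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"  # assumed/placeholder feed URL
    seen = set()

    while True:
        with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
            root = ET.fromstring(resp.read())
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            published = item.findtext("pubDate", default="")
            if title and title not in seen:
                seen.add(title)
                print(f"{published}  {title}")
        time.sleep(300)  # poll every 5 minutes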


Multiple services I use were totally skunked, and none were ever anything but green.

SageMaker, for example, was down all day, leaving me dead in the water on a modeling project that required GPUs. It depends on EC2, but nobody there even thought to update its status? WTF. This is clearly executives incentivized to let a bug persist, because the bug is actually a feature: it misleads customers and maximizes profits.



