
No, it was about an hour. We were aware from the moment EC2 API error rates began to climb, around 10:30 Eastern, and by 11:30 the dashboard was updating. That timing is mentioned in the article, and it all happened in the middle of our workday on the East Coast. The outage then continued for about 7 hours, with ongoing Service Health Dashboard (SHD) updates. I suspect we both agree on how long it took them to start updating; I just conclude that an hour wasn't so bad.


At the large platform company where I work, our policy is that if a customer reports an issue before our internal monitoring catches it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update: that adds up to 30 minutes end to end, with a healthy buffer at each step.
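Spelled out as a quick sketch, just to show the arithmetic (the stage durations are the ones from the paragraph above; the labels and the whole script are illustrative, not our actual tooling):

    # Cumulative time budget for getting a status-page update out,
    # using the per-stage minutes quoted above.
    budget = [
        ("alerting lag", 5),
        ("evaluate magnitude of impact", 10),
        ("craft the content and get it approved", 10),
        ("execute the update", 5),
    ]

    elapsed = 0
    for stage, minutes in budget:
        elapsed += minutes
        print(f"{elapsed:>3} min cumulative, after: {stage}")

    print(f"total: {elapsed} min end to end")

Which lands at 30 minutes total, comfortably under the ~52 minutes the article reports.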

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.


> our policy is that if a customer reports an issue before our internal monitoring catches it

They discovered it right away; it was just the Service Health Dashboard that wasn't updated. Source: link.


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes, you’re right: there’s no reason to question their monitoring or alerting specifically.



