
No, it was about an hour. We were aware from the moment EC2 API error rates began to climb, around 10:30 Eastern, and by 11:30 the dashboard was updating. That timing is mentioned in the article, and it all happened in the middle of our workday on the East Coast. The outage then continued for about 7 hours, with ongoing Service Health Dashboard (SHD) updates. I suspect we both agree on how long it took them to start updating; I just conclude that an hour wasn't so bad.


At the large platform company where I work, our policy is that if a customer reports an issue before our internal monitoring catches it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update: that adds up to 30 minutes end to end, with a healthy buffer at each step.
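Spelled out as a quick sketch, just to show the arithmetic (the stage durations are the ones from the paragraph above; the labels and the whole script are illustrative, not our actual tooling):

    # Cumulative time budget for getting a status-page update out,
    # using the per-stage minutes quoted above.
    budget = [
        ("alerting lag", 5),
        ("evaluate magnitude of impact", 10),
        ("craft the content and get it approved", 10),
        ("execute the update", 5),
    ]

    elapsed = 0
    for stage, minutes in budget:
        elapsed += minutes
        print(f"{elapsed:>3} min cumulative, after: {stage}")

    print(f"total: {elapsed} min end to end")

Which lands at 30 minutes total, comfortably under the ~52 minutes the article reports.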

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.


> our policy is that if a customer reports an issue before our internal monitoring catches it

They discovered it right away; it was just the Service Health Dashboard that wasn't updated. Source: link.


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes, you’re right: there’s no reason to question their monitoring or alerting specifically.



