
> The entire time their outage board was solid green

Unless you're talking about some board other than the Service Health Dashboard, this isn't true. They dropped EC2 down to degraded pretty early on. I bemusedly noted in our corporate Slack that every time I refreshed the SHD, another service was listed as degraded. Then they added the giant banner at the top. Their slight delay in updating the SHD at the beginning of the outage is mentioned in the article. It was absolutely not all green for the duration of the outage.




That is not true. It was hours before they started annotating any kind of service issue. Maybe from when you noticed there was a problem it appeared quick, but the board remained green for a large portion of the outage.


No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.


At the large platform company where I work, our policy is that if a customer reports an issue before our internal monitoring catches it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update, and that adds up to 30 minutes end to end, with a healthy buffer at each step.

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.
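
If you wanted that picture from the customer side, a rough sketch like the following would do it: probe the EC2 API yourself and log latency and errors. This assumes boto3 with credentials configured; DescribeInstances, the region, and the 60-second interval are just placeholders.

    # Rough sketch: client-side probing of the EC2 API to log error rate and latency.
    # Assumes boto3 is installed and credentials are configured.
    import time
    import boto3
    from botocore.config import Config
    from botocore.exceptions import ClientError, EndpointConnectionError

    ec2 = boto3.client(
        "ec2",
        region_name="us-east-1",  # placeholder region
        config=Config(retries={"max_attempts": 1}),  # keep retries minimal so failures surface quickly
    )

    while True:
        start = time.monotonic()
        try:
            ec2.describe_instances(MaxResults=5)  # arbitrary cheap read-only probe call
            outcome = "ok"
        except (ClientError, EndpointConnectionError) as exc:
            outcome = f"error: {exc}"
        latency_ms = (time.monotonic() - start) * 1000
        print(f"{time.strftime('%H:%M:%S')} {outcome} {latency_ms:.0f}ms")
        time.sleep(60)

Feed that output into whatever graphing you already have and you get roughly the error-rate and latency picture I'm wondering about, from your own vantage point.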


> our policy is that if a customer reports an issue before our internal monitoring catches it

They discovered it right away; the Service Health Dashboard just wasn't updated. Source: the linked article.


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes you’re right, there’s no reason to question their monitoring or alerting specifically.


We saw the timing as described: the dashboard updates started about an hour after the problem began (which we noticed immediately, since 7:30 AM Pacific is the middle of the day for those of us on Eastern time). I don't know whether there was an issue with browser caching or something similar, but once the updates started, everyone here had no trouble seeing them, and my RSS feed monitor picked them up around that time as well.
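
The RSS monitor is nothing fancy, by the way; something along these lines works with just the standard library. The per-service feed URL below is from memory and may have changed, so treat it as a placeholder for whichever SHD feed you actually watch.

    # Sketch of polling a status-page RSS feed and printing new items, stdlib only.
    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"  # assumed/placeholder feed URL
    seen = set()

    while True:
        with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
            root = ET.fromstring(resp.read())
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            published = item.findtext("pubDate", default="")
            if title and title not in seen:
                seen.add(title)
                print(f"{published}  {title}")
        time.sleep(300)  # poll every 5 minutes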


Multiple services I use were totally skunked, and none were ever anything but green.

SageMaker, for example, was down all day, leaving me dead in the water on a modeling project that required GPUs. It depends on EC2, but nobody there even thought to update its status? WTF. This is clearly executives incentivized to let a bug persist, because the bug is actually a feature: it misleads customers and maximizes profits.



