At the large platform company where I work, our policy is if the customer reported the issue before our internal monitoring caught it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, 5 minutes to execute the update, adds up to 30 minutes end to end with healthy buffer at each step.
1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.
1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.