The complexity that AWS has to deal with is astounding. Sure, having separate production and management networks is common. But making sure each of them scales without taking the other down is, I think, what they are wrestling with here.
It must have been crazy hard to troubleshoot when you are flying blind because all your monitoring is unresponsive. Clearly more isolation, with well-delineated information exchange points, is needed.
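For what it's worth, the pattern that helps here is pinning telemetry to its own network path with its own hard timeouts, so the monitoring plane can fail independently instead of wedging alongside production. A minimal sketch in Go, assuming a hypothetical out-of-band collector endpoint and a management-plane source IP (both addresses below are invented for illustration):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// oobMonitorURL is an assumed out-of-band collector address, made up for illustration.
const oobMonitorURL = "http://198.51.100.10:9100/heartbeat"

// newOOBClient builds an HTTP client pinned to a management-plane source
// address, so heartbeats never ride the production network.
func newOOBClient(srcIP string) *http.Client {
	dialer := &net.Dialer{
		LocalAddr: &net.TCPAddr{IP: net.ParseIP(srcIP)},
		Timeout:   2 * time.Second,
	}
	return &http.Client{
		// Hard cap: monitoring must fail fast rather than queue up and
		// back-pressure the thing it is supposed to be watching.
		Timeout:   3 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}
}

func main() {
	client := newOOBClient("192.0.2.7") // assumed management-plane source IP
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, oobMonitorURL, nil)
		if err != nil {
			cancel()
			continue
		}
		if resp, err := client.Do(req); err != nil {
			// A lost heartbeat is tolerable; blocking production traffic is not.
			fmt.Println("heartbeat failed:", err)
		} else {
			resp.Body.Close()
		}
		cancel()
	}
}
```

The point of the sketch is the two timeouts: the monitoring path is allowed to fail, quickly and loudly, without ever becoming something production has to wait on.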
“But AWS has more operations staff than I could ever hope to hire” — a common refrain whenever moving to the cloud comes up.
I’m not saying I fully disagree. But consolidating the world’s hosting onto a handful of providers necessitates a very complicated platform, and these things will happen, whether due to that complexity, failures that can’t be foreseen, or good old-fashioned Sod’s law.
I know AWS marketing wants you to believe it’s all magic and rainbows, but it’s still computers.
I work for one of the Big 3 cloud providers and it’s always interesting when giving RCAs to customers. The vast majority of our incidents are due to bugs in the “magic” components that allow us to operate at such a massive scale.