
AWS has similarly suffered outages from an entire datacenter being taken out like this. No one is immune. If you want true fault tolerance you need to be multi-regional (everyone says as much) and, ideally, multi-continental.
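For concreteness, here is a minimal sketch of what client-side failover across regions can look like. The endpoints, path, and region names are hypothetical placeholders, not a description of any provider's actual API:

    import urllib.request
    import urllib.error

    # Hypothetical per-region endpoints for the same service; the URLs are
    # placeholders, not real deployments.
    REGION_ENDPOINTS = [
        "https://api.europe-west9.example.com",  # primary (Paris)
        "https://api.europe-west1.example.com",  # in-continent fallback
        "https://api.us-east1.example.com",      # cross-continental fallback
    ]

    def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
        """Try each regional endpoint in order and return the first success."""
        last_error = None
        for base in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, OSError) as exc:
                last_error = exc  # region unreachable or slow; try the next one
        raise RuntimeError(f"all regions failed: {last_error}")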

europe-west9 is the only large Google datacenter in France afaik. Building more would cost a lot more money, and it seems like the market isn't there for it. Workloads that require data locality in France are presumably suffering the most. And there are knock-on effects on other datacenters from losing such a huge chunk of capacity.




Eh, source for that? AWS has had issues where a single zone caused such a lack of capacity in the region that some multi-zone services degraded to the point of a domino failover. However, I've not heard of any AWS event where a fire/flood in AZ A also caused a fire/flood in AZ B.


But does it really matter whether the incident is a flood or a cascading software failure if the likelihood and severity are the same?

Being in the same building is an "implementation detail" from a customer perspective; what matters is the consequences of that decision.

For example, maybe this decision allows for better network connectivity at a lower cost for inter-zone traffic while, on the other hand, not protecting against some classes of risk.

In the end, you can have a similar multi-zone outage keeping the region down for an extended period of time just because of a bad network config push (see the massive Facebook outage in 2021). As a customer, I don't care if it's a flood or a network outage.

Imho, what matters most is clear documentation of how these abstractions work for users and the corresponding contractual agreements (costs, SLAs, etc.). Users can thus decide whether they are ready to pay the price of protecting themselves against an extended outage impacting a single region.
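As a back-of-envelope illustration of that trade-off (all numbers below are made up for the sake of the example):

    # Does paying extra for a multi-region setup beat absorbing a rare
    # regional outage? Illustrative, made-up numbers only.
    single_region_cost = 10_000           # monthly infra cost, single region
    multi_region_premium = 0.6            # extra 60% for a second region + traffic
    outage_probability_per_month = 0.002  # chance of an extended regional outage
    outage_business_loss = 2_000_000      # revenue/reputation hit if it happens

    expected_outage_loss = outage_probability_per_month * outage_business_loss
    multi_region_extra_cost = single_region_cost * multi_region_premium

    print(f"expected monthly loss if single-region: {expected_outage_loss:,.0f}")   # 4,000
    print(f"extra monthly cost of multi-region:     {multi_region_extra_cost:,.0f}")  # 6,000
    # With these numbers single-region "wins" on average, but the variance
    # (one catastrophic month) is exactly what an SLA credit does not cover.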


It absolutely does matter.

The MTTR for outages caused by physical damage is way higher, and resiliency against physical disasters is a major selling point of availability zones as a fault container.

Hosting every zone of your region (if that's actually the case here) in the same building is simply negligent.

Besides the obvious risks like this incident, even if the zones have physical fire barriers, the chances that operators will be allowed into one "zone" after another one has had a fire are slim to none.


True, I implicitly included the MTTR in the "severity", but this is actually a different thing (severity is more about the impact radius).

But I don't think it changes my point: knowing what/how Google Cloud designs regions or zones is still an implementation detail; what matters is what MTTR they are targeting, and this should be known ahead of time.

There are so many "implementation details" that customers are not aware of because they are always changing, non-contractual, or just hard to make sense of; what matters is meaningful abstractions.

I am not saying it's OK for the zones to be in the same building; I don't know, and I was really surprised when I discovered this a few years ago. But this information gives you a mental model of "what could go wrong" that is biased towards some specific risks, and in my experience, relying on these very practical aspects makes risk analysis and design decisions harder.

Otoh, one thing that may also be problematic (and biasing) is that the commonly understood definition of a "zone" is the one people know from AWS, so using the same term without being very explicit about the differences will also lead to incorrectly calculated risks. I find Google Cloud's public documentation too vague in general (and often ambiguous).


Scale of impact, scope of impact, and duration of impact are orthogonal. Conflating them makes productive discussion impossible, IMO.

But back to the point: philosophically I agree, but practically I don't. IMO, having SLAs and enforceable guarantees that give customers the information they need is much harder than exposing the implementation details.

"Zones within a region may be located in the same building" is much more concise than SLA's using contractual language, and probably conveys more (though potentially less accurate) information once I apply my context.

Also, if we look at GCP's SLAs, this outage blew the SLA breach threshold out of the water for many services. Some are pushing two nines of downtime from this incident alone.

Finally (in hindsight maybe I should have led with this, but I'm too lazy to restructure this comment), SLAs are a joke. Outages can destroy your business, but all you get from your cloud provider is a credit for, usually, a small fraction of what they charge you. They have no teeth, so if you can't just write off a major outage, you have to have a plan to avoid it, which means you need to know the implementation details.
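To put rough numbers on that (the credit tier below is an illustrative assumption, not quoted from any particular provider's SLA):

    # Rough availability math for an extended outage, plus what a typical
    # SLA credit looks like next to it.
    hours_in_month = 30 * 24     # 720
    outage_hours = 24            # e.g. one full day of a region effectively down

    availability = 1 - outage_hours / hours_in_month
    print(f"monthly availability: {availability:.2%}")  # ~96.67%, below two nines

    monthly_bill = 5_000
    credit_fraction = 0.25       # assumed credit tier for dropping under ~99%
    print(f"SLA credit: {monthly_bill * credit_fraction:,.0f} "
          f"vs. whatever the outage actually cost the business")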


Seems the likelihood isn't the same. AWS separates AZs physically; GCP does not. I'd want to know this as a customer, not be handed an abstraction.


It sounds like this might just be confusion over nomenclature, with Google and Amazon using different terms for the same thing.

Regardless, with GCP, if you need redundancy that can survive the loss of an entire datacenter, then you need to be multi-regional. This has been a widely known best practice for a long time.



