I wonder if this is a coincidence or if us-east-1 is simply down often enough that I'm just experiencing selection bias, but I posted a poll on Twitter earlier today: https://twitter.com/dijit/status/1668678588713824257
Contents:
> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?
I guess a demanding customer would have said 'you should have implemented disaster recovery so you could fail over to us-east-2', but that's easier said than done. The more regional AWS services you adopt, the bigger the impact is. How does one recover from a regional outage if their pipeline is in that region?
What I did once I was in the position of _having_ to provide that level of support, was to run the pipeline in a third region, different from the "prod" ones. That way, worst case you can't do deployments during the outage...
Another alternative we studied was to use a third-party CI/CD service, outside of our network. It was discarded because you never know where that would actually run.
> It was discarded because you never know where that would actually run
Yep, I considered that switching to GitHub Actions would _theoretically_ eliminate the need for disaster recovery for CI/CD (since the handling of disasters is out of your hands) but in practice their SLA is far worse than just running CodePipeline in a single region.
Yeah, that's why we went with a third region instead. But, at the end of the day, if _only_ changes are affected for a couple of hours, that wouldn't impact the service that much
I’ve worked for several systemically important megacorps where certain things had to not only run cross region but also cross provider. It’s absurdly difficult, and should only be done if you need five or more nines of availability. Almost nothing actually does.
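For a sense of scale, here is a rough sketch of the downtime budget each extra nine leaves you per year. It assumes a 365-day year and reads "N nines" as an availability of 1 - 10^-N; the function name is made up for illustration.

```python
def downtime_budget_seconds(nines: int, period_days: float = 365.0) -> float:
    """Allowed downtime in seconds over the period for an availability of N nines."""
    # Assumption: "N nines" means availability = 1 - 10**-N (e.g. 3 nines = 99.9%).
    unavailability = 10.0 ** -nines
    return period_days * 24 * 60 * 60 * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        minutes = downtime_budget_seconds(n) / 60
        print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, which is why a single regional outage of a couple of hours blows the budget for decades.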
it's important to inform customers about the resiliency of their systems and let them pick how far they are going to invest for it.
then you get to eat popcorn when stuff explodes.
* single server event. $
* multi server event. $$
* single az event. $$$
* multi az event. $$$$
* global provider event. $$$$$
* cross provider event. $$$$$$
* alien invasion. $$$$$$$$$$$$$$
Back when we had servers in an onsite DC we lost a RAID card and the system I was developing went down. We had the fancy support so a tech was out with the card replaced in a couple of hours, then we had to restore from tape backup. All in all, a non-critical system was down for most of a business day. My boss's boss stormed in, upset he couldn't pull a report, and asked how we could prevent this in the future. I responded that at a minimum we had to double the cost for a hot standby, and he said 'never mind' and walked out.
Multi-planetary-AZ DB cluster deployments. Putting the emphasis on "Eventual" in eventual consistency.
Go for a walk before retrying reading from this replica!
And you thought the current time zone confusion was bad. Now you have two sets of time zones, and a varying delay of about 5 to 21 minutes between them. Oh, the joy!
My employer lets customers choose which of our supported regions to run in and exempts cloud provider outages from our SLA (we’re on the hook for staying up for single AZ outages, but not multi AZ or region outages). We provide tools to help customers replicate their data so they can be multi-region or even multi provider if they want to.
AZs don't really help when it's AWS' own services across the entire region that break. Anecdotally, we have had customers accept outages that were out of our control without penalty.
Wild, that wouldn't have flown with datacenter providers having issues for my previous companies.
AWS really does have an easier time than old school datacenter providers. I guess the complexity is higher but it's shocking that they can charge so much yet we hold them to a lower standard.
DCs are pretty static and offer way fewer services than AWS or any other public cloud.
I worked for one for some time and whenever we had issues, some people would call and ask if we were going bankrupt. It gave me a feeling they also have way smaller customers that might not understand the underlying stack.
If all you use in AWS is static EC2 instances you would have to go back a looooong time to find an outage which affected their availability. Even in us-east-1.
Outage rates are also wildly different. When you're using dozens of managed services and have a few prod-impacting outages with any reasonable (cross-AZ) design, customers are less sensitive than when they are dependent on dozens of products that have independent failure modes with potentially cascading impact.
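A back-of-the-envelope sketch of how independent failure modes stack up, assuming every dependency is a hard dependency and failures are uncorrelated (both simplifications). The 99.9% figure and the count of 12 are illustrative numbers, not any provider's SLA.

```python
import math
from typing import Iterable


def composite_availability(availabilities: Iterable[float]) -> float:
    """Availability of a system that needs every dependency up at the same time.

    Assumption: failures are independent, so the probabilities simply multiply.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    # Hypothetical stack: 12 hard dependencies, each available 99.9% of the time.
    combined = composite_availability([0.999] * 12)
    print(f"combined availability: {combined:.3%}")
```

Under those assumptions a dozen three-nines dependencies land you at roughly 98.8% overall, well below what any single one of them promises.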
AZs also don't help with natural disasters at all. I believe AWS is the only one doing geographically distributed AZs, for the others it just means different connections and placed somewhere else in the building.
edit: turns out AWS is the one with geo distribution, not Azure
AWS AZs are also distributed geographically within a region, with separate power and network lines. From the docs: “Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones.”
Ah, you are probably right. I was thinking of the incident a few weeks back where the fire suppression took out multiple AZs, but that was actually GCP.
> Has anyone ever actually had customers accept an outage because AWS was down...
Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and if they are on a contract tier that it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]
Firebase Auth, for instance, offers no SLA at all [1].
I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.
Contents:
> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?
> [ ] Yeah, outages free pass
> [ ] No, they say to use AZ's