
I wonder if this is a coincidence, or if us-east-1 is simply down often enough that I'm just experiencing selection bias; either way, I posted a poll on Twitter earlier today: https://twitter.com/dijit/status/1668678588713824257

Contents:

> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?

> [ ] Yeah, outages free pass

> [ ] No, they say to use AZ's




> No, they say to use AZ's

Using 3 AZs in us-east-1 won't save you.

I guess a demanding customer would have said 'you should have implemented disaster recovery so you could fail over to us-east-2', but that's easier said than done. The more regional AWS services you adopt, the bigger the impact. How does one recover from a regional outage if their deployment pipeline is in that region?


What I did, once I was in the position of _having_ to provide that level of support, was to run the pipeline in a third region, different from the "prod" ones. That way, worst case, you can't do deployments during the outage...
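
What that looks like in practice, as a minimal sketch (boto3 here is just for illustration; the region and pipeline names are placeholders, not what we actually ran):

  import boto3

  # CI/CD control plane pinned to a third region, separate from both prod regions,
  # so a prod-region outage only blocks deployments, not the running service.
  PIPELINE_REGION = "eu-west-1"                # hypothetical "third" region
  PROD_REGIONS = ["us-east-1", "us-east-2"]    # where the service actually runs

  codepipeline = boto3.client("codepipeline", region_name=PIPELINE_REGION)

  def trigger_deploy(pipeline_name: str) -> str:
      # Kick off the pipeline from the third region; its deploy stages
      # target the prod regions.
      response = codepipeline.start_pipeline_execution(name=pipeline_name)
      return response["pipelineExecutionId"]

(If I remember right, cross-region CodePipeline actions also need an artifact bucket in each target region, which is part of why "just move it to a third region" is more work than it sounds.)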

Another alternative we studied was to use a third-party CI/CD service outside of our network. It was discarded because you never know where that would actually run.


> It was discarded because you never know where that would actually run

Yep, I considered switching to GitHub Actions, which would _theoretically_ eliminate the need for disaster recovery for CI/CD (since handling disasters is out of your hands), but in practice their SLA is far worse than just running CodePipeline in a single region.


Yeah, that's why we went with a third region instead. But, at the end of the day, if _only_ the ability to ship changes is affected for a couple of hours, that doesn't impact the service much.


I’ve worked for several systemically important megacorps where certain things had to run not only cross-region but also cross-provider. It’s absurdly difficult, and should only be done if you need five or more 9s of availability. Almost nothing actually does.
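
For context, a quick back-of-the-envelope on what those availability targets actually allow (plain Python, purely illustrative):

  # Allowed downtime per year at a given availability target.
  MINUTES_PER_YEAR = 365.25 * 24 * 60

  for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
      allowed = (1 - availability) * MINUTES_PER_YEAR
      print(f"{nines} nines: ~{allowed:.1f} minutes of downtime per year")

  # prints roughly 526, 52.6 and 5.3 minutes respectively

A single region-wide incident blows a ~5-minute annual budget many times over, which is why cross-provider setups exist, and also why so few workloads can justify them.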


It's important to inform customers about the resiliency of their systems and let them pick how far they're willing to invest in it.

Then you get to eat popcorn when stuff explodes.

  * single server event.   $
  * multi server event.    $$
  * single az event.       $$$
  * multi az event.        $$$$
  * global provider event. $$$$$
  * cross provider event.  $$$$$$
  * alien invasion.        $$$$$$$$$$$$$$


Back when we had servers in an onsite DC, we lost a RAID card and the system I was developing went down. We had the fancy support, so a tech was out and the card was replaced within a couple of hours; then we had to restore from tape backup. All in all, a non-critical system was down for most of a business day. My boss's boss stormed in, upset he couldn't pull a report, and asked how we could prevent this in the future. I responded that, at a minimum, we'd have to double the cost for a hot standby, and he said 'never mind' and walked out.


That sounds nice. My boss's boss is usually the one storming in, and he usually says "okay let's do it", and then I have to implement it in a week...


That’s why you always BOFH the estimates to include some fun toys for yourself, too.


"Briefly describe the '$$$$$$$' through '$$$$$$$$$$$$$' situations. Can't leave money lying on the table."

- memo from Enterprise Sales Dept.


Alien-invasion resistance is spy-novel / Agents of S.H.I.E.L.D. territory: multiple redundant, isolated, off-the-books safehouses and bases.

Short of the alien-invasion level, you have strategic military levels of resistance to global/regional wars, with differing levels of weaponry and devastation.


Just need to deploy your service on Mars AND Earth. Duh


Multi-planetary-AZ DB cluster deployments. Putting the emphasis on "Eventual" in eventual consistency. Go for a walk before retrying the read from this replica!


And you thought the current time zone confusion was bad. Now you have two sets of time zones, and a varying delay of about 5 to 21 minutes between them. Oh, the joy!
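
(For the curious: the one-way light delay is easy to ballpark from the Earth-Mars distance extremes; the figures below are the theoretical min/max, so real delays land somewhere in between.)

  # One-way light delay between Earth and Mars at the orbital distance extremes.
  SPEED_OF_LIGHT_KM_S = 299_792.458
  distances_km = {"closest approach": 54.6e6, "farthest separation": 401e6}  # approximate

  for label, km in distances_km.items():
      minutes = km / SPEED_OF_LIGHT_KM_S / 60
      print(f"{label}: ~{minutes:.0f} minutes one way")

  # closest approach: ~3 minutes one way; farthest separation: ~22 minutes one way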


note to self: synchronous replication may be a problem


Always be prepared for alien invasion


This should be logarithmic


The nice thing is that any graph without a unit can be log-scale - so in a way, it already is.


My employer lets customers choose which of our supported regions to run in and exempts cloud-provider outages from our SLA (we’re on the hook for staying up through single-AZ outages, but not multi-AZ or region-wide outages). We provide tools to help customers replicate their data so they can be multi-region or even multi-provider if they want to.


AZs don't really help when it's AWS's own services breaking across the entire region. Anecdotally, we have had customers accept outages that were out of our control, without penalty.


Wild. At my previous companies, that wouldn't have flown when datacenter providers had issues.

AWS really does have an easier time than old-school datacenter providers. I guess the complexity is higher, but it's shocking that they can charge so much and yet we hold them to a lower standard.


DCs are pretty static and offer way fewer services than AWS or any other public cloud.

I worked for one for some time, and whenever we had issues, some people would call and ask if we were going bankrupt. It gave me the feeling they also have much smaller customers who might not understand the underlying stack.


If all you use in AWS is static EC2 instances, you would have to go back a looooong time to find an outage that affected their availability. Even in us-east-1.


December 22, 2021 was the last partial impact we had in us-east-1 for EC2 instances. They had power issues in USE1-AZ4 that took a while to sort out.


Outage rates are also wildly different. When you're using dozens of managed services with any reasonable (cross-AZ) design and still have a few prod-impacting outages, customers are less sensitive than when they depend on dozens of products that have independent failure modes with potentially cascading impact.


AZs also don't help with natural disasters at all. I believe AWS is the only one doing geographically distributed AZs; for the others it just means separate connections and being placed somewhere else in the building.

edit: turns out AWS is the one with geo distribution, not Azure


AWS AZs are also distributed geographically within a region, with separate power and network lines. From the docs: “Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones.”


Ah, you are probably right. I was thinking of the incident a few weeks back where the fire suppression took out multiple AZs, but that was actually GCP.


Depends on your customers.

If your customers are tech, they're too busy running around with their hair on fire, too.


> Has anyone ever actually had customers accept an outage because AWS was down...

Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and whether they are on a contract tier that it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]

Firebase Auth, for instance, offers no SLA at all [1].

I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.

[1]: https://stackoverflow.com/a/60500860/149428


I can think of more times where a whole AZ has had issues than times where just one AZ went dark and failover happened seamlessly.


s/whole AZ/whole region/


Maybe cheaper regions have more users and higher outage rates.





