
I wonder if this is a coincidence, or if us-east-1 is simply down often enough that I'm just experiencing selection bias; either way, I posted a poll on Twitter earlier today: https://twitter.com/dijit/status/1668678588713824257

Contents:

> Has anyone ever actually had customers accept an outage because AWS was down; or is this just cloud evangelicalism copium?

> [ ] Yeah, outages free pass

> [ ] No, they say to use AZ's




> No, they say to use AZ's

Using 3 AZs in us-east-1 won't save you.

I guess a demanding customer would have said 'you should have implemented disaster recovery so you could fail over to us-east-2', but that's easier said than done. The more regional AWS services you adopt, the bigger the impact. How does one recover from a regional outage if their deployment pipeline is in that region?


What I did, once I was in the position of _having_ to provide that level of support, was to run the pipeline in a third region, different from the "prod" ones. That way, worst case, you can't do deployments during the outage...
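
What that looks like in practice, as a minimal sketch (boto3 here is just for illustration; the region and pipeline names are placeholders, not what we actually ran):

  import boto3

  # CI/CD control plane pinned to a third region, separate from both prod regions,
  # so a prod-region outage only blocks deployments, not the running service.
  PIPELINE_REGION = "eu-west-1"                # hypothetical "third" region
  PROD_REGIONS = ["us-east-1", "us-east-2"]    # where the service actually runs

  codepipeline = boto3.client("codepipeline", region_name=PIPELINE_REGION)

  def trigger_deploy(pipeline_name: str) -> str:
      # Kick off the pipeline from the third region; its deploy stages
      # target the prod regions.
      response = codepipeline.start_pipeline_execution(name=pipeline_name)
      return response["pipelineExecutionId"]

(If I remember right, cross-region CodePipeline actions also need an artifact bucket in each target region, which is part of why "just move it to a third region" is more work than it sounds.)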

Another alternative we studied was to use a third-party CI/CD service outside of our network. It was discarded because you never know where that would actually run.


> It was discarded because you never know where that would actually run

Yep, I considered switching to GitHub Actions, which would _theoretically_ eliminate the need for disaster recovery for CI/CD (since handling disasters is out of your hands), but in practice their SLA is far worse than just running CodePipeline in a single region.


Yeah, that's why we went with a third region instead. But, at the end of the day, if _only_ the ability to ship changes is affected for a couple of hours, that doesn't impact the service much.


I’ve worked for several systemically important megacorps where certain things had to run not only cross-region but also cross-provider. It’s absurdly difficult, and should only be done if you need five or more 9s of availability. Almost nothing actually does.
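
For context, a quick back-of-the-envelope on what those availability targets actually allow (plain Python, purely illustrative):

  # Allowed downtime per year at a given availability target.
  MINUTES_PER_YEAR = 365.25 * 24 * 60

  for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
      allowed = (1 - availability) * MINUTES_PER_YEAR
      print(f"{nines} nines: ~{allowed:.1f} minutes of downtime per year")

  # prints roughly 526, 52.6 and 5.3 minutes respectively

A single region-wide incident blows a ~5-minute annual budget many times over, which is why cross-provider setups exist, and also why so few workloads can justify them.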


It's important to inform customers about the resiliency of their systems and let them pick how far they're willing to invest in it.

Then you get to eat popcorn when stuff explodes.

  * single server event.   $
  * multi server event.    $$
  * single az event.       $$$
  * multi az event.        $$$$
  * global provider event. $$$$$
  * cross provider event.  $$$$$$
  * alien invasion.        $$$$$$$$$$$$$$


Back when we had servers in an onsite DC, we lost a RAID card and the system I was developing went down. We had the fancy support, so a tech was out and the card was replaced within a couple of hours; then we had to restore from tape backup. All in all, a non-critical system was down for most of a business day. My boss's boss stormed in, upset he couldn't pull a report, and asked how we could prevent this in the future. I responded that, at a minimum, we'd have to double the cost for a hot standby, and he said 'never mind' and walked out.


That sounds nice. My boss's boss is usually the one storming in, and he usually says "okay let's do it", and then I have to implement it in a week...


That’s why you always BOFH the estimates to include some fun toys for yourself, too.


"Briefly describe the '$$$$$$$' through '$$$$$$$$$$$$$' situations. Can't leave money lying on the table."

- memo from Enterprise Sales Dept.


Alien-invasion resistance is spy-novel / Agents of S.H.I.E.L.D. territory: multiple redundant, isolated, off-the-books safehouses and bases.

Short of the alien-invasion level, you have strategic military levels of resistance to global/regional wars, with differing levels of weaponry and devastation.


Just need to deploy your service on Mars AND Earth. Duh


Multi-planetary-AZ DB cluster deployments. Putting the emphasis on "Eventual" in eventual consistency. Go for a walk before retrying the read from this replica!


And you thought the current time zone confusion was bad. Now you have two sets of time zones, and a varying delay of about 5 to 21 minutes between them. Oh, the joy!
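
(For the curious: the one-way light delay is easy to ballpark from the Earth-Mars distance extremes; the figures below are the theoretical min/max, so real delays land somewhere in between.)

  # One-way light delay between Earth and Mars at the orbital distance extremes.
  SPEED_OF_LIGHT_KM_S = 299_792.458
  distances_km = {"closest approach": 54.6e6, "farthest separation": 401e6}  # approximate

  for label, km in distances_km.items():
      minutes = km / SPEED_OF_LIGHT_KM_S / 60
      print(f"{label}: ~{minutes:.0f} minutes one way")

  # closest approach: ~3 minutes one way; farthest separation: ~22 minutes one way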


note to self: synchronous replication may be a problem


Always be prepared for alien invasion


This should be logarithmic


The nice thing is that any graph without a unit can be log-scale - so in a way, it already is.


My employer lets customers choose which of our supported regions to run in and exempts cloud-provider outages from our SLA (we’re on the hook for staying up through single-AZ outages, but not multi-AZ or region-wide outages). We provide tools to help customers replicate their data so they can be multi-region or even multi-provider if they want to.


AZs don't really help when it's AWS's own services breaking across the entire region. Anecdotally, we have had customers accept outages that were out of our control, without penalty.


Wild. At my previous companies, that wouldn't have flown when datacenter providers had issues.

AWS really does have an easier time than old-school datacenter providers. I guess the complexity is higher, but it's shocking that they can charge so much and yet we hold them to a lower standard.


DCs are pretty static and offer way fewer services than AWS or any other public cloud.

I worked for one for some time, and whenever we had issues, some people would call and ask if we were going bankrupt. It gave me the feeling they also have much smaller customers who might not understand the underlying stack.


If all you use in AWS is static EC2 instances, you would have to go back a looooong time to find an outage that affected their availability. Even in us-east-1.


December 22, 2021 was the last partial impact we had in us-east-1 for EC2 instances. They had power issues in USE1-AZ4 that took a while to sort out.


Outage rates are also wildly different. When you're using dozens of managed services with any reasonable (cross-AZ) design and still have a few prod-impacting outages, customers are less sensitive than when they depend on dozens of products that have independent failure modes with potentially cascading impact.


AZs also don't help with natural disasters at all. I believe AWS is the only one doing geographically distributed AZs; for the others it just means separate connections and being placed somewhere else in the building.

edit: turns out AWS is the one with geo distribution, not Azure


AWS AZs are also distributed geographically within a region, with separate power and network lines. From the docs: “Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones.”


Ah, you are probably right. I was thinking of the incident a few weeks back where the fire suppression took out multiple AZs, but that was actually GCP.


Depends on your customers.

If your customers are tech, they're too busy running around with their hair on fire, too.


> Has anyone ever actually had customers accept an outage because AWS was down...

Whether customers "accept" it or not just comes down to what's in your SLA, if you have one in the first place, and whether they are on a contract tier that it applies to. [Many services provide no SLA for hobby / low tiers, beta features, etc.]

Firebase Auth, for instance, offers no SLA at all [1].

I would be curious to see statistics across a range of SLAs for what % include a force majeure or similar clause which excludes responsibility for upstream outages. I would expect this to be more common with more technical products / more technical customers.

[1]: https://stackoverflow.com/a/60500860/149428


I can think of more times where a whole AZ has had issues than times where just one AZ went dark and failover happened seamlessly.


s/whole AZ/whole region/


Maybe cheaper regions have more users and higher outage rates.





