
I’ve been running platform teams on AWS for 10 years now, and working in AWS for 13. For anyone looking for guidance on how to avoid this, here’s the advice I give the startups I advise.

First, if you can, avoid us-east-1. Yes, you’ll miss new features, but it’s also the least stable region.

Second, go multi-AZ for production workloads. Safety of your customers’ data is your ethical responsibility. Protect it, back it up, keep it as generally available as is reasonable.
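
If it helps, here’s a rough boto3 sketch of that for an RDS instance; the identifiers, instance size, and retention period are made-up placeholders, not a recommended config:

    import boto3

    rds = boto3.client("rds", region_name="us-west-2")

    # Hypothetical instance: MultiAZ keeps a synchronous standby in a second
    # AZ, BackupRetentionPeriod keeps automated daily snapshots around.
    rds.create_db_instance(
        DBInstanceIdentifier="prod-db",      # placeholder name
        DBInstanceClass="db.m6g.large",      # placeholder size
        Engine="postgres",
        MasterUsername="dbadmin",
        MasterUserPassword="change-me",      # pull from a secrets manager in real life
        AllocatedStorage=100,                # GiB, placeholder
        MultiAZ=True,                        # standby replica in another AZ
        BackupRetentionPeriod=7,             # days of automated backups
    )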

Third, you’re gonna go down when the cloud goes down. Not much use getting overly bent out of shape. You can reduce your exposure by just using their core systems (EC2, S3, SQS, LBs, CloudFront, RDS, ElastiCache). The more systems you use, the less reliable things will be. However, running your own key-value store, API gateway, event bus, etc., can also be way less reliable than using theirs. So, realize it’s an operational trade-off.

Degradation of your app / platform is more likely to come from you than from AWS. You’re gonna roll out bad code, break your own infra, and overload your own system way more often than Amazon is gonna go down. If reliability matters to you, start by examining your own practices before reaching for things like multi-region or super durable, highly replicated systems.

This stuff is hard. It’s hard for Amazon engineers. Hard for platform folks at small and mega companies. It’s just, hard. When your app goes down, and so does Disney+, take some solace that Disney, with all their buckets of cash, also couldn’t avoid the issue.

And, finally, hold cloud providers accountable. If they’re unstable and not providing service you expect, leave. We’ve got tons of great options these days, especially if you don’t care about proprietary solutions.

Good luck y’all!




Easy to say "leave"; the technical lock-in cloud service providers choose to have by design makes it impossible to leave.

AWS (and others) make egress costs insanely expensive for any startup to consider leaving with their data. Also, there is a constant push to either not support open protocols or to extend/expand them in ways that make it hard to migrate a code base easily.

If the advice is to effectively only use managed open source components, then why AWS at all? Most competent mid-sized teams can do that much cheaper with colo providers like OVH/Hetzner.

The point of investing in AWS is not just to outsource running base infra; that point is lost if we are supposed to stay away from leveraging the kind of cloud native services us mere mortals cannot hope to build or maintain.

Also, this "avoid us-east-1" advice is a bit frustrating. AWS does not have to always experiment with new services in the same region; it is not marked as an experimental region, nor does it have reduced SLAs. If it is inferior/preview/beta, then call that out in the UI and the contract. And what about when there is no choice? If CloudFront is managed out of us-east-1, should we now not use it? Why use the cloud then?

If your engineering only discovers scale problems in us-east-1, along with customers, perhaps something is wrong? AWS could limit new instances in that region and spread the load. Playing with customers who are at your mercy, just because you can, is not nice.

Disney can afford to go down, or to build their own cloud; small companies don't have the deep pockets to do either.


> AWS (and others) make egress costs insanely expensive for any startup to consider leaving with their data

I have seen this repeated many times, but I don't understand it. Yes, egress is expensive, but it is not THAT expensive compared to storage. S3 egress per GB is no more than 3x the monthly price of storage, i.e. moving out just costs about 3 months of storage cost (there's also API cost, but that's not the one often mentioned). Back-of-the-envelope below.

Is egress pricing being a lock-in factor just a myth? Is there some other AWS cost I'm missing? Obviously there will be big architectural and engineering cost to move, but that's just part of life.
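
Quick sketch in Python; the per-GB prices are illustrative placeholders, not current AWS list prices, so check the pricing page before relying on them:

    # Rough arithmetic: how many months of storage does one full egress cost?
    # Prices below are placeholders for illustration only.
    storage_per_gb_month = 0.02   # assumed S3-standard-ish price, USD per GB-month
    egress_per_gb = 0.06          # assumed internet egress price, USD per GB

    data_gb = 50_000              # hypothetical 50 TB dataset
    one_time_egress = data_gb * egress_per_gb
    monthly_storage = data_gb * storage_per_gb_month

    print(f"egress ~ ${one_time_egress:,.0f}, i.e. "
          f"{one_time_egress / monthly_storage:.1f} months of storage")
    # With these placeholder numbers, moving the data out once costs on the
    # order of a few months of storage -- significant, but not unbounded.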


Often the other cloud vendors will help cover those migration costs as part of your contract negotiations.

But really, egress costs aren’t locking you in. It’s the hard-coded AWS APIs, Terraform scripts, and technical debt. Having to change all of that and refactor and re-optimize for a different provider’s infrastructure is a huge endeavor. That time might have a higher ROI being put elsewhere.


3 months is only if you use standard S3. However, Intelligent-Tiering, Infrequent Access, Reduced Redundancy, or Glacier Instant Retrieval can be substantially cheaper without impacting retrieval time. [1]

At scale, when costs matter, you would have lifecycle policies tuned to your needs taking advantage of these classes (rough sketch below). Any typical production workload is hardly paying only the S3 base price for all or most of its storage needs; it will have a mix of all of these too.

[1] If there is substantial data in regular Glacier, the costing completely blows through the roof; retrieval + egress makes it infeasible unless you actively hate AWS enough to spend that kind of money.
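
A rough sketch of what such a lifecycle policy might look like with boto3; the bucket name and day thresholds are made up for illustration:

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical policy: tier objects down to cheaper classes as they age.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",  # placeholder
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-down-with-age",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to every object
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                        {"Days": 90, "StorageClass": "GLACIER_IR"},
                    ],
                }
            ]
        },
    )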


Lesson: build your services with Docker and Terraform. With this setup you can spin up a working clone of a decently sized stack on a different cloud provider in under an hour.

Don't lock yourself in.


If the setup is that portable, you probably don't need AWS at all in the first place.

If you only use services built and managed by your Docker images, why use the cloud in the first place? It would be cheaper to host on a smaller vendor. The reliability is not substantially better with a big cloud than with tier-two vendors; the difference between, say, OVH and AWS is not valuable enough to most applications to be worth the premium.

IMO, if you don't leverage the cloud native services offered by GCP or AWS, then the cloud is not adding much value to your stack.


This is just not true for Terraform at all; it does not aim to be multi-cloud, and it is a much more usable product because of it. Resource parameters do not swap out directly across providers (rightly so, since the abstractions each provider chooses are different!).


...if you don't have much data, that is. Otherwise, you'll have huge egress costs.


You've written up my thoughts better than I can express them myself - I think what people get really stuck on when something like this happens is the 'can I solve this myself?' aspect.

A 'wait for X provider to fix it for you' situation is infinitely more stressful than an 'I have played myself, I will now take action' situation.

Situations out of your (immediate) resolution control feel infinitely worse, even if the customer impact in practice of your fault vs cloud fault is the same.


For me it’s the opposite… AWS outages are much less stressful than my own, because I know there’s nothing I/we can do about it, they have smart people working on it, and it will be fixed when it’s fixed.


I couldn't possibly disagree more strongly with this. I used to drive frantically to the office to work on servers in emergency situations, and if our small team couldn't solve it, there was nobody else to help us. The weight of the outage was entirely on our shoulders. Now I relax and refresh a status page.


> Third, you’re gonna go down when the cloud goes down.

Not necessarily. You just need to not be stuck with a single cloud provider. The likelihood of more than one availability zone going down on a single cloud provider is not that low in practice. Especially when the problem is a software bug.

The likelihood of AWS, Azure, and OVH going down at the same time is low. So if you need to stay online if AWS fail, don't put all your eggs in the AWS basket.

That means not using proprietary cloud solutions from a single cloud provider. It has a cost, so it's not always worth it.


> not using proprietary cloud solutions from a single cloud provider. It has a cost, so it's not always worth it.

But perhaps some software design choices could be made to alleviate these costs. For example, you could have a read-only replica on Azure or whatever backup cloud provider, and design your software interfaces to allow the use of such read-only replicas (rough sketch below); at least you'd be degraded rather than unavailable. Ditto with web servers, etc.

This has a cost, but it's lower than entirely replicating all of the proprietary features in a different cloud.
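
A minimal sketch of the kind of interface I mean, in Python; the class and method names here are hypothetical, not any particular library's API:

    # Hypothetical degraded-mode data access layer: reads fall back to a
    # read-only replica on another provider, writes fail fast when the
    # primary is unreachable.
    class DataStore:
        def __init__(self, primary, read_replica):
            self.primary = primary            # e.g. the AWS-hosted database
            self.read_replica = read_replica  # e.g. a read-only copy on Azure

        def read(self, key):
            try:
                return self.primary.get(key)
            except ConnectionError:
                # Degraded, but still serving reads from the replica.
                return self.read_replica.get(key)

        def write(self, key, value):
            # Writes only go to the primary; during an outage callers must
            # queue, retry, or surface the error instead of silently losing data.
            return self.primary.put(key, value)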


Complex systems are expensive to operate, in many ways.

The more complexity you build into your own systems on top of the providers you depend on, the more likely you are to shoot yourself in the foot when you run into complexity issues that you’ve never seen before.

And the times that is most likely to happen is when one of your complex service providers goes down.

If the kind of thing you’re talking about could feasibly be done, then Netflix would have already done it. The fact that Netflix hasn’t solved this problem is a strong indicator that piling more proprietary complexity on top of all the vendor complexity you inherit from using a given service is a really hard problem in and of itself.


True multi-cloud redundancy is hard to test, because it’s everything from DNS on up, and it’s hard to ask AWS to go offline so you can verify Azure picks up the slack.


I deeply concur with this statement. I think folks here are conflating a one-off test with keeping your redundancy up to date as apps evolve.


Sure you can. Firewall AWS off from whatever machine does the health checks in the redundancy implementation.


What happens when your health check system fails?


It's true, but you can do load balancing at the DNS level.
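
For example, something like weighted records in Route 53, sketched with boto3; the hosted zone ID, domain, and IPs below are placeholders:

    import boto3

    route53 = boto3.client("route53")

    # Hypothetical setup: split traffic 50/50 between an AWS deployment and
    # one on another provider.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000000000",  # placeholder
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "aws",
                        "Weight": 50,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "203.0.113.10"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "other-cloud",
                        "Weight": 50,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "198.51.100.20"}],
                    },
                },
            ]
        },
    )
    # Without health checks attached, a dead endpoint keeps receiving its
    # share of traffic until you repoint or remove its record.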


And you will get 1/N of requests timing out or erroring out, while in the meantime paying 2x or 3x the costs. So, it might be worth it in some cases, but you need to evaluate it very, very well.


Or rent bare metal servers like old times and be responsible for your own s*t


Still plenty of networking issues that can knock you down hard.


... and be responsible for your own s*t

Don't miss the point: being able to do something about it, instead of a multi-hour outage where you're in the dark about what is going on.


> And, finally, hold cloud providers accountable. If they’re unstable and not providing service you expect, leave. We’ve got tons of great options these days, especially if you don’t care about proprietary solutions.

Easy to say, but difficult to do in practice (leaving a cloud provider)


Absolutely hard. But that doesn’t mean that, if you’re in a position to start a company from scratch, you can’t walk away. Or that, if you go to another company and are involved in their procurement of a new purchase, you can’t sway it away from said provider.

Just because it takes years doesn’t mean it can’t happen.


> Third, you’re gonna go down when the cloud goes down. Not much use getting overly bent out of shape.

Ugh. I have a hard time with this one. Back in the day, EBS had some really awful failures and degradations. Building a greenfield stack that specifically avoided EBS and stayed up when everyone else was down during another mass EBS failure felt marvelous. It was an obvious avoidable hazard.

It doesn't mean "avoid EBS" is good advice for the decade to follow, but accepting failure fatalistically doesn't feel right either.


I hear you. I didn’t use EBS for five years after the great outage in, what was it, 2011?

At this point, it’s reliable enough that even if it were to go down, you’re still safer using it than not. I’d put EBS in the pantheon of “core” services I never mind using these days.


Yup, 2011. That's the one. One of those US presidential campaigns stayed up throughout because of EBS-phobia.

Geez. We have decades-old cloud war stories now? I suddenly feel really old.


> Safety of your customers’ data is your ethical responsibility. Protect it, back it up, keep it as generally available as is reasonable.

> Third, you’re gonna go down when the cloud goes down. Not much use getting overly bent out of shape.

“Whoops, our provider is down, sorry!” is not taking responsibility with customer data at all.


Respectfully disagree. No company in the world has 100% uptime. Whether it’s your server rack or their server rack going down means nothing to a customer.

We’re not discussing data loss in this thread specifically. This is about a couple of hours of downtime per year.


Hey Wes! I upvoted your comment before I noticed your handle. +1 insightful, as usual


Brown nose


Troll





