Me: Hesitant at my last job about moving absolutely everything (including backups) to AWS, because if it goes down it's a problem. I'm a firm believer in some kind of physical/easily accessible backup.
Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <-Quite literally their words.
Me: leaves cause that treatment was the final straw
Amazon and Facebook both go down within a month of each other, and supposedly they needed backups
I'd be surprised if they needed backups for a few hours of downtime with (reportedly) complete recovery where no data was corrupted. There are industries where this would be required, and it's possible I guess, but neither of these downtime events were "data loss" events, just availability events for short-ish periods of time that wouldn't - for me - result in activating our DR plans.
I must admit that I do always try to maintain a separate data backup for true disaster recovery scenarios - but those are mainly focused on AWS locking me out of our AWS account (and hence we can't access our data or backups), or recovering from a crypto-scam hack that also corrupts on-platform backups, for example.
I once had to argue that we still need backups even though S3 has redundancy. They laughed when I mentioned a possible lock-out by AWS (even due to a mistake or whatever). I asked: what if we delete data from the app by mistake? They told me we just need to be careful not to do that. I guess I am getting more and more tired of arrogant 25-year-old programmers with 1-2 years in the industry and no experience.
One thing you should absolutely not count on, but which might be a course of action for large clients, is to contact support and ask them to restore accidentally / maliciously deleted files.
I would never use this as part of a backup-and-restore plan, but I was lucky once when a bunch of customer files were deleted due to a bug in a release. Something like 100k files were deleted from Google Storage without us having a backup. In a panic we contacted GCP. We were able to provide a list of all the file names from our logs. In the end, all but 6 files were recovered.
I think it took around 2-3 days to get all the files restored, which was still a big headache and impactful to people.
This is not a reliable mechanism, btw. There will be times when they won't be able to restore the data for you. Their product has options to avoid this situation, like object versioning.
S3 (and others) have version history that can be enabled.
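For what it's worth, turning it on is a one-liner. A minimal boto3 sketch, assuming a hypothetical bucket name:

    # Enable versioning on an existing bucket (bucket name is made up).
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket="example-backups",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # With versioning on, a DELETE only adds a delete marker; prior versions
    # can still be listed and restored by version ID.

It doesn't replace real backups (a compromised account can still purge versions), but it covers the "someone deleted a directory by mistake" case.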
If you have to take care of availability and redundancy and delete protection and backups yourself, then why pay the premium S3 is charging?
Either you don't trust the cloud, in which case you can run a NAS or equivalent (with S3 APIs, easily, today) much more cheaply, or you trust them to keep your data safe and available.
No point in investing in S3 and then doing it again yourself.
> No point in investing in S3 and then doing it again yourself.
I mean that's just obviously wrong, though.
There is a point.
> Either you don't trust the cloud and you can run NAS or equivalent (with s3 APIs easily today) much cheaper or trust them to keep your data safe and available.
What if you trust the cloud 90%, and you trust yourself 90%, and you think the failure cases between the two are likely to be independent? Then it seems like the smart decision would be to do both.
Your position is basically arguing that redundant systems are never necessary, because "either you trust A or you trust B, why do both?" If it's absolutely critical that you don't suffer a particular failure, then having redundant systems is very wise.
My point is, if your redundancy is better than AWS's, then why pay for them? If it isn't, then why invest in your own?
You can argue that you protect against different threats than AWS does. So far I have not seen a meaningful argument for threats that on-prem protects against differently from the cloud, such that you need both.
Say, for example, your solution is to put all your data backups on the moon; then it makes sense to do both, since AWS does not protect against planet-wide issues.
However, if you are both protecting against the exact same risks, provider redundancy only protects against events like AWS going down for days/months or going bankrupt.
All business decisions carry some risk; provider redundancy does not seem like a risk worth mitigating given the cost it would mean for most businesses I have seen.
Even Amazon.com and Google's apps host on their own clouds rather than using multi-cloud. After all, their regular businesses are much bigger than their cloud businesses, yet they still risk those to stick to their own cloud/services only.
> My point is, if your redundancy is better than AWS's, then why pay for them? If it isn't, then why invest in your own?
This is a really confusing question. Redundancy requires more than 1 option. It's not about it being better than AWS, it's that in order to have it you need something besides just AWS. AWS may provide redundant drives, but they don't provide a redundant AWS. AWS can protect against many things, but it cannot protect against AWS being unavailable.
> Even Amazon.com and Google's apps host on their own clouds rather than using multi-cloud. After all, their regular businesses are much bigger than their cloud businesses
This is probably true with Google, but AWS contributes > 50% of Amazon's operating income. [1]
Interesting, no wonder AWS head became Amazon CEO.
Their retail/e-commerce side is less profitable than AWS, but the absolute revenue is still massive, and the risk of losing a chunk of that revenue (and income) due to tech issues is still an enormous risk for Amazon.
You and AWS are using similar chips and similar hard disks, even with similar failure rates.
If you both use the same hardware from, say, the same batch, both can have defects and fail at similar times. Or you use the same file system, which, say, corrupts both your backups.
90% is not a magic number; you need to know AWS's supply chains and practices thoroughly, and keep yours different enough not to share the same risks AWS has, for your systems to have independent probabilities of failure.
True. One would want to continually decorrelate services or model the dependencies. Redundancy will help even with some dependency, but you raise an important point.
But you still have some risks here, yes, with a super low probability, but a company-killing impact.
In some industries - banking, finance, anything regulated, or really (I'd argue) anywhere where losing all of your data is company killing - you will need a disaster recovery strategy in place.
The risks requiring non-AWS backups are things like:
- A failed payment goes unnoticed and AWS locks you out of your AWS account, which also goes unnoticed, and the account and data are deleted
- A bad actor gains access to the root account by faxing Amazon a fake notarized letter, finding a leaked AWS key, or social-engineering one of your DevOps team, and encrypts all of your data while removing your AWS-based backups
- An internal bad actor deletes all of your AWS data because they know they're about to be fired
...and so on.
There are so many non-technical scenarios that can make a single-vendor dependency for your entire business unwise.
A storage array in a separate DC somewhere, where your platform can send (and only send! not access or modify) backups of your business-critical data, ticks off those super-low-probability but company-killing risks.
This is why risk matrices have separate probability and impact sections. Minuscule probability but "the company directors go to jail" impact? Better believe I'm spending some time on that.
Just to add that S3 supports a compliance object lock that can't even be overridden by the root user. Also AWS doesn't delete your account or data until 90 days after the account is closed.
Between these two protections, it's pretty hard to lose data from S3 if you really want to keep it. I would guess they are better protections than you could achieve in your own self managed DC.
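For anyone curious, setting that up is roughly this shape in boto3 - a hedged sketch with a made-up bucket name and retention window, noting that Object Lock has to be enabled when the bucket is created (and this assumes the default region):

    import boto3

    s3 = boto3.client("s3")

    # Object Lock can only be enabled at bucket creation time.
    s3.create_bucket(
        Bucket="example-compliance-backups",
        ObjectLockEnabledForBucket=True,
    )

    # COMPLIANCE mode: no principal, including root, can delete locked
    # object versions until the retention period expires.
    s3.put_object_lock_configuration(
        Bucket="example-compliance-backups",
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 90}},
        },
    )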
I'm guessing AWS has some clause in their contract that means they can refuse to deal with you or even return any of your data if they feel like it. Not sure if that's ever happened, but still worth considering it.
Yes, threat models are the obvious qualifier; if you have a business that requires a backup on the moon in case of an asteroid collision, then by all means go for it. [1]
For most companies, what AWS or Azure offers is more than adequate.
An internal bad actor with that level of privileged access can delete your local backups too, and anything an external one can do to AWS, they can likely do more easily to your company's storage DC.
Bottom line, it doesn't matter much; if customers can pay for all this low-probability stuff that can only happen on the cloud and not on-prem, sure, go ahead. Half the things customers pay for they don't need or use anyway.
[1] assuming your business model allows for the expense outlay you need for that threat model
Nope. 3-2-1 strategy: 3 copies, 2 different media, 1 offsite. Now try to delete files from the media in my safe. Only I have a key.
Sure, your threat model may vary. But relying on the cloud alone for your backups is simply not enough. If you split access for your AWS backup and your DC backup between two different people, you've mitigated your threat model. If you only have 1 backup location, that's going to be very hard.
All of these are questions asked and solved 10 years ago by bean counters whose only job is risk mitigation.
Every cloud provider has compliance locks that even the root user cannot disable, plus version history, and you can set up your own copy workflow from one storage container to a second container, with no delete/update access on the second one, split between two different people or whatever.
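On AWS, one way to wire that up is bucket replication into a second, locked-down bucket. A rough boto3 sketch - bucket names and the role ARN are made up, both buckets need versioning enabled, and the role needs the standard replication permissions:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="example-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::111111111111:role/example-replication-role",
            "Rules": [
                {
                    "ID": "copy-everything",
                    "Status": "Enabled",
                    "Priority": 1,
                    "Filter": {},
                    # Don't propagate deletes to the copy.
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": "arn:aws:s3:::example-backup-copy"},
                }
            ],
        },
    )

The point is that whoever can write to the primary bucket has no delete/update rights on the copy.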
Not sure I agree about the usefulness of different media.
Having had to restore databases from tapes and removable drives for a compliance/legal incident, we had a failure rate of >50% on the tapes and about 33% for the removable drives.
I came away not trusting any backup that wasn't online.
At $50/month scale a lot of things are possible. Most companies cannot store their data on a hard disk in a safe. If you can, then the cloud is a convenience, not a necessity, for you - i.e. you are perfectly fine running your own storage stack for the most part.
My company is not very big (100ish employees) and we pay $200k+ to AWS for storage alone, and AWS is not even our primary cloud. If we had to do what you describe, it would probably be another $500k in bandwidth costs alone. Add running costs in another cloud, recurring bandwidth for transfers, and retrieval from Glacier for older data on top of that. [1]
Over 3 years that would easily be $1-$1.5 million in net new expenses at our scale.
No sane business is going to sign off on +3x storage costs for a risk that cannot be easily modeled [2] and costs that cannot be priced into the product, just so one sysadmin can sleep better at night.
[1] Your hard-disk-in-a-safe third component is not a sensible discussion point at reasonable scale.
[2] This would be: probability of data loss with AWS * business cost of losing that data > cost of a secondary system.
Or: probability of a data availability event (like now) * business cost of that > cost of an active secondary system.
For almost no business in the world would either inequality hold.
For example, even if the cost is $100B in revenue, with 6 nines of durability the expected loss would be only about $100,000 ($100B * 0.000001); a secondary system is much costlier than that.
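As a back-of-the-envelope check on that arithmetic (using the figures in this thread, not real numbers):

    revenue_at_risk = 100e9   # $100B of revenue tied to the data
    p_loss = 1e-6             # "six nines" of durability -> 0.000001 chance of loss
    expected_loss = revenue_at_risk * p_loss
    print(expected_loss)      # 100000.0, i.e. roughly $100k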
There are completely independent risks that you are dealing with here. If you are a small company there is a non-insignificant risk that your cloud account will be closed and it will be impossible to find out why or to fix it in a timely manner. There have been several that were only fixed after being escalated to the front page of Hacker News, and we haven't heard about the ones that didn't get enough upvotes to get our attention and were never fixed.
Also, what we saw on Dec 7th was that the complexity of Amazon's infrastructure introduces risks of downtime that simply cannot be fully mitigated by Amazon, or by any other single provider. More redundancy introduces more complexity at both the micro level and macro level.
It doesn't really cost that much to at least store replicated data in an independent cloud, particularly a low-cost one like Digital Ocean.
Backup on site and store tertiary copies in a cloud. Storing all backups in AWS wouldn't meet a lot of compliance requirements. Even multiple AZs in AWS would not pass muster as there are single points of failure (API, auth, etc).
Whether you realize it or not, you believe in the Scapegoat Effect, and it's going to get you into a shitload of trouble some day.
Customers don't care if it's your fault or not, they only care that your stuff is broken. That safety blanket of having a vendor to blame for the problem might feel like it'll protect your job, but the fact is that there are many points in your career where there is one customer we can't afford to lose for financial or political reasons, and if your lack of pessimistic thinking loses us that customer, then you're boned. You might not be fired, but you'll be at the top of the list for a layoff round (and if the loss was financial, that'll happen).
In IT, we pay someone else to clean our offices and restock supplies because it's not part of our core business. It's fine to let that go. If I work at a hotel or a restaurant, though, 'we' have our own people that clean the buildings and equipment. Because a hotel is a clean, dry building that people rent in increments of 24 hours. Similarly, a restaurant has to build up a core competency in cleanliness or the health department will shut them down. If we violate that social contract, we take it in the teeth, and then people legislate away our opportunities to cut those corners.
For the life of me I can't figure out why IT companies are running to AWS. This is the exact same sort of facilities management problem that physical businesses deal with internally.
I have saved myself and my teams from a few architectural blunders by asking the head of IT or Operations what they think of my solution. Sometimes the answer starts with, "nobody would ever deploy a solution that looked like that". Better to get that feedback in private rather than in a post-mortem or via veto in a launch meeting. But I have had less and less access to that sort of domain knowledge over the last decade, between Cloud Services and centralized, faceless IT at some bigger companies. It's a huge loss of wisdom, and I don't know that the consequences are entirely outweighed by the advantages.
Having an additional AWS account that some S3 buckets back up to, with write-only permissions (no delete), in an account that is not used by anyone, seems like a good idea for this type of situation/concern.
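A hedged sketch of what that could look like: the backup bucket lives in the separate account, and its bucket policy only lets the production account write new objects - no reads, no deletes. Account IDs and names are made up:

    import json
    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOnlyFromProdAccount",
                "Effect": "Allow",
                # 222222222222 stands in for the production account.
                "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
                "Action": "s3:PutObject",
                "Resource": "arn:aws:s3:::example-offsite-backups/*",
            }
        ],
    }

    # Run with credentials for the backup account, not production.
    s3 = boto3.client("s3")
    s3.put_bucket_policy(
        Bucket="example-offsite-backups",
        Policy=json.dumps(policy),
    )

Combined with versioning or Object Lock on that bucket, a compromised production account can write garbage but can't destroy what's already there.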
I had this experience when I asked about S3 backups too (after a junior programmer deleted a directory in our S3 bucket...). The response from r/aws was "just don't let that happen" or "use IAM roles".
Maybe they are getting tired of arrogant older programmers assuming they cannot possibly be wrong. God forbid a 25 year old might actually have a good idea (and I am far removed from my 20s).
Maybe having S3 redundancy wasn't the most important thing to be tackled? Does your company really need that complexity? Are you so big and such an important service that you cannot possibly risk going down or losing data?
I'm not sure how you got "backups are for old people" from my post. My point is that there are two sides to this. Perhaps the data being stored on S3 data _was_ backup data and this engineer was proposing replicating the backup data to GCP. That's probably not the highest priority for most companies. Maybe the OP was right and the other engineers were wrong. Who knows.
In my experience, the kind of person that argues about "arrogant 25 year olds that know everything" is the kind of person that only sees their side of a discussion and refuses to understand the whole context. Maybe OP was in the right, maybe they weren't. But the fact that they are focusing on age and making ad hominem attacks is a red flag in my book.
I've most definitely been in numerous places where arrogant 25 year olds with CS degrees, but not smart enough to make it to FAANG, think they know what they are talking about when they don't. Not every 25yo is an idiot, but many, especially in tech, think they are smarter than they are because they're paid these obscene amounts of money.
But that's just it; you can't even have that discussion if the response to "hey, should we be backing up beyond S3 redundancy?" is "No. Why would we? S3 is infallible"
Sure you can. As the experienced engineer in that setting it is a great opportunity to teach the less experienced engineers. For example, "I have seen data loss on S3 at my last job. If X, Y, or Z happen then we will lose data. Is this data we can lose? And actually, it is pretty easy to replicate - I think we could get this done in a day or two."
It's also possible the response was "That's an excellent point! I think we should put that on the backlog. Since this data is already a backup of our DB data, I think we should focus on getting the feature out rather than replicating to GCP."
Those are two plausible conversations. Instead, what we have is "these arrogant 25 year olds that have 1-2 years of experience and know it all." That's a red flag to me.
>"Maybe they are getting tired of arrogant older programmers..."
And this is of course a valid reason to ignore basic data-preservation approaches.
Myself I am an old fart and I realize that I am too independent / cautious. But I see way too many young programmers who just read sales pitch and honestly believe that once data is on Amazon/Azure/Google it is automatically safe, their apps are automatically scalable, etc. etc.
Yes - the point of that line was to be ridiculous. Age has nothing to do with it. Anyone at any age can have good ideas and bad ideas. There are some really incredible _older_ and highly experienced engineers out there. But there are others who think that experience means they are never wrong. Age has nothing to do with this - what is important is your past experience, your understanding of the problem and the context of the problem, and how you work with your team.
And again, my point isn't that you never need backups. My point is that it is entirely plausible that at that point in time backups from S3 weren't a priority.
Would you put the one and only copy of your family photo album up on AWS, where AWS going down would mean losing it? Because your customers' data is more important than that
AWS going down means I've lost it or temporarily lost access to it? Those are two very different scenarios. Of course S3 could lose data - a quick Google search shows it has happened to at least one account. My guess is it is rare enough that it seems like a reasonable decision not to prioritize backing up your S3 data. I'm not saying "never ever back up S3 data", only that it seems reasonable to argue it's not the most important thing our team should be working on at this moment.
I have my family photos on a RAIDed NAS. It took me years to get that setup simply because there were higher priority things in my life. I never once thought "ahh I don't need backups of our data" I just had more important things to do.
AWS has had at least one documented incident where a region had an S3 failure that was not recoverable. They lost about 2% of all data. That might not sound like much but if you have a lot of data, partial restoration of that data doesn't necessarily leave your system in a functional state. If it loses my compiled CSS files I might be able to redeploy my app to fix it. Then again if I'm a SaaS company and that file was generated in part from user input, it might be more difficult to reconstruct that data.
Which incident is this? I can’t find it online. The closest I can recall is when they lost some number of EBS volumes. We were affected by that, but ran snapshots (to s3) to recover the affected servers.
Today's gentle reminder that there are things other than network or service outages that can and do occur that might necessitate an outside backup.
What happens if AWS or [insert other megacloud] decides your account needs to be nuked from orbit due to a hack or some other confusion? We almost had this happen over the summer because of a problem with our bank's ability to process ACH payments. Very frustrating experience. Still isn't fully resolved.
What happens if an admin account is taken over and your account gets screwed up?
What happens if an admin loses his shit and blows up your account?
What happens if your software has a bug that destroys a bunch of your data or fubars your account?
There's a ton of cases where having at least a simple replica of your S3 buckets into a third-party cloud could prove highly valuable.
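Anything that speaks the S3 API works as the target, which keeps the replica dumb and portable. A naive sketch with boto3 - endpoint, buckets, and credentials are all hypothetical, and a real job would want incremental sync, retries, and checksums (or just rclone):

    import boto3

    aws = boto3.client("s3")
    other = boto3.client(
        "s3",
        endpoint_url="https://objects.example-other-cloud.com",
        aws_access_key_id="EXAMPLE_KEY_ID",
        aws_secret_access_key="EXAMPLE_SECRET",
    )

    # Copy every object in the prod bucket to the mirror bucket.
    paginator = aws.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="example-prod-bucket"):
        for obj in page.get("Contents", []):
            body = aws.get_object(Bucket="example-prod-bucket", Key=obj["Key"])["Body"]
            other.upload_fileobj(body, "example-mirror-bucket", obj["Key"])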
Would you be able to expand at all about the ACH/AWS connection, obviously without identifying details?
Was it just a miscommunication around AWS billing and them thinking you weren't paying? Or did AWS somehow put itself in the middle of, or react to, your use of ACH payment processing for *non-AWS* receivables or payables?
If the latter, that's a business risk I'd never even thought about. I'm not even sure how they'd know. But I'm mindful that things like the MATCH list [0] exist, and how easily a merchant can accidentally wind up on these lists from either human error or a small number of high-value chargebacks. If cloud providers are somehow paying attention to merchant-services reputation, that would be very scary for many businesses!
Like most of these things, it was a series of unfortunate events.
In our case {LargeCloud} acquired {SaaSVendor}. We were already using {LargeCloud}, with an existing billing arrangement. When {LargeCloud} got around to integrating the {SaaSVendor} into their billing system, it exposed multiple bugs in {LargeCloud}'s billing system, and ultimately limitations in our bank's internal systems--a well known establishment and it would blow your mind to learn how much manual crap they do.
Traditionally, we received favor from {SaaSVendor} through Invoices. But when {SaaSVendor} was subsumed by {LargeCloud}, we stopped receiving invoices. Our internal ops reached out to {LargeCloud} about this two days before we got our first "You will experience Dire Consequences" email from {LargeCloud}'s Robot Overlords. Our attempts to contact {LargeCloud} regarding this concerning message were always routed to a Robot Overlord who only spoke in tongues and could not solve our problems. Eventually, we were able to get the Robot Overlord to escalate us to a Robot Superlord that would only tell us to "follow the instructions in this handy dandy web page thing", except following the instructions always summoned a "Server 500" Demon, which {LargeCloud} claimed was impossible because their Robots are Divine and Holy.
Finally circling back through random Human Actors we were able to avert the countdown to destruction. Some Robot Necromancer was able to resurrect our billing account from the "Server 500" Demon, but we would now need to set up automatic ACH payments, as whatever fix was implemented could only persist with regular monthly succor upon the altars of the Federal Reserve Automated Clearing WaffleHouse. Invoices, payments arranged through Our Lady of Visa and The Master Card would no longer suffice.
We believed we had made the appropriate incantations before FratBoy 3000 at our local branch of the Federal Reserve Chapel. However, we eventually received another threat of Dire Consequences from {LargeCloud}, indicating that our prayers were not received. It took significant supplication in order to get FratBoy 3000 to confirm that our Federal Reserve Chapel had misrouted our prayers, deducting them from our account, but sending them to the wrong Demon, through no fault of our own.
The whole time this was going on, we kept getting threats of Dire Consequences. We were told by Human Actors to have great faith, that the {LargeCloud} Robot Overlords had been placated through their secret prostrations. FratBoy 3000 was replaced by our Federal Reserve Chaplain, who informed us that they had no robots, this was all the result of Human Actor failures, but that, forthwith, all of our prayers could be answered if we moved all of our faith into a New Account which itself required additional monthly supplication, but would ensure divine routing of our prayers would always be successful.
To this day, we continue to make our monthly pilgrimage to our local Federal Reserve Chapel, supplicating upon all necessary altars. The threats of Dire Consequences from {LargeCloud} have subsided. But we have cast ourselves out onto the trail, seeking refuge from a more receptive and responsive Federal Reserve Chapel.
Everybody focuses on "what if us-east-X goes down", but, literally, sometimes it's a combination of billing and payment issues that can keep you up at night.
I would make a friendly wager that AWS user IDs don't contain check digits, let alone bullet proof ones (simple check digits don't guard against transposition errors). And that somewhere, someone can manually enter an account to delete, and that one of us will eventually have an account numbered XXX1234 and some idiot with account XXX1243 will legitimately earn an account deletion, but we'll be the ones who wake up to bad news.
1) Can you make your on prem infrastructure go down less than Amazon's?
2) Is it worth it?
In my experience most people grossly underestimate how expensive it is to create reliable infrastructure and at the same time overestimate how important it is for their services to run uninterrupted.
--
EDIT: I am not arguing you shouldn't build your own, more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.
If it is too unreliable -- go build your own, but make sure you are not making a huge mistake, because you may not understand what it actually costs to build to AWS's level.
For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand doing this better than AWS does would cost me a huge amount of focus on something that is not the core goal of what we are doing. I would rather spend that time thinking about how to hire/retain better people and how to make my product better.
And adding all the complexity of running this infra to my company would cause the entire organisation to be less flexible, which is also a cost.
So you can't look at the cost of running the infra like a bill of materials for parts and services.
And if there is an outage, it is good to know there is a huge organisation there trying to fix it, while my small organisation can focus on preparing for what to do when it comes back up.
On the other hand, perhaps the large cloud providers bring a level of complexity that outweighs their skills at keeping everything up. What I mean is, a basic redundancy and failover setup with two data centers is kind of straightforward. Sure you need a person on call 24/7 to oversee it, but it's conceptually not that complicated. And if you're running bare metal, you get a surprising level of performance per dollar and rack unit. On the other hand, the big clouds are immensely complex with multiple layers of software defined networking, millions of tenants, thousands of employees, acres of floor space, org charts, etc. If you're running your own infra as one competent sysadmin, you know nobody else in another department will push a networking code change that will break your shit in the middle of the night. Maybe it's not right for everyone, but it's not unreasonable to go on prem in 2021 despite the popular opinions otherwise. Source: my company runs on prem and routinely has 100% uptime years. Most unplanned downtime occurs early on a Sunday morning following a planned action during a maintenance window.
I was and continue to be surprised how reliable even old servers are. I run a small homelab (Debian VMs on Proxmox; a Docker host, a jumpbox, a NAS running ZFS, etc.) on seven year old hardware, and all of my problems are self-imposed. If I leave everything alone, it just works.
As a counterpoint, though, my last place had a large Java app, split between colo'd metal and AWS. Seemed like the colo'd stuff failed more (bad RAM mostly, a few CPUs, and an occasional PSU). Entirely anecdotal.
> Can you make your on prem infrastructure go down less than Amazon's?
Obviously depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.
> Is it worth it?
I'd argue that the "worth" would be less about immunity to occasional outages but the continuous savings when it comes to price per performance & not having to pay for bandwidth.
> overestimate how important it is for their services to run uninterrupted.
Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.
I have run high-availability (HA) systems on prem and your statement vastly understates the difficulty and expense.
You need multiple physical links running to different ISPs, because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP themselves could suffer an outage.
You need a backup generator, and to be a short distance away from a petrol station so you can refuel quickly and regularly during longer power outages. You absolutely do not want to run out of diesel!
You need redundancy for every piece of hardware AND you need to test that failover works as expected, because the last thing you need is a core switch failing and traffic not routing over the secondary core switch as expected.
You need multiple air-con units, powered off different mains inputs, so that if the electrics fail on one unit it doesn't take out the others. I guarantee you that if the air con fails, it will be on the hottest day of the year, and no amount of portable units will stop your servers from overheating.
You need a beefy UPS with multiple batteries. Ideally multiple UPSs, with each UPS powering a different rail on your racks, so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPS. Remember that the backup generator takes a second or two to kick in, so you need something to keep the power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.
And you then need to duplicate all of the above in a second location, just in case the first location still goes down.
By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.
The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.
That's true for on-prem infrastructure, but is all already handled for you if you rent servers from hosting providers such as OVH/Hetzner or even rent colocation space in an existing DC, and is still cheaper than the cloud equivalent (and as we saw recently, actually more reliable as well).
But then you’re still reliant on those hosting providers not fscking up; just like with cloud providers. Literally the same complaint the GP was making about AWS applies for OVH et al too.
In fact I used to run some hobby projects in OVH (as an aside, I really liked their services) so I’m aware that they have their own failures too.
Are they though? Let’s look at what the recent AWS outages have been: a single region (but AWS makes multi-region easy). The biggest impact to most people is the AWS console, something that one seldom actually needs given AWS is API driven. If the same type of outage happened on OVH then you’d lose KVM to your physical servers. But you seldom need those either.
The Azure outage was just AD service but you can roll your own there if you wanted.
Plus if you want to talk about SaaS then OVH et al have their own SaaS too. In fact the difference between OVH and AWS is more about scale than it is about reliability (with AWS you can buy hardware and rack it in AWS just like with OVH too).
Or maybe by "old skool" you mean the few independent hosts that don't offer SaaS. However, they're usually pretty small fry and their outages are less likely to be reported. Whereas any AWS service going down is massive news.
I’m not a cloud-fanboy by any means (I actually find AWS the least enjoyable to manage from a purely superficial perspective) but I’ve worked across a number of different hosting providers as well as building out HA systems on prem and the anti-cloud sentiment here really misses the pragmatic reality of things.
Most people using AWS aren't using multi-region, as evidenced by the wide array of problems on the internet when a region goes down.
I would also argue many aren't even using multiple availability zones, as evidenced by the wide array of problems on the internet when a single AZ goes down.
I think you're vastly over-estimating how most companies are using AWS, and are substituting your own requirements for theirs.
Which is very common in tech. It's part of why people shit on cloud, microservices, and other techniques large mega-corps use on HN. People write posts with lots of assumptions and few details, then people that don't know any better just carbon copy it because hey its what Google does. Meanwhile their lambda microservice system serving a blazing 60 requests per minute has more downtime than if I just hosted it on my laptop with my dialup internet connection.
I fully believe some people are doing AWS wrong. But you cannot compare the worst offenders in AWS against the best defenders of deploying on prem - it’s just not a fair like for like comparison comparing the worst against the best.
Hence why I compare doing HA in AWS correctly vs doing HA on prem correctly.
I've had way more networking and availability failures from Hetzner this year alone than I've ever seen from AWS. They regularly replace their networking switches without any redundancy, leaving entire DCs offline for hours. They're okay for hobby projects, but I would never host a business-critical site with them.
Yes, Hetzner upgrades DCs (datacenter buildings), but they are the equivalent to AWS AZs (Availability Zones). When they upgrade a DC, they notify way in advance, and if you set up your services to span multiple DCs as is recommended, it does not affect you.
We run high-availability Ceph, Postgres, and Consul, across 3 Hetzner DCs, and have not had a Hetzner-induced service downtime in the 5 years that we do so.
That's fair enough, I was comparing single-AZ AWS outages to single-DC Hetzner outages, since that seems to be what people are focusing on. For multi-DC deployments, I think laumars' sibling response to mine makes a much better argument—ultimately, you're still choosing who to pay and who to trust, and if things go down, there's nothing you can do to fix it. "Low-tech" cloud providers like Hetzner, Colo providers, amazon, PaaS—in a physical downtime event like this one, they're all the same.
@nostrebored Well, Hetzner never went down for the 7 years I managed a HA setup spanning three of their data centers. One of the DCs was unavailable for a few hours during a planned moving op, but we had no outages. None.
That's not what you've seen recently. When Hetzner goes down nobody cares, because nobody with important workloads and brain cells is running them there.
Colo space assumes that the colo is operating more efficiently than AWS/Azure/GCP when in reality you’re comparing apples and oranges.
It is much easier than you think. There are well-defined standards, trained tradespeople, and a whole host of companies who make great products and provide after-sales services to do it. Every major financial services, telecom and high-precision manufacturing company runs its infra this way. It is definitely less niche than rolling your own encryption.
financial services, telecom and high-precision manufacturing companies
One of these things is not like the other, one of these things is not the same...
What use does a CNC shop have for an extensive on-prem multi-DC with failover and high availability? It'd be like buying your own snowplows to make sure that the road is clear so your employees can get to work. Maybe necessary if you live in a place with very bad snowplows and no existing infrastructure, but in most places, just a waste of money.
My analogy wasn’t saying it’s niche. It was comparing the difficulty. And yes, there are trained people (I’m one of them ;) ) but that doesn’t make it easy, cheap, nor even less error prone than using cloud services. Which was my point.
Also the reasons those companies usually run their own infra is historically down to legislation more than preference. At least that’s been the case with almost all of the companies I’ve built on prem HA systems for.
> You need multiple physical links in running to different ISPs because builders working on properties further down the street could accidentally cut through your fibre.
At my last job we provided redundant paths (including entry to your building) as an add-on service. So you might not need two ISPs if you're only worried about fiber cuts. You could still be worried about things like "we think all Juniper routers in the world will die at the exact same instant", in which case you need to make sure you pick an ISP that uses Cisco equipment. And of course, it's possible that your ISP pushes a bad route and breaks the entirety of their link to the rest of the Internet.
> You need a back up generator and to be a short distance away from a petrol station so you can refuel quickly and regularly when suffering from longer durations of power outages.
I don't see why the petrol station needs to be a short distance away. Unless the plan is to walk to the petrol station and back (which should not be the plan[1]), anyplace within reasonable driving distance should do.
[1] long duration electrical outages will often take out everything a short distance away, and the petrol stations usually have electric pumps.
Because there are laws on what you can and cannot fill with fuel. So you may find you have to make smaller but more frequent visits.
Also, buying fuel from a petrol station is going to be more expensive than having a commercial tanker refill your tank. So ideally you wouldn't be making large top-ups from the local petrol station except during exceptional outages.
As for wider power outages affecting the fuel pumps, I suspect they might have their own generators too. But even if they don't, outages can still be localised (eg road works accidentally cutting through the mains for that street - I've had that happen before too). So there's still a benefit in having a petrol station nearby.
To be clear, I’m not suggesting those petrol stations should be 5 minutes walking distance. Just close enough to drive there and back in under half an hour.
A typical multi-MW power-hungry high-tech facility (datacenters, manufacturing, hospitals etc) will have large underground fuel storage tanks big enough to run the full load on generators for a couple of days, and they are continuously kept refilled via fuel tanker trucks through contracts with bulk fuel distributors. They usually have an SLA of a 40KL tanker on 4 hours' notice. In case of advance warning of heavy rains/floods or other natural disasters that can disrupt road networks, they can have more fuel trucks stationed close by as stand-by. Depending on your contract, you may have priority over other customers in the area. These are fairly standard practices.
Indeed, but that wasn't the type of facility the GP was talking about when they said running web services was a solved problem.
If you do move to an established data centre then you're back to my earlier point that you're still dependent on their services instead of having the ownership to fix all the problems yourself (which was the original argument the GP made in favour of switching away from the cloud).
> I don't see why the petrol station needs to be a short distance away
some natural disasters can render driving trickier than walking. extremely large snow storms, for instance. you can still walk a block, but you might be hard pressed to drive 5 miles.
(i don't have a bone in this particular cautiousness-fight; personally i'd just suggest folks producing DR plans cover the relevant natural disasters for the area they live in, while balancing management desires, and a realistic assessment of their own willingness to come to work to execute a DR plan during a natural disaster.)
You really don't need almost any of this stuff. If you have small on-prem needs just grab a couple fiber links, try for diversity on paths for them (good luck), add some power backup if it fits your needs, and be done.
If you are going to the level of the above, you go with co-location in purpose built centers at a wholesale level. The "layer1" is all done to the specs you state and you don't have to worry about it.
On-prem rarely actually means physically on-prem at any scale beyond a small IT office room. It means co-locating in purpose built datacenters.
I'm sure examples exist, but the days of large corporate datacenters are pretty much long over - just inertia keeping the old ones going before they move to somewhere like Equinix or DRT. With the wholesalers you can basically design things to spec, and they build out 10ksqft 2MW critical load room for you a few months later.
A few organizations will find it worthwhile to continue to build at this scale (e.g. Visa, the government) but it's exceptionally small.
> You really don't need almost any of this stuff. If you have small on-prem needs just grab a couple fiber links, try for diversity on paths for them (good luck), add some power backup if it fits your needs, and be done.
Then you’re not running HA and thus the argument about cloud downtime being “worse” than on prem is moot.
Obviously if your SLA is basically “we will do our best” then there are all sorts of short cuts one can take. ;)
My (late) point is that when people speak about "on-prem" these days, they are not talking about building corporate datacenters on campus.
30 years ago when you talked on-prem that's what this meant. It's now shifted to on-prem meaning your own hardware in massive shared facilities that handle all that "hard stuff" like redundant power and cooling for you.
Bespoke datacenter builds for true-on prem certainly exist, but it's not what that term typically means any longer - at least in my line of business. When I'm selling racks of colo now, my customers are calling that their on-prem facilities.
In fact a large part of my previous business was dismantling true "on-prem" facilities to move to such large shared wholesalers.
Does it have its own gas well? My sister has a home backup generator, they lost power during some cold snap and some pumping component failed and her neighborhood lost gas too. The only house in the neighborhood that had heat/power had a big propane tank because it was built before the neighborhood got gas.
I’ve never seen a data center with natural gas backup power. But I don't know if that's because of reliability or if it's too expensive for a big natural gas hookup that's used rarely. Though I have heard of the opposite -- using natural gas turbines as primary power and utility power as backup.
The human-capital side would disagree with that, I think. You're assuming the organization which owns this small/medium web app already has the personnel on staff to handle such a thing.
If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.
That said, I think you're spot-on as long as you have the skillset already.
> Human capital side would disagree with that I think
I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.
> I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.
Bare metal infrastructure requires a lot more management at any given scale. I mean, you can run stuff that lets you do part of the management the same as cloud resources, but you also have to then manage that software and manage the hardware.
I think if you put a bit of effort into classifying importance, you can likely justify backing up certain critical systems in more than one way. Let "the cloud" handle everyone's desktop backups and all the ancillary systems you don't really need immediately to do business, but certain important systems should perhaps be backed up both to the cloud and locally, like Windows Domain Controllers and other things you can't do anything without.
Backup is cheap when you're focused about what you're backing up.
In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!
"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.
> In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon.
It's possible to go down in a mostly uncorrelated way to Amazon by just being down all the time.
Obviously this is implicit in your comment, but I'll say it anyway: your backups need to actually work when you need them. You need to test them (really test them) to make sure they're not secretly non-functional in some subtle way when Amazon is really down.
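A hedged sketch of what a scheduled restore test might look like - bucket, key layout, and manifest format are all assumptions for illustration:

    import hashlib
    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-offsite-backups"

    def latest_dump_matches_manifest(dump_key: str, manifest_key: str) -> bool:
        # The manifest is assumed to hold the expected SHA-256 of the dump.
        manifest = json.loads(
            s3.get_object(Bucket=BUCKET, Key=manifest_key)["Body"].read()
        )
        body = s3.get_object(Bucket=BUCKET, Key=dump_key)["Body"]
        digest = hashlib.sha256()
        for chunk in iter(lambda: body.read(1024 * 1024), b""):
            digest.update(chunk)
        return digest.hexdigest() == manifest["sha256"]

A fuller test would go on to load the dump into a scratch database and run a few sanity queries, since "checksum matches" is not the same as "restorable".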
It really depends on how reliable you need to be. Don't forget you get downtime from both AWS and your own issues, so even 4 9's is off the table with pure AWS. If you need to be more reliable than AWS, you need to run a hybrid inside and outside of AWS, which means most of the advantages of running on AWS go away.
Many businesses claim 4 9 SLAs on AWS, but that doesn't mean they actually provide it. It's simply a question of what the penalties for failing to reach their SLA are.
> Can you make your on prem infrastructure go down less than Amazon's?
Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged in to a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.
About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.
EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D
Not my company, but I work with another company that does (nearly?) all of their infrastructure on premise. They have pretty great uptime, in a large part because they're not dependent on the 3-4 global state mechanisms that consistently cause outages with cloud providers (DNS, BGP, AWS's role management/control plane, &c.).
I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.
forgive me repeating myself: AWS Zones are not truly independent of each other.
Global services such as route53, Cognito, the default cloud console and Cloudfront are managed out of US-East-1.
If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.
it does not matter if you're in timbuktu-1, you are dead in the water.
it is a myth that amazon availability zones are truly independent.
please stop blaming the victim, because you can do everything right and still fail if you are not aware of this; and you are perpetuating that unawareness.
"If you're affected by us-east-1 outages then you're not hosting in other regions and you're doing it wrong".
Except: You can be affected by this outage if you did everything right. You're putting blame on people being down for not being hosted in different regions when it would not help them. You've effectively shifted blame away from Amazon and onto the person who cannot control their uptime by doing what you said.
> "If you're affected by us-east-1 outages then you're not hosting in other regions and you're doing it wrong".
You are attributing a quote to me which I never expressed, nor was that expressed elsewhere in this thread. You are even using quotation marks....
I certainly didn't mean to blame anyone. You appear to see this AWS issue as one of victims and victimizers. I was just trying to point out an agency that people may have in some situations.
My last startup migrated from Verizon Terremark after the healthcare.gov fiasco several years ago. We also suffered from that massive outage and that was the final straw in migrating to AWS.
At AWS, we built a few layers of redundant infrastructure with multi-AZ availability within a region and then global availability across multiple regions. All this was done at roughly half the cost of the traditional hosting, even when including the additional person-hours required to maintain it on our end.
Keeping our infra simple helped that work, and it's literally been years since an outage caused by any AWS issues, even though there have been several large AWS events.
Every time one of these conversations happen I end up thinking to myself that Oxide Computing needs three more competitors and a big pile of money.
AWS maintains a fiction of turnkey infrastructure, and the reality of building your own is so starkly different that I haven't seen an IT group for some time that could successfully push back on these sorts of discussions.
Building your own datacenter is still too much like maintaining a muscle car, fiddly bits and grease under your fingernails all the time, meanwhile the world has moved on, and we now have several options in soccer mom EVs that can challenge a classic Corvette in the quarter mile, and obliterate its 0-60-0 time. There is no Hyundai for the operations people, and there should be.
I don't know the physics of shipping such a thing, but I think we really do need to be able to buy a populated and pre-wired rack and slot it into the data center. Literally slot it in. If you've ever been curious about maritime shipping, you know that they have a system for securing containers to cranes, trailers, each other, and I don't see a reason you couldn't steal that same design for mounting a server rack to the floor. Other than the pins would need to be removable (eg, a bolt that screws into a threaded hole in the floor) so you don't trip on them.
In a word, we need to make physical servers fungible. There are any number of things that we need to do to get there, but I think we can. Honestly I'm surprised we haven't heard more of this sort of talk from Dell, especially after they bought VMWare. This just seems like a huge failure of imagination. Or maybe it's simply a revolution lacking a poster child. At this rate that 'child' has already been born, and we are just waiting to see who it is.
I don't think putting the hardware into the rack is really the sticking point; what people like about the cloud is that it abstracts all kinds of details away for them and provides a cohesive system to manage it. AWS, Azure and Google are actually selling something like what you are talking about now [1], where for whatever legal/legacy/performance reason you need it on-prem but still want to pay AWS 5x the cost just to give you the same management interface, and they have some kind of pod they slap into your data-center.
What does it tell you that there is a market for this, where essentially what you are buying from them is a management and control plane, when other companies like BMC have been selling that as a standalone product for decades (and for the most part failing to live up to their customer's actual expectations)?
edit: I actually think a big pull of the cloud is also about shutting down archaic internal IT organizations that have been slowing people down so that it takes weeks and weeks to launch a simple new webservice. Better to give your programmers a cloud account and let them get shit done.
Honestly, I have an app in production that isn't completely hardened against single zone outages. There was pressure to turn off some redundancy in our caching infra, and not every backend service we call is free of tenant affinity, so we could well lose at least 1/3rd of our customers in a single AZ failure in the wrong region, or have huge latency issues for all of our tenants based on high cache miss rates.
Having written this, I'm going to ping our SME on the cache replication and remind him that since the last time he benchmarked it, we've upgraded to a newer generation of EC2 instances that has lower latency, and could he please run those numbers again.
As I discovered many years ago when our infra was only in US-EAST1, failures were also easier to explain since many, many other companies would be offline as well. It made it more of an "Internet problem" than our own company's problem. For whatever reason, customers were far more likely to accept those kinds of outages.
You could have just shown them historical data of both companies being unavailable for extended amounts of time. What happened in the past few months is not new.
AWS or Google or any other reputable cloud provider is still a far better option than your local backup. The only way I see you losing your data is your account getting locked.
> Offline/alt-cloud backups are probably a lot cheaper than you think, and will win you points during any audit.
With the caveat that you're going to have to implement all your access controls, monitoring and compliance mechanisms on those alternate backups. No point winning points during an audit for having backups outside AWS if you lose even more points for "backups weren't properly secured against unauthorized access".
And you're regularly restoring from those alternate backups as well to check their integrity, right?
I doubt it. The complexity of multi-cloud will also give you downtime.
Most of the folks impacted by cloud outages do not have highly available systems in place. Perhaps, for their business, the cost doesn't justify the outcome.
If you need high uptime for instances, build your system to be highly available and leverage the fault-domain constructs your provider offers (placement groups, availability zones, regions, load balancing, DNS routing, autoscaling groups, service discovery, etc). For instances, double down and use spot instances and maximum lifetimes in your groups so that you're continuously validating that your application can recover from instance interruptions.
If you're heavy on applications that leverage cloud APIs, as is often the case with lambdas, then strongly consider multi-region active/active, as API outages tend to cross AZs and impact the entire region.
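A rough boto3 sketch of the spot-plus-max-lifetime idea - the launch template, subnets, and sizes here are made up:

    import boto3

    autoscaling = boto3.client("autoscaling")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="example-web-asg",
        MinSize=3,
        MaxSize=12,
        VPCZoneIdentifier="subnet-aaaa,subnet-bbbb,subnet-cccc",  # spread across AZs
        MaxInstanceLifetime=604800,  # recycle every instance at least weekly (seconds)
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "example-web-template",
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": "m5.large"}, {"InstanceType": "m5a.large"}],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 1,
                "OnDemandPercentageAboveBaseCapacity": 0,  # everything above base is spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    )

Because instances are constantly being replaced, you find out quickly whether the application really tolerates interruptions.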
To do it, first I would not use any cloud features that cannot be easily set up in another cloud. So no lambdas. Just k8s clusters, and maybe DBs if they can be set up to back up between clouds. I was able to migrate from AWS k8s to DO k8s very easily... just pointed my k8s configs to the new cluster (plus configuring the DO load balancers).
In my case, I need the dynamic DNS (haven't looked into it yet), auto-scaling is already set up with k8s, and the DB backups between DBs (next project).
You’re still in the wrong, don’t be so smug. These few downtimes are no big deal in the grand scheme of things, and your proposed solution would have been more work and headaches for little to no realizable gains, and not to mention the cybersecurity ramifications. Quite frankly, they are probably glad that you’re gone and not around to gloat about every trivial bit of downtime.
Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <-Quite literally their words.
Me: leaves cause that treatment was the final straw
Amazon and Facebook both go down within a month of each other, and supposedly they needed backups
Them: shocked pikachu face