In the limit, there are some startups that could run production on a single Linux host - I recently helped one get off Heroku and their spend went from ~$1k/mo to ~$50/mo and it made debugging and figuring out performance issues so much easier than what they were doing previously...
I've worked at many places where everything was run off single-instance (usually Windows) VMs.
These systems were rarely constrained by scaling issues that would have been solved by scaling horizontally, and in many cases the added latency of doing so would have caused more problems than it solved.
And as you say, having everything on one or two VMs is not just orders of magnitude cheaper to host, it also makes debugging and performance monitoring much easier.
These weren't tiny start-ups either; these were long-running services contracted out to clients, including the government and other major companies.
It doesn't scale infinitely, but I'd wager that in 99% of cases the traffic just isn't there to justify these kinds of setups, and that some of the scaling is only needed because all the added latency, service discovery, etc. creates overhead that wouldn't exist without it.
I've often found it odd that people who strive for YAGNI at the code level don't apply the same to the system architecture level.
I'm very much with you on this, but I do understand that it's one of those things that is just not feasible when your team has no sysadmin/devops experience.
You were able to do it, but what happens to them when you're not around? Does their team have the required experience to handle it? That's the difference in cost. It's like DIY - yes, if I have all the skills and experience I can do everything myself incredibly cheaply, but... I don't. So I gotta pay.
But if your team does have the skills and experience, it's definitely worth looking into.
I do think people deeply underestimate what can be achieved with a single (or two, if you need a standby) dedicated Linux server these days. A single server can easily go up to 2TB of RAM and 128+ cores. Long before you ever get to a scale where that's a limitation, you'll have more than enough resources to figure things out.
Funnily enough, a small number of servers that you want to utilize as fully as possible is pretty much one of the original use cases for Kubernetes.
My first deployment involved virtual machines, but we were essentially packing as much as possible into the smallest number of VMs, then doubling that up so we had failover. That way we had clear visibility into how many resources we were using and could allocate them according to how much we had.
> You were able to do it, but what happens to them when you're not around?
This is why you pay someone. Saving $950/month means it's well worth spending $500 for a day of someone's time occasionally. You don't have to do everything internally when you run a startup. Buying in services that you only need occasionally is well worth the money.
Are there contractors out there who will take on-call shifts? Because it seems unlikely, and if your proposal is "put into production a system where you'll have to spend $500/day every time it goes down and wait 2-4 business days for a resolution" then you're a braver person than I am.
Obviously not. You don't pay someone to just set it up. You pay them to help you do what you'd do if you had a dedicated DevOps team. You pay someone to set up the system with your team so they understand it, train your team to use it, write some documentation about it, script a rollback procedure, maybe help develop playbooks, etc.
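And "script a rollback procedure" really can be a one-pager. Just as a rough sketch, assuming deploys are git tags and the app runs under systemd - every name here (my-service, /srv/my-app, the deploy-* tag scheme) is made up:

    #!/usr/bin/env bash
    # rollback.sh - switch back to the previously deployed tag and restart.
    set -euo pipefail
    cd /srv/my-app
    # Current deploy is the newest "deploy-*" tag; the previous one is line 2.
    prev_tag=$(git tag --list 'deploy-*' --sort=-creatordate | sed -n 2p)
    [ -n "$prev_tag" ] || { echo "no previous deploy tag found" >&2; exit 1; }
    git checkout "$prev_tag"
    sudo systemctl restart my-service
    echo "rolled back to $prev_tag"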
Besides, there are people out there who offer on-call services for a small retainer.
This is optimistic to say the least. I've worked as an SRE for 5 years, and apart from the others on the team, the devs don't have nearly as much knowledge. There's no way I'd rely on them to fix an outage.
And even on a small retainer you'd better hope they retained the knowledge of how all that stuff works if you're only calling on them every now and again.
The idea is devops as _culture_ - I come in as an expert and set it up, then show them how I did it, then run through various disaster recovery scenarios so they learn it and can handle the vast majority of problems.
And you'd be surprised how few problems you might have - I've had many VMs with literally years of uptime running without issue.
Most people focus on "devops" as a job - and they never bother to teach the rest of the team anything about how stuff works. Worse than that, modern clouds encourage you to build opaque and complicated systems which probably only someone working on them full-time has a hope to understand...
If it were so easy for devs to pick up SRE, companies wouldn't be struggling to find good people.
Culture doesn't mean they'll know how to fix <some weird edge case> at 3am. The SRE with constant ops exposure is likely to have a much better chance, if only because they (ought to) really know how to debug it.
> I've had many VMs with literally years of uptime running without issue.
I hope they all have fully patched base libraries and kernels; security auditing is becoming a much more common requirement these days, even among very small companies - for example, anybody using Facebook Login.
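To be fair, the routine patching part is small. On a Debian/Ubuntu box it's roughly this (a sketch, not a hardening guide):

    # Install and enable automatic security updates.
    sudo apt install unattended-upgrades
    sudo dpkg-reconfigure -plow unattended-upgrades

    # Check whether an applied update (e.g. a new kernel) still needs a reboot.
    [ -f /var/run/reboot-required ] && cat /var/run/reboot-required

The reboot check is the part that eats into those multi-year uptime numbers: kernel patches don't take effect until you actually restart.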
> And even on a small retainer you'd better hope they retained the knowledge of how all that stuff works if you're only calling on them every now and again.
This is why we document things at the company I work for. If you're serious about a project and want it to exist for a long time, there will be things that only come up every few years, and you won't remember them. You can either write things down or relearn how things work every time. Writing things down is much easier.
> I'm very much with you on this, but I do understand that it's one of those things that is just not feasible when your team has no sysadmin/devops experience.
But this applies to everything. You also need Heroku or Kubernetes or whatever experience to maintain those systems, right?
The question is how much you need to know. You need much less knowledge to run something in Heroku than in k8s; dramatically less in the "onboard a CRUD app" case. I'd argue that running k8s effectively is no less knowledge-intensive than running VMs.
> I'd argue that running k8s effectively is no less knowledge-intensive than running VMs.
LOL - are you f'ing kidding?
Running VMs is much closer to running your app locally than containers and k8s - it's not that hard to do badly, and only marginally harder to do well.
> I'm very much with you on this, but I do understand that it's one of those things that is just not feasible when your team has no sysadmin/devops experience.
Exactly - $950/mo is nowhere near enough to pay for those skills if you don't have them. It's good value for money.
I say it's over-rated. You don't need a huge amount of "sysadmin/devops"; that's only become a thing since we started calling it that. It used to be that backend devs just had intimate knowledge of how their service ran on a system, and most likely had to log in and debug a multitude of issues with it. 99% of backend devs used to be more than capable (maybe not so much anymore, now with "devops" et al) of administering a system and keeping it chugging along, as well as setting it up. It might not be 100% bulletproof or consistent or whatever, but more than enough for a company or service starting up.
We've lost that, and now everyone thinks we need "sysadmin/devops" for simple and moderately complicated deploys. Heck, most of the guides are out there - follow them. Also, ready-made images and Docker containers are amazing these days, with config built in. If you look for or hire devops, you get K8s and all the "formal" items they'll bring with them. You don't need CI/CD for the first 6 months; just PoC the damn thing, get an MVP out and start getting users/customers, the rest will follow.
I've worked in DevOps for a while and if I could pay $950 to not run and maintain a server then I'd consider it money well spent.
There's always 1-2 comments in these threads that advise using a Linode VM or Hetzner dedicated server to save money, but they really skip over the headaches that come with building and maintaining your own servers:
- Are the server provisioning scripts source controlled?
- Is there a pipeline if the server needs to be recreated or upgraded?
- How is the server secured?
- Are the server logs shipped somewhere? Is there monitoring in place to see if that is working?
- Does the server support zero-downtime deployments?
- Are backups configured? Are they tested?
I imagine the answer to a lot of these questions is no, and to be fair not all PaaS systems provide all of these features (at least not for free).
The server becomes a pet, issues inevitably arise, and that $950 starts to look like a false economy.
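Even the "easy" items on that list mean owning scripts like these yourself. A rough sketch of the minimum, with made-up names and a Debian-ish box assumed:

    # provision.sh - idempotent setup kept in the app's git repo, re-runnable
    # whenever the box has to be rebuilt (package list is illustrative).
    set -euo pipefail
    apt-get update
    apt-get install -y nginx postgresql ufw fail2ban unattended-upgrades awscli
    ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 443/tcp
    ufw --force enable

    # backup.sh - nightly dump shipped off the box; restoring it into a scratch
    # database occasionally is the "are backups tested?" part.
    sudo -u postgres pg_dump myapp | gzip | aws s3 cp - "s3://myapp-backups/$(date +%F).sql.gz"

It's not much code, but somebody has to write it, keep it working, and remember it exists.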
I think this all ignores the opaqueness of most PaaS providers - when everything is on a single box, you have infinite observability, standard Linux perf analysis tools, etc.
If you do it correctly, it is loads easier to reason about and understand than a complex k8s deployment.
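By "standard Linux perf analysis tools" I mean the stock first-pass toolbox; roughly this, give or take your distro (the service name is hypothetical):

    top                          # or htop: CPU, memory, per-process view
    iostat -xz 1                 # disk and I/O pressure (sysstat package)
    ss -tulpn                    # listening sockets and open connections
    journalctl -u my-service -f  # follow the app's own logs
    sudo perf top                # where CPU time is actually going

None of that needs a vendor dashboard or an export pipeline; it's all already on the box.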
The amount of traffic you can serve on a bunch of Linode boxes is pretty high. I know this sounds like a boomer yelling at the clouds -- see what I did there?
Kubernetes is the right solution to a difficult problem you may or may not have in your future.
> In the limit, there are some startups that could run production on a single Linux host
I guess redundancy is not really a thing then?
With serverless offerings you can get rather good deals. I don't think you need your own K8s cluster if you can get away with a single Linux host, but a single Linux host is pretty pricey maintenance-wise compared to Google Cloud Spanner and Cloud Run.
Most of the time it really doesn't need to be. In the end what you care about is uptime and cost. A redundant solution doesn't have a perfect uptime just because it's redundant, in fact sometimes it might have even less uptime because of failures in the redundancy mechanism. Of course if you need to be always up it might be worth it. But for a lot of situations some downtime is acceptable and most of the time it's better to be on a simpler setup that's easier to debug and less costly to maintain, than to be on a complex one that requires more maintenance and still doesn't have perfect uptime.
Also, I wouldn't underestimate the cost of maintenance in managed solutions compared to self-hosted. It wouldn't be the first time that the managed solutions screw something up or apply some changes and you don't know whether it's your fault or theirs. You also have to add the cost of adapting your solution to their infrastructure, which is not trivial either.
> A redundant solution doesn't have a perfect uptime just because it's redundant, in fact sometimes it might have even less uptime because of failures in the redundancy mechanism
I'm fairly sure that Google does a better job keeping Cloud Spanner and Cloud Run working, and their redundancy mechanisms working, than whoever runs your single Linux box will do.
> most of the time it's better to be on a simpler setup that's easier to debug and less costly to maintain
Keeping a whole Linux box running is more complicated and requires more maintenance than Cloud Run and Cloud Spanner.
> Also, I wouldn't underestimate the cost of maintenance in managed solutions compared to self-hosted.
It is not 0; it is just that if you manage Cloud Run and Cloud Spanner you don't manage all the other things you have to manage when you self-host, and managing Cloud Run and Spanner is really not a lot of effort - it is a lot less effort than managing a standalone database, at least.
> You also have to add the cost of adapting your solution to their infrastructure, which is not trivial either.
Cloud Run can run stock-standard Docker containers. You will have a bad time if your processes are not stateless, though, and you will have the best time if you have a 12-factor app, but I would not count that as adapting to infrastructure.
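For reference, getting an already-built container image onto Cloud Run is roughly one command (project, image and region names here are made up):

    gcloud run deploy my-service \
      --image gcr.io/my-project/my-service:latest \
      --region us-central1 \
      --allow-unauthenticated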
> > A redundant solution doesn't have a perfect uptime just because it's redundant, in fact sometimes it might have even less uptime because of failures in the redundancy mechanism
> I'm fairly sure that Google does a better job keeping Cloud Spanner and Cloud Run working, and their redundancy mechanisms working, than whoever runs your single Linux box will do.
You'd be surprised - the major cloud providers have outages all the time.
Google in particular will have some random backing service firing 502s seemingly at random while their dashboards say "all good".
> I'm fairly sure that Google does a better job keeping Cloud Spanner and Cloud Run working, and their redundancy mechanisms working, than whoever runs your single Linux box will do.
Most of the uptime loss won't come from your provider but from your applications and configuration. If you use Cloud Run and mess up the configuration for the redundancy, you'll still have downtime. If your application doesn't work well with multiple instances, you'll still have downtime.
> Keeping a whole Linux box running is more complicated and requires more maintenance than Cloud Run and Cloud Spanner.
Is it? I keep quite a few Linux boxes running and they don't really require much maintenance. Not to mention that when things don't work, I have complete visibility into everything. I doubt Cloud Run provides you with full visibility.
> It is not 0; it is just that if you manage Cloud Run and Cloud Spanner you don't manage all the other things you have to manage when you self-host, and managing Cloud Run and Spanner is really not a lot of effort - it is a lot less effort than managing a standalone database, at least.
Managing a simple standalone database is not a lot of effort. For most cases, especially the ones that can get away with running production on a single box, you'll be fine with "sudo apt install postgresql". That's how much database management you'll have to do.
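To put a rough shape on "not a lot of effort", the whole thing for a small single-box Postgres is something like this (database, path and schedule are made up):

    sudo apt install postgresql
    sudo -u postgres createdb myapp

    # A nightly logical backup in the postgres user's crontab covers most of
    # the remaining "management":
    sudo install -d -o postgres /var/lib/postgresql/backups
    # crontab -u postgres -e
    0 3 * * * pg_dump myapp | gzip > /var/lib/postgresql/backups/myapp-$(date +\%F).sql.gz

Copying those dumps off the box and occasionally restoring one is the rest of it.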
> Cloud Run can run stock-standard Docker containers. You will have a bad time if your processes are not stateless, though, and you will have the best time if you have a 12-factor app, but I would not count that as adapting to infrastructure.
That definitely counts as adapting to infrastructure. For example, if I want to use Cloud Run my container should start fairly quickly; if it doesn't, I need an instance running all the time, which increases costs.
I'm not saying Cloud Run/Spanner are bad; they have their use cases. But for simple deployments it's more complexity, more things to manage, and also more expensive. If "apt install postgres; git pull; systemctl start my-service" works and serves its purpose, why would I overcomplicate things with redundant systems, managed environments and complex distributed platforms? What do I stand to gain, and at what cost?
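The only thing hiding behind "systemctl start my-service" is a unit file, and even that is small. A sketch, with the service name, user and paths all hypothetical:

    # Write a minimal systemd unit for the app.
    sudo tee /etc/systemd/system/my-service.service > /dev/null <<'EOF'
    [Unit]
    Description=my-service
    After=network.target postgresql.service

    [Service]
    User=myapp
    WorkingDirectory=/srv/my-app
    ExecStart=/srv/my-app/bin/server
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF

    sudo systemctl daemon-reload
    sudo systemctl enable --now my-service

That's the entire "platform" for a deployment at that scale: systemd restarts it if it crashes, journald collects its logs, and a git pull plus a restart is a deploy.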