
From my experience, the $500 Intel NUC in my basement has greater reliability than nearly every single company which calls their IT departments "SREs" now.

As an old school IT professional, I would recommend considering most of these companies' advice "how not to engineer reliability".

It may be that beyond a certain scale real reliability just isn't achievable anymore, but then anyone below that scale should also avoid SRE practices. Aka, do not use this for your side project.




> ...company which calls their IT departments "SREs" now.

My org recently made a bunch of people SREs, with no one leading the charge except for a Director who involves himself only enough to be effectively seen as a taskmaster, and who is otherwise forgetful, disorganized, egotistical, lacks follow-through, constantly spreads his team too thin, and has put my direct boss in a REALLY terrible spot more than once after my boss did exactly what he was instructed to.

A bunch of people were moved from roles they previously excelled at into this new org structure and called SREs. Hired a few more off the street. What are they doing? 25% help desk tasks, 25% break fixes on legacy systems, 40% fixing their own damn laptops, 10% having panic attacks. Roadmaps? Lol. Planning? Lol. SLO? SL-NO. Error budget? I dunno, expense it.

We had to have a Zoom call this morning to explain Terraform and get approval to use it. Not even Terraform Enterprise (which has a cost), just plain Terraform.

Management essentially threw a bunch of shit at a wall, stood back, watched as it slid down towards the carpet, and found themselves pondering why the shit won't stick.

Since I'm not using my main account, which is a bit too revealing, I have NO qualms admitting a brutal truth about the situation in which I find myself: I am milking this company for every ounce of knowledge, upskilling my dev chops, and I am gone come summer.

The pay is JUST good enough that I can take my time and be VERY selective about the next job, because good lord, if I end up in another shop like this ever again it will be 100% my fault. I'm about to start being stupidly picky about who gets my labor; my only regret is not seeing this value sooner.

Sorry for the rant but I'm mad and snowed in and that part of your comment really brought some shit up.


This is one of the best comments from a green handle I’ve seen in a while. Rant or not, it adds to the site; just this weekend, I was lamenting some very low quality green account comments and contemplating whether a 12-hour waiting period or de-ranking of green would make HN better. This argues strongly against.


Having a similar situation myself, actually. In a team of SREs that's "SRE" only in name. Just a bunch of ops guys at the bottom of the food chain[1]. When I came over, and I shit you not, there were PagerDuty alerts every 15 seconds at peak times, multiple at once at that. And for years they thought it completely fine. Took me 6 months of employing my best soft skills to even be allowed to tweak some(!) of those alerts. Believe it or not, we have tens of millions of users each day and at least 50 "micro" services. It has been a complete mess with a lot of reliability problems in the last couple of months (failed projects getting "second chances" - sunk cost fallacy, etc.). Every few weeks some dev team organizes a meeting with us and basically says: "hey guys, you've earned [newspeak for: we're giving you our shit we don't want to support so we can move on to the next big thing] this service, you're now the proud owners of it, glhf".

I'm also like you, preparing for my next job, and this is the first time in my career that I actually know what questions to ask in the next job interview. The experience here is that bad. The only problem I have right now is that I'm so ridiculously micromanaged at work and interrupted every few minutes (or more often) that I just cannot focus on learning anything of substance. So I'm trying to do it in my free time. I'll get there, but I agree, some changes you try to make on a job take years to land (if they happen at all). Life is too short to change a lot of short-sighted, entrenched people.

[1]: A word of caution for anyone: you cannot be a decision maker (and SREs are supposed to at least sit where decisions are made) if you're at the bottom of the hierarchy. So be wary of jobs with great-sounding titles where you cannot change anything for the better.


One of the more ridiculous comments I’ve ever seen on HN. It might be the most ridiculous.

I’m guessing you can build AWS in a weekend and it would only cost a few thousand or less in hardware costs, and you can host it all on your cable connection.


> I’m guessing you can build AWS in a weekend

The OP did mention "scale" in his/her comment. The great majority of projects being started now won't get to AWS's size, hence they won't need to use AWS's way of doing things (or Google's, or Facebook's).


Funnily enough, AWS's way of doing things is quite different from Google's, and I think there are pros and cons to both.

Google's way is SRE as mentioned here, and I don't have experience to say more there. However, at least in my bubble of the world, AWS's way of "you build it, you run it" is quite popular (and quite effective IMO) for small companies up to any scale.


The Google SRE approach isn't that different. You can ship systems that don't follow the SRE playbook; SREs just aren't going to take on-call for them.

If you want a different team to be on-call for your application, though, there are baseline standards that you have to comply with, and if you breach those standards down the line they're going to hand the pager back to you until you're up to scratch.


AWS does have something similar to SREs though, at least in terms of skill sets. AWS has system development engineers (sysdevs) and systems engineers. When we created the SysDev role we specifically chose not to call it SRE because we didn't want people to think of it as a Google-style SRE.

The intent of SysDev is to create and maintain the internal, non-customer-facing services. This includes writing code and creating services that maintain the reliability of the service/system. It's usually related to the infrastructure in some way, whether servers or networking, but it also extends to understanding how all the different subsystems of the AWS product work together.

The core difference between sysdevs and SREs is that SREs often take over a product from an SWE team once it's reliable, then maintain and improve it. Sysdevs create an internal product and maintain it through its whole life.

Of course in AWS not all orgs follow the intent and often implement the role differently.


> I’m guessing you can build AWS in a weekend

No, but a service on my NUC has better uptime than a service on AWS, and costs less, so why would I want to build AWS?


Because your service could handle neither a million concurrent users tomorrow nor a power outage. Those probably aren't your requirements, but they are for most companies.


Only a tiny portion of companies ever reach a million concurrent users, or even have plans to; having enough users to pay the bills is already quite good.

A power outage can be dealt with just like in the old days: with a UPS unit.


> Because your service neither could handle a million concurrent users tomorrow nor a power outage.

SRE here.

92% of all companies can't handle this either.


> million concurrent users

Sure, but 99% of sysadmins/devops/SREs/whatever will never work on anything that has a million concurrent users.

Most startups will never reach a million concurrent users. And if they do, investors will happily shovel as much money at them as needed to make the site work at that scale.

Hell, even a million monthly users is a nice milestone that most projects never reach, and that usually translates to a couple thousand concurrent requests (peak), which an average laptop could handle.
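
Rough Python back-of-envelope; every traffic number here is an assumption for illustration, not a measurement:

    # Back-of-envelope: monthly actives -> peak request rate.
    # All inputs are assumed/illustrative.
    monthly_users = 1_000_000
    sessions_per_user = 10        # sessions per user per month, assumed
    requests_per_session = 50     # assumed
    seconds_per_month = 30 * 24 * 3600

    avg_rps = (monthly_users * sessions_per_user * requests_per_session
               / seconds_per_month)
    peak_rps = avg_rps * 10       # assume peak is ~10x the average

    print(f"avg: {avg_rps:.0f} req/s, peak: {peak_rps:.0f} req/s")
    # -> avg: ~190 req/s, peak: ~1900 req/s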


I absolutely have a UPS in my basement. Considering how little runs on it, it's pretty cheap to get a long runtime out of it too.
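
The runtime math, for the curious; battery size, efficiency, and load below are all assumed numbers:

    # UPS runtime estimate. Capacity, efficiency, and load are assumptions.
    battery_wh = 900              # a mid-size consumer UPS, assumed
    inverter_efficiency = 0.85    # typical-ish, assumed
    load_w = 25                   # a NUC plus modem/router, assumed

    runtime_hours = battery_wh * inverter_efficiency / load_w
    print(f"~{runtime_hours:.1f} h of runtime")   # -> ~30.6 h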

A lot of things I see with millions of concurrent users aren't actually monoliths: sure, Facebook needs to handle that. But most cloud apps would be better run with each tenant/business/team operating an Intel NUC in their basement, instead of the developer using the cloud as a way to force rent-seeking behavior.


Not having to deal with hardware is nice, and I don't think having datacenter grade internet access in your basement is realistic for most.


Do you need datacenter grade? Fiber can probably serve a lot of requests per second
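
Bandwidth-only napkin math (the response size is a pure assumption):

    # Requests/s a symmetric 1 Gbit/s fiber uplink could push, counting
    # bandwidth only; latency, TLS/TCP overhead, and bursts are ignored.
    uplink_bps = 1_000_000_000
    avg_response_bytes = 50_000   # ~50 KB per response, assumed

    req_per_sec = uplink_bps / 8 / avg_response_bytes
    print(f"~{req_per_sec:.0f} req/s")   # -> ~2500 req/s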


Unless your fiber has an outage; then you want redundancy, that is, multiple independent uplinks.

And if your whole region has a problem, which is more likely to happen than one might think, then you want a multi-region setup, e.g. us-west-1 and us-east-2, and then we can start to calculate the number of nines. Unless your username is ocdtrekkie, that is; he can beat AWS with a single NUC while he is sleeping.
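
Sketch of that nines calculation, under the (big) assumption that the sites fail independently; in reality regions share DNS, deploy pipelines, cert expiries, and so on:

    # Combined availability of n replicas that fail independently.
    def combined_availability(per_node: float, n: int) -> float:
        return 1 - (1 - per_node) ** n

    single = 0.99   # assume each site alone manages "2 nines"
    for n in (1, 2, 3):
        print(n, f"{combined_availability(single, n):.6f}")
    # 1 -> 0.990000 (2 nines)
    # 2 -> 0.999900 (4 nines)
    # 3 -> 0.999999 (6 nines, on paper anyway)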


Many big things started out in a garage, with very simple solutions like your little UPS powered NUC.


Including Google. Who have now hit a point where that's no longer sustainable, and developed a set of best practices to ensure reliability beyond what you can expect from a NUC in the basement.


SRE is just a buzzword for managing/operating components and infrastructure at software companies. You can have companies that do it poorly and ones that do it well. If it's an IT department that has been rebranded as SRE, then yeah, they would probably not do as well as teams that are staffed properly. There are some overlapping skills, but to do it well you need to understand the software side as well as the infrastructure.

How would you suggest that a software development team (one of many such teams in a company) with no experience scaling or making software resilient should operate? Making a blanket statement that all SRE practices are bad seems unfair. Most companies won't be able to get the talent required to do it well in each and every team. Having specialized individuals who can help guide those teams seems logical.

The SRE team can also cover the basics needed for a software company to operate with velocity and reliability. It doesn't really matter what you call it, but having the basics like logging, metrics, distributed tracing, dashboarding, and alerting managed well by one centralized team lets the component developers focus on their components and not have to worry about all that other stuff.

The old way of keeping IT and developers completely separate was crap. Working together via devops, SRE, or whatever you want to call it has in my experience been so much more helpful in building and scaling companies.


That's because no one uses the NUC in your basement.


> From my experience, the $500 Intel NUC in my basement has greater reliability than nearly every single company which calls their IT departments "SREs" now.

Probably 99% of the SRE problems I've seen are not IT problems. They're usually bugs or shortcomings in code some engineer deployed.


Well, either that or DNS


Apples and oranges though; if your business can run on a $500 NUC and its failing is not a problem, then by all means go for it.

But if you look at scaled companies, you'll see a different picture. One of AWS's sales pitches is that setting up a datacenter is a big upfront expense, you need the space, hardware, and personnel to build it, and you have to provision it for peak load.

Take e.g. a recent game like Fall Guys, that went from 0 to millions of concurrent players within days, maybe even hours. Can't run that on your $500 NUC. You couldn't buy and provision enough NUCs to keep up even if you tried.

Anyway once again, if that one works for you then stick with it. I too prefer to not go overboard with scalability and the fancy technologies of today if I can help it.


Small shops shouldn't use SRE practices, that's for sure. At small scale your infra is like a house with a couple of pets - hand-managed. But once you reach scale, you run a huge farm with thousands of cattle. That requires a different approach, not only because of the cost (too many people would be needed), but also because of the requirements (try to change anything when you have that many people involved).


> Small shops shouldn’t use SRE practices

Hard disagree.

Properly architecting software, setting reasonable SLAs/SLOs and trying to achieve them (doing postmortems when you don't), reducing toil, proper monitoring, etc. are good practices for any company that serves customers. Spend (both time and effort) in proportion to your company's size and resources and it'll serve you well.
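
The budget math scales down fine, too. A minimal sketch (the 99.9% target is just an example, not a recommendation):

    # Error-budget math for an assumed monthly 99.9% availability SLO.
    slo = 0.999
    minutes_per_month = 30 * 24 * 60

    budget = (1 - slo) * minutes_per_month
    print(f"allowed downtime: {budget:.0f} min/month")   # -> ~43 min

    # If mid-month you've already burned 30 of those minutes,
    # ~13 remain: a signal to slow down risky deploys.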

(Disclaimer: Google SRE, opinions are my own)


Yep. As an ex-Amazonian who just joined a small startup, you can say this keeps me up at night in more ways than one.

Not least because we've apparently already made agreements with 2-person SaaS companies with no consideration of SLAs whatsoever.


True, but I think a lot fewer businesses should be operated as large farms.


The NUC is ofc a bit in jest, but I still see where you're coming from.

If you take service delivery seriously, then SRE'ing and DevOps'ing are inherent in the way you work and in the leadership. You use metrics to continuously improve and don't leave things hanging. This requires a lot of work, most likely some custom tooling, a LOT of automation, and strict conventions.

If you're more of a traditional IT shop where people are winging it - separate network team, dedicated "vmware" team, and a few guys doing "storage"... well, you've been left in the dust. Even if you dedicate a team but rely on the above organization, it will be an unreliable mess. And most likely slow moving - which is why stuff gets funneled to the "cloud", although for most uses it is more expensive and lower performing. It's worth it because you no longer have to deal with the above... it saddens an old school guy like myself.

This is at least my experience from big to small, tech through enterprise.

Edit: you either look at IT as a cost center or as a strategic asset. That is where the different approaches and delivery models branch out from.


And you can implement Dropbox in a weekend with a NAS.


It’s not too difficult, if you ignore all the things which made Dropbox successful.


Just use rsync via ssh, right?


SRE, at least as it is applied at the companies where I've worked, isn't about increasing uptime at all costs - it's about hitting some reliability (or other) target given a long list of constraints. A lot of things are possible if your service can fit on a single box, but (at least where I work) that would usually violate some constraint I have.


Along the same lines, if you can avoid a distributed architecture, things get a lot more reliable. You can get a crazy amount of RAM, SSD, and CPU cores in a single machine. If you run your system on one powerful machine with a few others on hot standby, a lot of complexity goes away.


If you can run your system on a single machine, you don't need an SRE.

If you have hundreds or thousands of machines, that's an indicator that you /may/ have the complexity that requires the discipline a dedicated SRE function can bring. The tough part is companies conflating the filling of operations roles with a title named SRE, versus actually using the best practices that will help you scale and improve reliability.


Your reservation rate goes to 200%, though (a full hot standby), instead of, say, 120% to accommodate some nodes becoming unavailable.

If your hot standby is a $100/mo VM, it's not noticeable. If it's $5000/mo, less so.

To say nothing of scaling up and down with the load — which, of course, you only need if you are a pretty large operation.
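
The arithmetic behind those percentages, reusing the $5000/mo figure from above and assuming everything else:

    # Reservation overhead: full hot standby (200%) vs. sharded N+1 (~120%).
    capacity_cost = 5000               # $/mo of compute you actually need

    hot_standby_total = capacity_cost * 2.0   # big box + identical standby
    sharded_total = capacity_cost * 1.2       # e.g. 6 nodes where 5 suffice
    print(hot_standby_total - sharded_total)  # -> 4000.0 ($/mo premium)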


If you’re paying Bay Area prices for engineers, $5K a month is a steal to not have to pay people to deal with sharding.


Not even Bay Area prices. A Jr. SWE, after overhead (benefits, HR, laptop, office space, etc.), easily costs the company $150k+/year in most markets.


Indeed, but take it a step further. Two ten-thousand-dollar servers in your basement with a UPS and some rudimentary failover configuration are basically fire and forget. Remote in monthly and install updates. Done.

It'll run for ten years for next to nothing.


Until there's a power outage, flooding, malice, etc.

I think the main issue is that the cloud providers don't publish much about outages that don't affect the end-user. I mean a failed hard drive happens all the time, but S3 is never affected by that.


Depends on your bandwidth requirements. Also, if you want even higher reliability, you might consider getting two independent internet links into your basement, which is pretty doable in an urban setting.


But they won't have diverse routing :-) All it takes is a navvy with a backhoe digging in the wrong place.

And you also need diverse routing for the power coming in, plus a generator / battery room setup.


Run the backup from your friend's basement in the next town over using a different ISP. You can run the backup for them.


For a long time, my off-site backup was at my grandmother's house because it was the furthest geographic location I could give someone a box who would leave it plugged into their Internet. ;)


> As an old school IT professional, I would recommend considering most of these companies' advice "how not to engineer reliability".

Where I work, the classic on-prem bare metal and ESX based systems are just massively more reliable than on-prem or cloud Kubernetes, and take far fewer people to operate. 3 or 4 9's is easy with ESX and 5 is doable. Kubernetes barely makes it into 1 9 and might not even manage that! Still, it employs a lot of engineers, and they probably get paid more than the crusty old ESX guys too! From that point of view you should definitely push for it.
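
For reference, what those 9's translate to in allowed downtime (simple arithmetic; the only assumption is a 365-day year):

    # Allowed downtime per year for N nines of availability.
    minutes_per_year = 365 * 24 * 60

    for nines in range(1, 6):
        availability = 1 - 10 ** -nines
        downtime_min = minutes_per_year * (1 - availability)
        print(f"{nines} nine(s): {downtime_min:,.1f} min/year")
    # 1 nine  (90%)     -> ~52,560 min (~36.5 days)
    # 3 nines (99.9%)   -> ~525.6 min (~8.8 hours)
    # 5 nines (99.999%) -> ~5.3 min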


Indeed. Not that you should, but you could leave an ESXi cluster running without maintaining it at all, and it'll probably keep going for five or ten years all on its own. It's stable by design.


I tried that once, only I forgot to position it near the ceiling. In my second year in the flat I had the unlucky experience of high groundwater levels, which flooded the basement. System down and all data lost :(


You're absolutely right, but you're being downvoted because a lot of HN readers' sense of self-worth as engineers is tied up with the idea that the cloud is the right way to do things.


I crash my car much less often than Nascar drivers. They must be terrible drivers.


I've never been in hospital so I don't need to pay health insurance anymore.


I fail to see how a $500 Intel NUC could possibly justify a team of software engineers to maintain it, which means the person responsible for setting it up is not a manager, and their boss isn't someone who manages a manager.

The purpose of an employee is not to jack up the stock price, it's to make their boss look more important. Shareholders are like customers, it pays to have them think you're on their side, but your interests do not always align.



