Ask HN: Do you find working on large distributed systems exhausting?

fxtentacle · on Feb 19, 2022

Yes, I used to,

but No, I fixed it :)

Among other things, I am team lead for a private search engine whose partner-accessible API handles roughly 500 mio requests per month.

I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

But over time, management's "stand on the shoulders of giants" brainwashing wore off so that they actually started to read all the "AWS outage XY" information that we forwarded to them. They started to actually believe us when we said "Nothing we can do, call Amazon!". And then, I found a struggling hosting company with almost compatible tooling and we purchased them. And I moved all of our systems off the public cloud and onto our private cloud hosting service.

Nowadays, people still hold me (at least emotionally) accountable for any issue or downtime, but I feel much better about it :) Because now it actually is withing my circle of power. I have root on all relevant servers, so if shit hits the fan, I can fix things or delegate to my team.

Your situation sounds like you will constantly take the blame for other people's fault. I would imagine that to be disheartening and extremely exhausting.

aeyes · on Feb 19, 2022

I feel that your problems aren't even remotely related to my problems with large distributed systems.

My problems are all about convincing the company that I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user, it is usually some internal system related to data storage or processing which can't cope anymore.

Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog. Immediately you have a million problems like data migration, reporting, data ingestion, making it work with all the related systems like search, recommendations, reviews and so on.

And even if you get the ball rolling you have to work across dozens of different teams which can be hard because naturally people resist change.

Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.

ratww · on Feb 19, 2022

> Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.

I used to work on a Unicorn a few years ago, and this hits close to home. From 2016 to 2020, the pages didn't change one single pixel, however there we had 400 more engineers working on the code and three stack iterations: full-stack PHP, PHP backend + React SSR frontend, Java backend + [redacted] SSR frontend (redacted because only two popular companies use this framework). All were rewrites, and those rewrites were justified because none of them was ever stable, the site was constantly going offline. However each rewrite just added more bloat and failure points. At some point the three of them were running in tandem: PHP for legacy customers, another as main and another on an A/B test. (Yeah, it was a dysfunctional environment and I obviously quit).

axiosgunnar · on Feb 19, 2022

> Yeah, it was a dysfunctional environment and I obviously quit

What do you think could management have done better to make it not dysfunctional and have people quitting?

ratww · on Feb 19, 2022

I think just common sense and less bullshit rationalisation would have been enough.

They had a billion dollars in cash to burn, so they hired more than they needed. They should have hired as needed, not as requested by Masayoshi Son.

They shouldn't be so dogmatic. Some teams were too overworked, most were underworked (which means over-engineering will ensue), but no mobility was allowed because "ideally teams have N people".

They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject. So yeah, our internal tool for putting balloons and clowns into images lived together with the authentication micro-service, because it's the same team.

Rewriting everything twice without analysis was wrong. The rewrites were because previous versions were "too complex" and too custom-made but newer ones had an even more complex architecture, but "this time it's right, software sometimes need complexity".

Believing that some things were terrible would have gone a long way. Launching the main node.js server locally would take 10 to 20 minutes to launch, while something of the same complexity would often take about 2 or 3 seconds. Of course it would blow up in production! Maybe try to fix instead of ordering another rewrite.

They were good people, I miss the company and still use the product, but it didn't need to be like this.

webmaven · on Feb 20, 2022

> They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject.

Where the heck did this come from? AIUI, the ideal is supposed to be one-team-per-service, not one-service-per-team.

ratww · on Feb 20, 2022

It comes from a dogmatic reaction against microservices. Microservices were problematic in certain ways, but instead of analysing what went wrong and why, they just went the opposite direction and started doing "big services only". It was a misguided approach, plain and simple.

Interestingly due to internal bureaucracy and understaffing in some teams, there was a lot of "multiple-teams-per-service", which yeah, is another issue in itself.

akkartik · on Feb 19, 2022

Favorited (https://news.ycombinator.com/favorites?id=akkartik&comments=...)

dasil003 · on Feb 19, 2022

I don't know your specifics, but I have worked on some large scale architecture changes, and 200 engineers + 2 year feature freeze is generally not a reasonable ask. In practice you need to find an incremental path with validation and course correction along the way to limit the amount of concurrent change in flight at any moment. If you don't do this run a very high risk of the entire initiative collapsing under its own weight.

Assuming your estimation is more or less correct and it really is a 400 eng-year project, then you also need political capital as well as technical leadership to make it happen. There are lots of companies where a smart engineer can see a potential path out of a local maximum, but the org structure and lack of technical leadership in the highest ranks means that the problem is effectively intractable.

trhway · on Feb 19, 2022

>I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future

sounds like a typical massive rewrite project. They almost never succeed, many fail outright and most hardly even reach the functionality/performance/etc. level of the stuff the rewrite was supposed to replace. 2-4 years is typical for such glorious attempt before being closed or folded into something else. Management in general likes such projects, and they declare victory usually around 2 years mark and move on on the wave of the supposed success before reality hits the fan.

>to convince anyone to take engineers out of product development.

that means raiding someone's budget. Not happening :) New glorious effort needs new glorious budget - that is what management likes and not doing much more on the same budget as you're basically suggesting (i.e. i'm sure you'll get much more traction if you restate your proposal as "to hire 200 more engineers ..." because that way you'll be putting serious technical foundation for some mid-managers to grow :). You're approaching this as an engineer and thus failing in what is the management game (or as Sun Tzu was pointing out one has to understand the enemy).

fxtentacle · on Feb 19, 2022

My impression has always been that FAANG need lots of engineers because the 10xers refuse to work there. I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing. FAANG instead seem to be more into chasing trends, inventing new frameworks, rewriting to another more hip language, etc.

I would have no idea how to coordinate 200 engineers. But then again, I have never worked on a project that truly needed 50+ engineers.

"Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog." Probably that's 4 friends in a basement, similar to the core Android team ;)

danny_taco · on Feb 19, 2022

Your impression comes from the fact that you have not worked at larger teams, as you said so yourself. It's relatively easy to build something scalable from the beginning if you know what you need to build and if you are not already handling large amounts of traffic and data.

It's a whole different ballgame to build on top of an existing complex system already in production that was made to satisfy the needs at the time it was built but it now needs to support other features, bug fixes and supporting existing features but at scale while having 50+ engineers not step on each other and not break each others code in the process. 4 friends in the basement will not achieve more than 50+ engineers in this scenario, even when considering the inefficiencies of the difficulty in communication that come along with so many minds working on the same thing.

ratww · on Feb 19, 2022

GP said they have never work on something that truly needed 50+ engineers. Truly being the keyword here IMO.

I have worked on a 1000+ engineer project and another that was 500+, but I'm on the same boat as GP. Both of those didn't needed 50+, and the presence of the extra 950/450 caused several communication, organisational and architectural issues that became impossible to fix on the long term.

So I can definitely see where they're coming from.

exikyut · on Feb 19, 2022

I've long wondered what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance.

I'm also admittedly extremely curious what (broadly) had 1000 (and 500) engineers dedicated to it, when arguably only 50 were needed. Abstractly speaking that sounds a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process :/

(Am I describing the patently impossible? Not yet had experience in these types of environments)

ratww · on Feb 19, 2022

> a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process

Yep, that's a fair assessment!

The 1000+ one was an ERP for mid-large businesses. They had 10 or so flagship products (all acquired) and wanted to consolidate it all into a single one. The failure was more on trying to join the 10 teams together (and including lots of field-only implementation consultants in the bunch), rather than picking a solid foundation that they already owned and handpicking what needed.

The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors. People ended up owning parts of a screen, so something that was a "two-man in a sprint" ended up being a whole team. It was demoralising but I still like the company.

I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis. Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.

exikyut · on Feb 20, 2022

Yay, I'm learning :D

> trying to join the 10 teams together

oh no

(insert https://webcomicname.com/ here)

> rather than picking a solid foundation that they already owned and handpicking what needed.

Mmmm.

I wonder if a close alternative (notwithstanding lack of context to optimally calibrate ideas off of) might have involved leaving all the engineers alone to compare notes for 6-12 months with the singular top-down goal of "decide what components and teams do what best." That could be interesting... but it leans very heavily on preexisting competence, initiative and proactivity (not to mention conflict resolution >:D), and is probably a bit spherical-cow...

> The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors.

*Constructs getaway vehicle in spare time* AAAAAaaaaaa

Sad engineering face :<

> I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis.

Noted. Thanks.

> Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.

That makes a lot of sense, and also filed away.

Also, I recently read this which resonates quite strongly with the economy-of-efficiency scale problem (which I totally agree with): https://rachelbythebay.com/w/2022/01/26/swcbbs/, and the update, https://rachelbythebay.com/w/2022/01/27/scale/

ethbr0 · on Feb 19, 2022

> what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance

The biggest thing I've been able to correlate are command styles: imperative vs declarative.

I.e. is management used to telling engineering how to do the work? Or communicating a desired end result and letting engineering figure it out?

I think fundamentally this is correlated with bloat vs lean because the kind of organizations that hire headcount thoughtlessly inevitably attempt to manage the chaos by pulling back more control into the PM role. Which consequently leads to imperative command styles: my boss tells me what to do, I tell you, you do it.

The quintessential quote from a call at a bad job was a manager saying "We definitely don't want to deliver anything they didn't ask for." This after having to cobble together 3/4 of the spec during the project, because so much functionality was missed.

Or in interview question form posed to the interviewer: "Describe how you're told what to build for a new project." and "Describe the process if you identify a new feature during implementation and want to pitch it for inclusion."

exikyut · on Feb 20, 2022

Of course. Wow, I never thought about management like that before. But particularly in software development it makes so much sense for people to jump toward this sort of mindset.

There really is an art to scaling problems to humans so the individual work (across management and engineering) falls within the sweet spot of cognitive saturation. TIL yet another dimension that can go sideways.

The signal to noise ratio is very appreciated.

fragmede · on Feb 19, 2022

Yeah, exactly. There is overhead simply because of the (necessary) cross-communication at that scale, and there's overhead from legacy support, but here's a thought experiment. Imagine that you've built the most perfect system from scratch that you can think of. Fast forward five years, and the business has pivoted so many times that system is doing all sorts of stuff it just wasn't designed for, and it's creaky and old. It just doesn't fit right anymore and even you want to throw it away and build a new one. So you form a tiger team full of the smartest people you know to greenfield build a new one, from scratch, but that's gonna take two years to write. (You think, hey, maybe we could just take this open source thing and adapt it to our purposes. To which I say, where do you think large open source projects come from‽)

How do you bridge the two systems? You build an interim system. But customers want new features, so those features need to be done twice (bridge+new) if you're lucky, three times (existing+interim+new) if not. Could a smaller team of 10x engineers come in and do better? First off, thanks for insulting all of us, as if none of us are 10x-ers. But no. There's simply not enough hours in the day.

We've all heard of large IT projects that failed to land and said "of course". But we don't hear about the huge ones that do. And plenty of them do land, quite succesfully, with these 200+ person teams where I, as an SRE, don't know the code for the system I'm supporting.

None of this is visible from the outside.

aij · on Feb 19, 2022

> I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing.

There is huge difference between building a system that could theoretically be scaled up and actually scaling it up efficiently.

At small scales, it's really easy to build on the work of others and take things for granted without even knowing where the scaling limits are. For example, if I suddenly find I need to double my data storage capacity, I can drive to a store and come back with a trunk full of hard drives the same day. I can only do that because someone already build the hard drives, and someone stocked the nearby stores with them. If a hyperscaler needs to double their capacity, they need to plan it well in advance, allocating a substantial fraction of global hard drive manufacturing capacity. They can't just assume someone would have already built the hardware, much less have it in stock near where it's needed.

danielmarkbruce · on Feb 20, 2022

Which FAANG is rewriting to another hip language and chasing trends (especially when it comes to infra services??)? I don't mean to be rude, but it doesn't sound like you are talking about any of the FAANGs, this sounds completely made up.

fxtentacle · on Feb 20, 2022

https://blog.pragmaticengineer.com/uber-app-rewrite-yolo/

danielmarkbruce · on Feb 21, 2022

FAANG is an acronym for Facebook, Amazon, Apple, Netflix, Google. Uber isn't in the same ballpark as those companies (arguably Netflix isn't really in the same ballpark as the other four either...).

andai · on Feb 19, 2022

Heh, I wish they still looked the same. They added an order of magnitude of HTML and JS bloat while removing functionality.

pojzon · on Feb 19, 2022

Had that issue in my previous job.

Higher management decided to migrate our properitary vendor locked platform from one cloud provider to the other one. Majority of migration fell on a single platform team that was constantly struggling with attricion.

Unfortunately I was not able (neither our architects) to explain the higherups that we need bigger team and overall way more resources to pull that off.

Hope that someone that comes after me will be able to make the miracle happen.

notimetorelax · on Feb 19, 2022

I usually move on to a different project/team/company when it gets to this. E.g. my new team builds a new product that grows like crazy and has its own set of challenges. I prefer to be deliver immediate customer value vs. long term hard to sell and hard to project the value work.

ClumsyPilot · on Feb 19, 2022

"That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user"

It seems to be the same story in fiels of Infrastructure maintenance, Aircraft design (boeing Max), and mortgage CDOs (2008). Was it always like this or the new management doesn not care untill something explodes?

imachine1980_ · on Feb 19, 2022

a manufacturing company is designed the ground up to works whit machine but isn't the same whit software, is hard to understand that triple data isn't only triple server but a totaly different software stack, and exponentially more complex is not only put more factories like textile.

fragmede · on Feb 19, 2022

There's still order of magnitude change analogies to real world processes, if people are willing to listen (which is the hard part). Use something that everybody can understand, like making pancakes or waffles or an omelet. Going from making 1 by hand, every 4 minutes at home for your family, to 1,000 pancakes per minute at a factory is obviously going to take a better system. You can scale horizontally, and do the equivalent of putting more VMs behind the load balancer, and hire 4,000+ people to cook, but you still need to have/make that load balancer in the first place for even that to work.

That's the tip of iceberg when going from 1 per 4 minutes to 1,000 per minute though. How do you make and distribute enough batter for that system, and plating and serving that is going to take a pub/sub bus, err, conveyor belt to support the cooks' output. Again though, you still gotta make that kafka queue, err, conveyor belt, plus the maintenance for that is going to a team of people if you need the conveyor belt to operate 24/7/52. If your standards are so high that the system can never go down for more than 52.6 minutes per year or 13.15 minutes per quarter, then that team needs to consist of highly-trained and smart (read: expensive) people to call when the system breaks in the middle of the night.

nostrebored · on Feb 19, 2022

You had problems with management of a cloud based api and executive visibility… so you bought a set of data centers to handle 500mio req per month?

The visibility you will get after the capex when there’s a truly disastrous outage will be interesting.

NavinF · on Feb 19, 2022

Hmm that’s only 190Hz on average, but we don’t know what kind of search engine it is. For example if he’s doing ML inference for every query, it would make perfect sense to get a few cabinets at a data center. I’ve done so for a much smaller project that only needs 4 GPUs and saved a ton of money.

fxtentacle · on Feb 19, 2022

Nah, it's text-only requests returning JSON arrays of which newspaper article URLs mention which influencer or brand name keyword.

The biggest hardware price point is that you need insane amounts of RAM so that you can mmap the bloom hash for the mapping from word_id to document_ids.

winrid · on Feb 19, 2022

You could have used a sharded database like Mongo. Just throw up 10 shards, use "source" (influencer or brand name) as shard key?

fxtentacle · on Feb 19, 2022

Yes, I could have used Mongo, but it would have been 100x to 1000x slower than an mmap-ed look up table.

nostrebored · on Feb 19, 2022

Why ever use mmap instead of sharded inverted indices of word-doc here, a la elasticsearch?

winrid · on Feb 19, 2022

Yeah the question is what level of performance you need I guess... was hoping you could clarify :)

joshuamorton · on Feb 19, 2022

But you don't actually need that level of performance? You've made this system more complex and expensive to achieve a requirement that doesn't matter?

shoo · on Feb 19, 2022

you seem to have a deeper knowledge of the business & organisational context that dictate the true requirements than someone working there. please share these details so we can all learn!

joshuamorton · on Feb 19, 2022

Sure: the network request time of a person making a request over the open internet is going to be an order of magnitude longer than a DB lookup (in the right style, with a reverse-index) on the scale of data this person is describing. So making the lookup 10x faster saves you...1% of the request latency.

And at the qps they've described, it's not a throughput issue either. So I'm pretty confident in saying that this is a case of premature optimization.

And at some point the increase in parallelization of scans dominates mmap speed, unless you're redundantly sharding your mmaped hash table across multiple machines. And there are cases where network bandwidth is the bottleneck before disk bandwidth, though probably not this case. But yeah basically, the answer is something like "if this is the optimal choice, it probably didnt matter that much".

fxtentacle · on Feb 20, 2022

This reads to me as if you have never really used mmap in a dedicated C/C++ application. Just to give you a data point, looking up one word_id in the LUT and reading 20 document_ids from it takes on average 0.0000015 ms.

So if that alternative database takes on average 0.1ms per index read, then it's starting out roughly 65000x slower.

"than a DB lookup (in the right style, with a reverse-index)"

Unless, of course, you're managing petabytes of data ;)

"at the qps they've described, it's not a throughput issue either"

It's mostly a cost thing. If a single request takes 2x the time, that's also a 2x on the hosting bill.

"parallelization of scans dominates mmap speed"

Yes, eventually that might happen. Roughly when you have 100000 servers. But before that your 10gbit/s node-to-node link will saturate. Oops.

joshuamorton · on Feb 21, 2022

> Unless, of course, you're managing petabytes of data ;)

Are...are you saying that you've purchased petabyte(s) of RAM, and that that multi-million dollar investment is somehow cheaper than...well really anything else?

> But before that your 10gbit/s node-to-node link will saturate. Oops.

Only if you're returning dense results, which it sounds like you aren't (and there are ways to address this anyhow), which is why I said the issue of saturating network before disk probably wasn't an issue for you ;)

fxtentacle · on Feb 21, 2022

No, of course I have a tiered architecture. HDDs + SSDs + RAM. By mmap-ing the file, the Linux kernel will make sure that whatever data I access is in RAM and it'll do best-effort pre-reading and caching, which works very well.

BTW, this is precisely how "real databases" also handle their storage IO internally. So all of the performance cost I have to pay here, they have to pay, too.

But the key difference is that with a regular database and indices, the database needs to be able to handle read and write loads, which leads to all sorts of undesirable trade-offs for their indices. I can use a mathematically perfect index if I split dataset generation off of dataset hosting.

It's really quite difficult to explain, so I'll just redirect you to the algorithms. A regular database will typically use a B-tree index, which is O(log(N)). I'm using a direct hash bucket look-up, which is O(1).

For a mental model, you can think of "mmap" as "all the results are already in RAM, you just need to read the correct variable". There is no network connection, no SQL parsing, no query planning, no index scan, no data retrieval. All those steps would just consume unnecessary RAM bandwidth and CPU usage. So where a proper DB needs 1000+ CPU cycles, I might get away with just 1.

winrid · on Feb 21, 2022

No modern DB uses mmap because it's unreliable and hard to tune for performance.

A custom cache manager will always perform better than mmap provided by the kernel.

The problem is you haven't explained how the overhead of a DB is too much. Sure, it sounds like a lot of work for your servers and the DB compared to reading from a hashmap.

Where I work right now we fire around 1.5B queries a day... to Mongo.

nostrebored · on Feb 21, 2022

And have your unreliable, iconsistent, unscalable system. That apparently goes down all the time.

Not using ES here is actually nuts.

winrid · on Feb 20, 2022

Are you managing petabytes of data though?

What kind of servers are you running? What's your max QPS?

The fact is with your mmap impl. you probably use ram + virtual memory, and have more ram than needed to compensate for the fact that you don't keep the most used keys in memory, which a DB will do for you.

Point is if you have petabytes of data and access patterns only mean you access a subset of it, even Mongo might be cheaper to run.

fxtentacle · on Feb 21, 2022

Just FYI, MongoDB storage also uses mmap internally.

So we are comparing here "just mmap" with "mmap + all that connection handling, query parsing, JSON formatting, buffering, indexing, whatever stuff that MongoDb does".

And no, MongoDB is effectively never a cheap solution. They are used because they are super convenient to work with, with all things being JSON documents. But all that conversion to and from JSON comes at a price. It'll eat up 1000s of CPU cycles just to read a single document. With raw mmap, you could read 1000s of documents instead.

jd_mongodb · on Feb 21, 2022

MongoDB uses the Wired Tiger storage engine internally. The MMAP storage engine was removed from MongoDB in V4.2 which was released in March 2020. The MMAP engine was deprecated two years previously.

In MongoDB conversion to and from raw JSON into BSON (Binary JSON) is done on the client (aka driver) so the server cycles are not consumed.

winrid · on Feb 21, 2022

And 2 you're looking past the point. Any DB work work fine for this use case. If you wanted sharding, there's vitess for mysql, for example.

winrid · on Feb 21, 2022

As another already said, Mongo doesn't use mmap anymore.

Mongo doesn't convert to and from JSON. The driver uses a binary protocol.

Damogran6 · on Feb 19, 2022

As a security guy I HATE the loss of visibility in going to the cloud. Can you duplicate it? Sure. Still not as easily as spanning a trunk and you still have to trust what you’re seeing to an extent.

nostrebored · on Feb 19, 2022

The visibility I was mentioning in the parent comment was visibility from executives in your business, but I can see how it would be confusing.

There are tradeoffs — cloud removes much of the physical security risks and gives you tools to help automated incident detection. Things like serverless functions let you build out security scaffolding pretty easily.

But in exchange you do have to give some trust. And I totally understand resistance there.

justinclift · on Feb 19, 2022

> cloud removes much of the physical security risks

Doesn't cloud increase the physical security risks, rather than decrease/remove?

fxtentacle · on Feb 19, 2022

You might be surprised. The performance equivalent of $100k monthly in EC2 spend fits into a 16m2 cage with 52HU racks.

_iziv · on Feb 19, 2022

Which costs you more than $100k monthly to operate with the same level of manageability and reliability.

We don't use AWS, because our use cases don't require that level of reliability and we simply cannot afford it, but if I needed a company to depend on IT that generates enough revenue... I probably wouldn't argue about the AWS bill. So long, prepaid at hetzner + in-house works good enough, but I know what I cannot offer with the click of a button to my user!

Spooky23 · on Feb 19, 2022

This is a religious debate among many. The IT/engineering nerd stuff doesn’t matter at all. Cloud migration decisions are always made by accounting and tax factors.

I run two critical apps, one on-prem and one cloud. There is no difference in people cost, and the cloud service costs about 20% more on the infrastructure side. We went cloud because customer uptake was unknown and making capital investments didn’t make sense.

I’ve had a few scenarios where we’ve moved workloads from cloud to on-prem and reverse. These things are tools and it doesn’t pay to be dogmatic.

sdoering · on Feb 19, 2022

> These things are tools and it doesn’t pay to be dogmatic.

I wish I would hear this line more often.

So many things today are (pseudo-) religious now. The right frsmework/language, cloud or on prem, x vs not x.

Especially bad imho when somebody tries to tell you how you could do better with 'not x' instead of x you are currently using without even trying to understand the context this decision resides in.

[Edit] typo

qorrect · on Feb 19, 2022

> So many things today are (pseudo-) religious now. The right frsmework/language, cloud or on prem, x vs not x.

Might have always been that way? We just have so many more tools to argue over now.

dekhn · on Feb 19, 2022

that cage is a liability, not an asset. How is the networking in that rack? What's its connection to large-scale storage (IE, petabytes, since that's what I work with). What happens if a meteor hits the cage? Etc.

fxtentacle · on Feb 19, 2022

That depends on what contracts you have. You could have multiple of these cages in different locations. Also, 1 PB is only 56 large enterprise HDDs. So you just put storage into the cage, too.

But my point wasn't about how precisely the hardware is managed. My point was that with a large cloud, a mid-sized company has effectively NO SUPPORT. So anything that gives you more control is an improvement.

dekhn · on Feb 19, 2022

"1 pb is only 56 large enterprise hdds".

umm, what happens when one fails?

With large cloud my startup had excellent support. We negotiated a contract. That's how it works.

fxtentacle · on Feb 19, 2022

Typically people use RAID or ZFS to prevent data loss when a few hdds fail.

dekhn · on Feb 19, 2022

OK, so basically you're in a completely different class of expectations about how systems perform under disk loss and heavy load then me. A drive array is very different from large-scale cloud storage.

fxtentacle · on Feb 19, 2022

Hard to say. My impression is:

- A large ZFS pool of SSDs is much faster than any cloud storage.

- Cloud storage failed much more often than the SSDs in our pool.

- "Noisy neighbor" is an issue on the cloud

qorrect · on Feb 19, 2022

This cracked me up. Thanks fxtentacle :D.

dekhn · on Feb 21, 2022

of course, the reason that's wrong is that if one drive fails you don't have a 56pb storage system, you have something smaller because of redundancy.

That redundancy, and the performance that scales due to it, place cloud services in an entirely different class from on prem servers.

ckdarby · on Feb 19, 2022

>I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

If the business can't afford to have downtime then they should be paying for enterprise support. You'll be able to connect to someone in < 10 mins and have dedicated individuals you can reach out to.

jerjerjer · on Feb 19, 2022

You never hosted on AWS, did you?

0x445442 · on Feb 19, 2022

In the two years I worked on serverless AWS I filed four support tickets. Three out of those four I came up with the solution or fix on my own before support could find a solution. The other ticket was still open when I left the company. But the best part is when support wanted to know how I resolved the issues. I always asked how much they were going to pay me for that information.

ckdarby · on Feb 19, 2022

>You never hosted on AWS, did you?

Previously 2k employee company, with the entire advertising back office on AWS.

Currently >$1M YR at AWS, you can get the idea of scale & what is running, here: https://www.youtube.com/playlist?list=PLf-67McbxkT6iduMWoUsh...

phillu · on Feb 19, 2022

Enterprise Support never disappointed me so far. Maybe not <10 minute response time, but we never felt left alone during an outage. But I guess this is also highly region/geo dependent.

FpUser · on Feb 19, 2022

>"they should be paying for enterprise support"

This sounds a bit arrogant. I think they found better and overall cheaper solution.

ckdarby · on Feb 19, 2022

>This sounds a bit arrogant.

The parent thread talks about how the business could not go down even with a triple AZ outage for S3, and I don't think it is arrogant to state they should be paying for enterprise support if that level of expectation is set.

>I think they found better and overall cheaper solution.

Cheaper solution does not just include the cost but also the time. For the time we need to look at the time they spent regardless of department to acquire, migrate off of AWS, modifying the code to work for their multi-private cloud, etc. I'd believe it if they're willing to say they did this, have been running for three years, and compiled the numbers in excel. It is common if you ask internally was it worth it to get a yes because people put their careers on it and want to have a "successful" project.

The math doesn't work out in my experiences with clients in the past. The scenarios that work out are, top 30 in the enitre tech industry, significant GPU training, egress bandwidth (CDN, video, assets), or business that are selling basically the infrastructure (think Dropbox, Backblaze, etc.).

I'm sure someone will throw down some post where their cost, $x is less than $y at AWS, but that is such a tiny portion that if the cost is not >50% it isn't even worth looking at the rest of the math. The absolute total cost of ownership is much harder than most clickbait articles are willing to go into. I have not seen any developers talk about how it changes the income statement & balance sheet which can affect total net income and how much the company will lose just to taxes. One argument assumes that it evens out after the full amortization period in the end.

Here are just a handful of factors that get overlooked, supply chain delays, migration time, access to expertise, retaining staff, churn increase due to pager/call rotation, opportunity cost of to capital being in idle/spare inventory and plenty more.

fxtentacle · on Feb 19, 2022

Back then, it was enough to saturate the S3 metadata node for your bucket and then all AZs would be unable to service GET requests.

And yes, this won't be financially useful in every situation. But if the goal is to gain operational control, it's worthwhile nonetheless. That said, for a high-traffic API, you're paying through the nose for AWS egress bandwidth, so it is one of those cases where it also very much makes financial sense.

ckdarby · on Feb 19, 2022

Same fxtenatcle as CTO of ImageRights? If that is the case my follow up question is did you actually move everything out of AWS? Or did you just take the same Netflix approach like Open Connect for the 95th billing + unmetered & peering with ISPs to reduce.

FpUser · on Feb 19, 2022

So you basically saying that no matter what one should always stick to Amazon. I have my own experience that tells exactly the opposite. To each their own. We do not have to agree.

ckdarby · on Feb 19, 2022

>So you basically saying that no matter what one should always stick to Amazon.

What I am saying is given the list of exceptions I gave the business should run/colocate their gear if they're in the exception list or those components that fall in the exception list should be moved out.

>I have my own experience that tells exactly the opposite.

You begin using AWS for your first day ever and on that day it has a tri AZ outage for S3. In this example the experience with AWS has been terrible. Zooming out though over 5 years it wouldn't look like a terrible experience at all considering outages are limited and honestly not that frequent.

FpUser · on Feb 19, 2022

>"You begin using AWS for your first day ever"

I am not talking about outages here. Bad things can happen. More like a price.

BossingAround · on Feb 19, 2022

I don't read that as arrogant. The full statement is:

> If the business can't afford to have downtime then they should be paying for enterprise support.

It's simply stating that it's either cheaper for business to have downtime, or it's cheaper to pay for premium support. Each business owner evaluates which is it for them.

If you absolutely can't afford downtime, chances are premium support will be cheaper.

jqgatsby · on Feb 19, 2022

@fxtentacle, I was curious which private search engine this is for. Is the system you are describing ImageRights.com?

fxtentacle · on Feb 19, 2022

No, ImageRights is much more requests and mostly images. Also, at ImageRights I don't have management above me that I would need to convince :)

This one is text-only and used by influencers and brands to check which newspapers report about their events. As I said, it's internally used by a few partner companies who buy the API from my client and sell news alerts to their clients.

BTW, I'm hoping to one day build something similar as an open source search engine where people pay for the data generation and then effectively run their own ad-free Google clone, but so far interest has been _very_ low:

https://news.ycombinator.com/item?id=30374611 (1 upvote)

https://news.ycombinator.com/item?id=30361385 (5 upvotes)

EDIT: Out of curiosity I just checked and found my intuition wrong. The ImageRights API averages 316rps = 819mio requests per month. So it's not that much bigger.

mmcnl · on Feb 19, 2022

If you rely on public cloud infrastructure, you should understand both the advantages and disadvantages. Seems like your company forgot about the disadvantages.

briandilley · on Feb 19, 2022

What i read here was "Cloud is hard, so I took on even more responsibility"

fxtentacle · on Feb 19, 2022

What you should read is: At the monthly spend of a mid-sized company, it is impossible to get phone support from any public cloud provider.

ddorian43 · on Feb 19, 2022

What are you using for aws alternatives? Example for S3?

ckdarby · on Feb 19, 2022

>What are you using for aws alternatives? Example for S3?

Not OP but they're probably using Rook/Minio

fxtentacle · on Feb 19, 2022

docker + self-developed image management + CEPH

flyinglizard · on Feb 19, 2022

Care to share uptime metrics on AWS vs your own servers?

fxtentacle · on Feb 19, 2022

That wouldn't be much help because the AWS and Heroku metrics are always green, no matter what. If you can't push updates to production, they count that as a developer-only outage and do not deduct it from their reported uptime.

For me, the most important metric would be time that me and my team spent fixing issues. And that went down significantly. After a year of everyone feeling burned out, now people can take extended vacations again.

One big issue for example was the connectivity between EC2 servers degrading, so that instead of the usual 1gbit/s they would only get 10mbit/s. It's not quite an outage, but it makes things painfully slow and that sluggishness is visible for end users. Getting reliable network speeds is much easier if all the servers are in the same physical room.

solatic · on Feb 19, 2022

What do you find exhausting?

One anti-pattern I've found is that most orgs ask a single team to handle on-call around the clock for their service. This rarely scales well, from a human standpoint. If you're getting paged at 2:00 in the morning on a regular basis you will start to resent it. There's not much you can do about that so long as only one team is responsible for uptime 24/7.

The solution is to hire operations teams globally, and then setup follow-the-sun operations whereby the people being paged are always naturally awake at that hour, and allows them to work normal eight hour shifts. But this requires companies to, gasp, have specialized developers and specialized operators collaborate before allowing new feature work into production, to ensure that the operations teams understand what the services are supposed to do and keep it all online. It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.

So naturally, many orgs would prefer to burn out their engineers instead.

jedberg · on Feb 19, 2022

I would respectfully say that you are wrong. I speak from experience. At Netflix we tried to hire for around the clock coverage. But what ended up working much better was taking that same team and having each person on call for a week at a time, all based in Pacific Time.

Yes, you would get calls at 2am, sometimes multiple days in a row. But you were only on call once every six to eight weeks, and we scheduled out well in advance so you could plan your life accordingly.

As a bonus, for the five weeks you weren't on call, you were highly incentivized (and had the time) to build tools or submit patches to fix the problems that woke you at 2am.

> It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.

I disagree with this too. Documentation and runbooks are useless in an outage. Instead of runbooks, write code to do the thing. Instead of documentation, comment the code and build automation to make the documentation unnecessary, or at least surface the right information automatically if you can't automate it.

deanc · on Feb 19, 2022

This is the same approach as night shifts for nurses.

There’s a lot of evidence to suggest that the effects on this infrequent but consistent disturbance to their circadian rhythms causes all kinds of physiological damage. One example [1]. We have to do better. I think the original suggestion of finding specialised night workers or those in other timezones is more humane.

[1] https://blogs.cdc.gov/niosh-science-blog/2021/04/27/nightshi...

jedberg · on Feb 19, 2022

That article is about night shift work, not day shift work that occasionally makes you work an hour or two at night every six weeks.

toomuchtodo · on Feb 19, 2022

Here is a reference that is a bit more attributable to the on call experience. There is a tangible human cost to after hours responses during an on call rotation. I personally do not recommend on call roles to any technology professional who can avoid them due to these health consequences of an on call requirement.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5449130/

> Sleep plays a vital role in brain function and systemic physiology across many body systems. Problems with sleep are widely prevalent and include deficits in quantity and quality of sleep; sleep problems that impact the continuity of sleep are collectively referred to as sleep disruptions. Numerous factors contribute to sleep disruption, ranging from lifestyle and environmental factors to sleep disorders and other medical conditions. Sleep disruptions have substantial adverse short- and long-term health consequences. A literature search was conducted to provide a nonsystematic review of these health consequences (this review was designed to be nonsystematic to better focus on the topics of interest due to the myriad parameters affected by sleep). Sleep disruption is associated with increased activity of the sympathetic nervous system and hypothalamic–pituitary–adrenal axis, metabolic effects, changes in circadian rhythms, and proinflammatory responses. In otherwise healthy adults, short-term consequences of sleep disruption include increased stress responsivity, somatic pain, reduced quality of life, emotional distress and mood disorders, and cognitive, memory, and performance deficits. For adolescents, psychosocial health, school performance, and risk-taking behaviors are impacted by sleep disruption. Behavioral problems and cognitive functioning are associated with sleep disruption in children. Long-term consequences of sleep disruption in otherwise healthy individuals include hypertension, dyslipidemia, cardiovascular disease, weight-related issues, metabolic syndrome, type 2 diabetes mellitus, and colorectal cancer. All-cause mortality is also increased in men with sleep disturbances. For those with underlying medical conditions, sleep disruption may diminish the health-related quality of life of children and adolescents and may worsen the severity of common gastrointestinal disorders. As a result of the potential consequences of sleep disruption, health care professionals should be cognizant of how managing underlying medical conditions may help to optimize sleep continuity and consider prescribing interventions that minimize sleep disruption.

magicalhippo · on Feb 19, 2022

> But what ended up working much better was taking that same team and having each person on call for a week at a time, all based in Pacific Time.

Our support team does the same, and they seem to be quite happy with it. They also get the following Friday off (in addition to compensation).

They do their best to shield us developers from after-hour calls, usually one can get things moving enough that it can be handled properly in the morning.

d0gsg0w00f · on Feb 19, 2022

Even as a dedicated operations team for a product, we did this too. On call person worked tickets and took calls for one week at a time, the rest of the team worked on ways to make on-call suck less. For an eight person team it worked well for about three years until bigger stuff happened in the org and we all parted ways.

uselessextras · on Feb 20, 2022

I agree with you completely, especially on the last paragraph. No pain - no gain.

solatic · on Feb 19, 2022

> you were highly incentivized (and had the time) to build tools or submit patches to fix the problems that woke you at 2am.

Ah, so you worked on a team where the SRE needs were prioritized over the feature requests? Because in most companies where I've worked, Product + Customer Service + Sales + Marketing + Executives don't really have time or patience for the engineers to get their diamond polishing cloths out. They want to see feature development. They're willing to be forced to prioritize exactly which feature they'll get soonest, and they understand that engineering needs time to keep the systems running, but in most businesses I've worked, the business comes first.

> Documentation and runbooks are useless in an outage. Instead of runbooks, write code to do the thing. Instead of documentation, comment the code and build automation to make the documentation unnecessary

We do that too. If you could write code to Solve All The Problems then you'd never need to page a human in the first place ;)

I'll give you a simple example of where you can't write code to solve this sort of thing. Let's say that you have an autoscaler that will scale your server group up to X servers. You define an alert to page you if the autoscaler hits the maximum. The page goes off. Do you really want to write code to arbitrarily increase the autoscaler maximum whenever it hits the maximum? Why do you have the maximum in the first place? The entire reason why the autoscaler maximum exists is to prevent cost overruns from autoscaling run amok. You want a human being, not code, to look at the autoscaler and make the decision. Do you have steady-slow growth up to the maximum? Maybe it should be raised, if it represents natural growth. Maybe it shouldn't, if you just raised it last week and it shouldn't be anywhere near this busy. Do you have hockey-stick growth? Maybe the maximum is working as expected, looks like a resource leak hit production. Or maybe you have a massive traffic hit and you actually do want to increase the maximum. Maybe you'd prefer to take the outage from the traffic hit, let the 429s cool everyone off. But good luck trying to write code to handle that automatically, and correctly for you!

> or at least surface the right information automatically if you can't automate it.

Ah, well, that's exactly what the dedicated operations staff are doing, because when you have three follow-the-sun teams, you need standards, not three sets of people who each somehow telepathically share the same tribal knowledge?

Don't get me wrong, I'm not anti-automation or something. If your operations folks are click-clicking in consoles all day long, the same click-clicking every day, probably something's wrong. But the SRE model asks for operations automation to stick within operations teams, not development teams.

jedberg · on Feb 19, 2022

> Ah, so you worked on a team where the SRE needs were prioritized over the feature requests?

Yes, it was an SRE team. All we do is write tools to make operations better, but more importantly we write tools to make it easier for the dev teams to operate their own systems better. But yes, we had products teams that would push back on our requests because they had product to deliver, and that was fine. We'd either figure out how to do the work for them, or figure out a workaround.

> We do that too. If you could write code to Solve All The Problems then you'd never need to page a human in the first place ;)

Well yes, that's the idea. You can't get to 5 9s of reliability unless it's all automated. :)

> I'll give you a simple example of where you can't write code to solve this sort of thing.

I could easily write code to solve the thing. Step one, double the limit to alleviate immediate customer pain. Step two, page someone to wake up and look at the graphs and figure out what the better medium term solution is to get us through until the morning, including links to said relevant graphs.

You're not gonna have a cost overrun doubling the limit for one night. And if there is a big problem, the person will get paged again a few hours later and have more information to make a better decision.

> But the SRE model asks for operations automation to stick within operations teams, not development teams.

Yes, but I'm not sure I see why that's bad. I don't see any purpose for a dedicated operations team, especially a follow the sun team. If you're Google and you already have offices all around the world, sure, it will be better. But it makes no sense to hire an around the world team just for operations if the rest of your company is in one time zone.

solatic · on Feb 20, 2022

> Yes, it was an SRE team. All we do is write tools to make operations better

Go back to my original comment. If you're an SRE team, then basically, you're the operations team for the developers. I'm talking about where developers are responsible for their own operations and there is no team that gets paged instead of them - "most orgs ask a single team to handle on-call around the clock for their service."

> Step one, double the limit to alleviate immediate customer pain. Step two, page someone to wake up

See, what I read from this is: a) violate my system efficiency KPIs while b) paging someone in the middle of the night anyway. So, lose-lose.

> But it makes no sense to hire an around the world team just for operations if the rest of your company is in one time zone.

Why does it make any more sense to hire developers remotely who are in your time zone ± three hours? Because that's what most companies are doing these days. If you're already hiring people remotely then you can hire Operations/SRE staff a little further afield and see that as a benefit (follow the sun) rather than a problem.

> the rest of your company is in one time zone.

For what it's worth, we also hire salespeople around the globe :) Fact of the matter is, it would be so, so nice for Slack to turn off the ability to @channel in the #random channel so that people who are asleep don't get pinged ...

jedberg · on Feb 20, 2022

We were an SRE team building tools for the development teams who got paged in the middle of the night. The devs writing the services were operating their own services and were getting paged. We would sometimes also get paged for a serious incident so we could coordinate if multiple development teams were involved.

Each team managed their own rotation schedules, we just made sure they had one.

> See, what I read from this is: a) violate my system efficiency KPI

If you're being graded on your system efficiency and not customer satisfaction, well then sure, your way might make sense (but I'd still say it doesn't). But your business will suffer if you optimize for efficiency over customer satisfaction.

> Why does it make any more sense to hire developers remotely who are in your time zone ± three hours?

Because it's a lot easier to run a team where everyone on the team can meet at the same time. If you have an around the world team, there is no time of day where you can have a meeting and everyone gets to attend during their workday. Realistically you can maybe get away with a nine hour time difference. Any more than that and you have people excluded.

Especially if the bulk of your devs are in one or two time zones, your operators will be even more disconnected from them since they will never be able to interact with the devs, and the devs will have no empathy for the operators who they also never interact with.

> For what it's worth, we also hire salespeople around the globe

Sure, but they aren't writing code that your operators have to run. :)

I think we both agree that it's better for devs to get paged for their services instead of operators, and if that's the case, its far better for all the devs to work together and know each other and be in the same or nearly same time zone.

A follow the sun model breaks that completely.

solatic · on Feb 20, 2022

> But your business will suffer if you optimize for efficiency over customer satisfaction.

But who are the customers? Business, engineering, or finance? :)

> it's a lot easier to run a team where everyone on the team can meet at the same time.

Of course it's easier. It's also easier not to maintain documentation or standards, just be a five person startup and have everyone be in the same room. Enterprise communication is hard! Even when you're in the same time zone. The question isn't "how do I get my life to be a utopia?" but "which challenges should I choose?". If you run an organization, you need to put your employees first, even ahead of your customers. Employees and customers both come and go but 80% of the time the effect of an valued employee leaving is far worse than a customer leaving, and you have far more control over whether employees leave than whether customers do. So you can either put your employees first (build a calm workplace) or you can put your customers first (prioritize feature development velocity in organizational design).

> I think we both agree that it's better for devs to get paged for their services instead of operators

No! Dev should never be paged! If I "buy" Jenkins off-the-shelf, and it breaks down in production, guess what, I don't get to page the Jenkins developers! Why should internally developed services be any different? If Ops needs to page someone from Dev instead of waiting for a response at normal business cadence, then this is an Ops failure, not a Dev failure!

jedberg · on Feb 20, 2022

> But who are the customers? Business, engineering, or finance? :)

The business's customers. The ones who pay your company so they can pay you, and your reason for having a job at all.

> Why should internally developed services be any different?

Because they're your core competency and you have control over it. If you could page the Jenkins developers you probably wouldn't hesitate to do it, because you'll get better results. Why not get the best results you can from an internal service?

> If Ops needs to page someone from Dev instead of waiting for a response at normal business cadence, then this is an Ops failure, not a Dev failure!

I couldn't disagree more. That is absolutely a dev failure -- they wrote a service that couldn't operate under the conditions given. It's either a bug or an architecture issue, but no matter what, it's a dev issue and the dev should be responsible for building a service that can actually run in production.

You and I have very different ideas of a successful engineering organization. I would never want to work for your org as an operator or a dev. As an operator the last thing I want is devs to throw whatever they write over the wall and then say "not my problem anymore!", and have to rely on getting retrained every time the code changes. And as a dev I wouldn't want to be in an organization that accepts sloppy developers who aren't responsible for building solid code that can run under adverse conditions and who don't get to experience the issues in production for themselves.

Facebook makes their devs get paged, Netflix does, Amazon pages their devs, Dropbox pages devs, Stripe pages devs, and Google pages their devs too until they have demonstrated multiple quarters of success, and only then does an operator take over. And if the service has too many failures, support falls back on the devs until they can make the service stable again.

Making devs responsible for creating code that actually works well in production is a good thing.

solatic · on Feb 20, 2022

> As an operator the last thing I want is devs to throw whatever they write over the wall and then say "not my problem anymore!", and have to rely on getting retrained every time the code changes. And as a dev I wouldn't want to be in an organization that accepts sloppy developers who aren't responsible for building solid code that can run under adverse conditions and who don't get to experience the issues in production for themselves.

How can you classify anybody who writes on-prem software as being a "sloppy developer"? Jira, Jenkins, GitLab, pretty much any database you can imagine (MySQL, PostgreSQL, Redis, Elasticsearch, Kafka...), Grafana, any Linux distribution, they're all written by "sloppy developers"?

Where did I say that Dev gets to "throw code over the wall"? How would you feel if I unilaterally decided for you, as a developer, which tools you get to use? If I came up with some policy that the whole organization can only run Windows machines and I "threw that policy over the wall" at you?

You're arguing against a strawman that's completely inconsistent with how harmonious follow-the-sun Ops actually works.

jedberg · on Feb 20, 2022

I would not call follow the sun ops as harmonious. If anything I'd call it adversarial. Ops is always trying to blame dev for outages and dev is always trying to blame ops. Each accuses the other of not sharing all the necessary information.

Look at all that on-prem software you just mentioned. The developers of every one of those complain that they need better bug reports, and the people who operate them complain they need better documentation. Things would be much better if those devs worked directly for every company that uses them, and in fact in a lot of cases one of the contributors is an operator at a company. Why do you think companies like to hire open source devs? To get better access to someone who knows the codebase!

It's far better if the operator is the developer. Sometimes we live with that not being the case because the software is made by others. But when given the choice, I will always opt for the dev running the software themselves.

aij · on Feb 19, 2022

> Step one, double the limit to alleviate immediate customer pain.

I've been oncall for systems where that would not work.

Doubling the memory means you need twice as many machines. Depending on the service, that could require significantly increased network bandwidth. Now the network is saturated and every node needs to queue more data. Now latency and throughput are even worse, and even more requests are being dropped, so you automatically double the limit again...

jedberg · on Feb 19, 2022

While that all may be true (but are indications of a poorly architected system), my code would still work. It would double the limit and then page someone. If they logged in and saw all those failures, then they could address those issues.

The whole point is that having an around the world follow the sun team would not alleviate those issues or make anything better.

mavelikara · on Feb 20, 2022

> You want a human being, not code, to look at the autoscaler and make the decision.

Should this decision happen at 2am? Can it wait until 10am?

notacoward · on Feb 19, 2022

This. Absolutely this. Working on large distributed system can be both exhilarating and exhausting. The two often go hand in hand. However, working on such systems without diligence tips the scales toward exhausting. If your testing and your documentation and your communication (both internal and with consumers) suck, you're in for a world of pain.

"But writing documentation is a waste of time because the code evolves so fast."

Yeah, I hear that, but there's also a lot of time lost to people harried during their on-call and still exhausted for a week afterward, to training new people because the old ones burned out or just left for greener pastures, to maintaining old failed experiments because customers (perhaps at your insistence) still rely on them and backing them out would be almost as much work than adding them was, and so on.

That's not really moving fast. That's just flailing. You can actually go further faster if you maintain a bit of discipline. Yes, there will still be some "wasted" time, but it'll be a bounded, controlled waste like the ablative tiles on a re-entry vehicle - not the uncontrolled explosion of complexity and effort that seems common in many of the younger orgs building/maintaining such systems nowadays.

bckr · on Feb 19, 2022

> That's not really moving fast. That's just flailing.

Yes, a million times yes. This is moving me. Where do I find a team that understands this wisdom?

grogers · on Feb 19, 2022

The solution to get paged at off hours a lot is rarely to hire additional teams to cover those times for you, at least not long term. For things you can control, you should fix the root causes of those issues. For things you can't control you should spend effort on making them within your control (eg architecture improvement). This takes time, so follow-the-sun rotation might be a stop gap solution, but you need to make sure it doesn't cover over the real problems without them getting any better.

ahelwer · on Feb 19, 2022

From experience, it's really hard to fix the root causes of issues when you were woken up three times the night before and had two more of the same incident occur during the workday. In my case I struggled along for a couple years but the best thing to do was just leave and let it be someone else's problem.

kqr · on Feb 19, 2022

Best thing for what? Surely not software quality and customer satisfaction.

ahelwer · on Feb 19, 2022

If they cared about that they would either pay me so much money I'd be insane to walk away or they would hire people in other time zones to cover the load. Instead they chose to pay for their customer satisfaction with my burnout. The thing about that strategy is... eventually the thing holding their customer satisfaction together gets burnt out. So I leave. And even then they're still getting the better half of the bargain.

kqr · on Feb 19, 2022

Sorry, I accidentally said you did the wrong thing for leaving. That wasn't my intention. Of course, leaving was the right choice for you.

What I meant was the company you were working for does not get the best quality or customer satisfaction by overworking you to the point where you have to leave. It would have been better for their software quality to handle things differently.

NationalPark · on Feb 19, 2022

I don’t think this is a stable long term solution. The “on call” teams end up frustrated with the engineers who ship bugs and this results in added process that delays deploys, arbitrary demands for test coverage, capricious error budgets, etc. It’s much better to have the engineers who wrote the code be responsible for running it, and if their operational burden becomes too high, to staff up the dev team to empower them to go after root causes. Plus the engineers who wrote the code always have better context than reliability people who tend to be systems experts but lack the business logic intuition to spot errors at a glance.

Hermitian909 · on Feb 19, 2022

I don't think the parent was implying you're never on call for your code, just only on call during working hours.

One of the challenges for larger companies in trying to make teams on-call 24/7 is that your most senior engineers often have enough money that they don't have to take on-call. Some variation of the following conversation happens in Big Tech more than most people seem to anticipate:

"hey, so I have 7 mil in the bank, a house, and kids; so I'm not taking on-call anymore"

"I understand on-call is a burden, but the practice is a big part of how we maintain operational excellence"

"Alright, I quit"

"Woah woah woah, uh, ok, what about we work on transitioning you out of on call over the next 6 months?"

"Nah, I'm done"

"This is going to be really disruptive to the team!"

"Yeah man it sucks, I really feel for you"

My understanding is a few famous outages at large cloud providers are a direct result of management not anticipating these conversations and assuming 24/7 on-call from a single geographically centered team of high powered engineers was sustainable.

solatic · on Feb 19, 2022

> The “on call” teams end up frustrated with the engineers who ship bugs and this results in added process that delays deploys, arbitrary demands for test coverage, capricious error budgets, etc.

This is poor operations culture. Software is no different from industrial manufacturing. You QA before you ship product to customers and you QA your raw materials before you start to process them. Operations is responsible for catching show-stopper bugs before they hit production. This means that operations is responsible for pushing to staging, not developers; operations stakeholders need to be looped into feature planning to ensure that feature work will easily integrate into the operations culture (somebody's got to tell the developers they can't adopt MySQL if it's a PostgreSQL shop, etc.). Fundamentally, Ops needs to be able to say No to Dev. The SRE take on it is to "hand the pager back to Dev", but the actual method of saying No is different from Ops culture to Ops culture.

> reliability people who tend to be systems experts but lack the business logic intuition to spot errors at a glance

If Dev didn't build the monitoring, the observability, put proper logging in place, etc., then honestly, Dev isn't going to spot the errors at a glance. Customer Service will when customers complain. @jedberg seems to think that Developers should write code to auto-solve their operations issues. If Developers can write code to auto-solve their operations issues, and Developers obviously anyway need to add telemetry etc., then why, pray tell, should it be so unreasonable to expect Developers to be able to succinctly add the kind of telemetry and documentation that explains the business logic, according to an Operations standard, such that Operations can thus keep the system running?

kqr · on Feb 19, 2022

Correct. Throwing software over the wall to "other people" and letting them deal with the problems of running the software is guaranteed to lead to low quality, inefficient processes, or usually both.

Ozzie_osman · on Feb 19, 2022

I'd argue that timezone is just part of the problem. If you're responsible for a high oncall load, you are subjected to a steady, unpredictable stream of interrupts requiring you to act to minimize downtime or degradation. Obviously it's worse if you get these at night, but it's still bad during the day.

I think the anti-pattern is having one team responsible for another's burden. You want teams to both be responsible for fixing their own systems when they break, AND be empowered to build/fix their broken systems to minimize oncall incidents.

toast0 · on Feb 19, 2022

At the end of the day, there's a human cost to responding to pages, and there's a human cost to collaboration.

Both of those can drive burn out. Personally, I find all that collaboration work very hard and stressful, so I work better in a situation where I get pages for the services I control; but that would change if pages were frequent and mostly related to dependencies outside of my control. It also helps to have been working in organizations that prioritize a working service over features. Getting frequent overnight issues that can't be resolved without third party effort that's not going to happen anytime soon is a major problem that I see reports of in threads like this.

I can also get behind a team that can manage the base operations issues like ram/storage/cpu faults on nodes and networking. The runbooks for handling those issues are usually pretty short and don't need much collaboration.

thebackup · on Feb 19, 2022

My experience is that the expectations on what your average engineer should be able to handle has grown enormously during the last 10 years or so. Working both with large distributed systems and medium size monolithic systems I have seen the expectations become a lot higher in both.

When I started my career the engineers at our company were assigned a very specific part of the product that they were experts on. Usually there were 1 or 2 engineers assigned to a specific area and they knew it really well. Then we went Agile(tm) and the engineers were grouped into 6 to 9 person teams that were assigned features that spanned several areas of the product. The teams also got involved in customer interaction, planning, testing and documentation. The days when you could focus on a single part of the system and become really good at it were gone.

Next big change came when the teams moved from being feature teams to devops teams. None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software.

In some ways I agree that these changes have empowered us. But it is also, as you say, exhausting. Once I was simply a programmer; now I'm a domain expert, project manager, programmer, tester, technical writer, database admin, operations engineer, and so on.

altacc · on Feb 19, 2022

It sounds like whomever shaped your teams & responsibilities didn’t take into account the team’s cognitive load. I find it’s often overlooked, especially by those who think agile means “everyone does everything”. The trick is to become agile whilst maintaining a division of responsibilities between teams.

If you look up articles about Team Topologies by Matthew Skelton and Manuel Pais, they outline a team structure that works for large, distributed systems.

thebackup · on Feb 19, 2022

I'll have a look a the book. Thanks!

ithkuil · on Feb 19, 2022

> None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software

On the flipside, in the olden days when one set of people were churning features and another set of people were given a black box to run and be responsible for keep it running, it was very hard to get the damn thing to work reliably and the only recourse you often had was to "just be more careful", which often meant release aversion and multi-year release cycles.

Hence, some companies explored alternatives, found ways to make them work, wrote about their success but a lot of people copied only half of the picture and then complained that it didn't work.

bckr · on Feb 19, 2022

> only half of the picture

Can you please share some details about what you think is missing from most "agile"/devops teams?

ithkuil · on Feb 19, 2022

Proper staffing

bckr · on Feb 19, 2022

Ah excellent. Yes. In my experience there's this idea of "scale at all costs"--a better way would probably be to limit scaling until the headcount is scaled. Although then you probably need more VC money.

datavirtue · on Feb 20, 2022

Might I add that you are also now underpaid. I had a sweet gig at a very small company where I had to manage contractors in addition to FTE staff. The good contractors billed $300 an hour for BA and project management services alone. The story munchers billed $150 an hour.

I had to leave a contracting gig recently because we were tasked with everything...literally everything. Everyone got so burnt out--FTEs included. I also might add that the developers could have spoken up and gotten relief but their misguided work ethic prevented that.

heisenbit · on Feb 19, 2022

In these large scale systems the boundaries are usually not well defined (there are APIs but data flowing through the APIs is another matter as are operational and non functional requirements).

Stress is often caused by a mismatch of what you feel responsible and accountable for and what you really control. The more you know the more you feel responsible for but you are rarely able to expand control as much or as fast as your knowledge. It helps to be very clear about where you have ultimate say (accountability) or control within some framework (responsibility) or simply know and contribute. Clear in your mind, others and your boss. Look at areas outside your responsibility with curiosity and willingness to offer support but know that you are not responsible and others need to worry.

tylerrobinson · on Feb 19, 2022

This is spot on. Feeling frustrated working on large distributed systems could be generalized as “feeling frustrated working in a large organization” because the same limitations apply. You learn about things you cannot control, and it is important to see the difference between what you can control and contribute and what you can’t.

hliyan · on Feb 19, 2022

The first ten years of my career, I worked with distributed systems built on this stack: C++, Oracle, Unix (and to some extent, MFC and Qt). There were hundreds of instances of dozens of different type of processes (we would now call these microservices) connected via TCP links, running on hundreds of servers. I seldom found this exhausting.

The second ten years of my career, I worked with (and continue to work on) much more simpler systems, but the stack looks like this: React/Angular/Vue.js, Node.js/SpringBoot, MongoDB/MySQL/PostGreSQL, ElasticSearch, Redis, AWS (about a dozen services right here), Docker, Kubenetes. _This_ is exhausting.

When you spend so much time wrangling a zoo of commercial products, each with its own API and often own vocabulary for what should be industry standards (think TCP/IP, ANSI, ECMA, SQL), and being constantly obsoleted by competing "latest" products, that you don't have enough time to focus on code, then yes, it can be exhausting.

softwarebeware · on Feb 19, 2022

You know what? This is a really great point. When I reflect back on my career experience (at companies like Expedia, eBay, Zillow, etc.) the best distributed systems experience I had was at companies that standardized on languages and frameworks and drew a pretty strong boundary around those choices.

It wasn't that you technically couldn't choose another stack for a project, but to do so you had to justify the cost/benefit with hard data, and the data almost never bore out more benefit than cost.

asymmetric · on Feb 19, 2022

Reminds me of http://boringtechnology.club/

datavirtue · on Feb 20, 2022

Modern day embarrassing spaghetti cloud.

uselessextras · on Feb 20, 2022

Absolutely right.

qxmat · on Feb 19, 2022

I've found that external tech requirements are horrible to work with, especially when the underlying stack simply doesn't support it. Normally these are pushed by certified cloud consultants or by an intrepid architect who found another "best practice blog."

It's begins with small requirements such as coming up with a disaster recovery plan only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.

Blink and you're implementing redundant networking (cross AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per service health checks, EFS/FSX network mounts for persistent data that expensive enterprise app insists storing on-disk and some kind of HA database/multi-master SQL cluster.

... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).

If only the requirement was for a DR run-book + regular fire drills.

theptip · on Feb 19, 2022

I think this highlights the importance of actually analyzing your RP/RT (recovery point/recovery time) requirements through the lens of business value, and being honest about the ROI of buying that extra 9 in uptime.

It may be the case that 2 hours of downtime is completely unacceptable for the business, and paying $Xmm extra per year to maintain it is the right call. Or it may be that the business would be horrified to learn how many dollars are being spent to avert a level of downtime that no customer would notice or care about.

If the requirement is just being set by engineering, then it's more about finding the equilibrium where the resource spent on automation balances the cost of the manual toil and the associated morale impact on the team. Nobody wants to work on a team where everything is on fire all the time, and it's time/money well spent to avert that situation.

fullstackchris · on Feb 19, 2022

...how is the JIRA server mission critical? is it tied to CI/CD somehow?

qxmat · on Feb 19, 2022

In the enterprise you'll find that Jira is used for general workflow management not just CICD. I've encountered teams of analysts spend their working day moving and editing work items. It's the Quicken of workflow management solutions.

Jira Server is deliberately cobbled by the sequence table + no Aurora support and now EOL (no security updates 1 year after purchase!). DC edition scales horizontally if you have 100k.

Jira in general is a poorly thought out product (looking at you customfield_3726!) but it's held in such a high regard by users it's impossible to avoid.

hogrider · on Feb 19, 2022

Pre covid I would have laughed at this. But now, no one knows what a user story should be unless you can reas it off jira and there are no backups of course.

_clhx · on Feb 19, 2022

Gives me a fun idea: a program that randomly deletes items out of your backlog.

Gwypaas · on Feb 19, 2022

"Chaos engineering for your backlog"

wreath · on Feb 20, 2022

I done that. I deleted items from the backlog that i thought make no sense (anymore), nobody cared or asked any questions. If you didn't work on it for the last 18 months, it's probably not important and nobody cares.

nikhilsimha · on Feb 19, 2022

I used to lead teams that owned message bus, a stream processing framework and a distributed scheduler (like k8s) at Facebook.

The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.

Another thing is, we had to cut down on the complexity - reduce number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.

Now Facebook being Facebook, valued speed and complexity over stability and simplicity. Specially when it comes to career growth discussions. So it’s hard to build good infra in the company.

robertlagrant · on Feb 19, 2022

I like that the mantra went from "move fast and break things" to (paraphrased) "move fast and don't break things".

ehnto · on Feb 19, 2022

It's been a pretty poor mantra from the beginning anyway. How about we move at a realistic pace and deliver good features, without burning out, and without leaving a trail of half-baked code behind us?

robertlagrant · on March 1, 2022

I think it's probably less fun to gradually replace things with better things than to - say - write your own alternative PHP backend.

wilde · on Feb 19, 2022

Without more info it’s hard to say. When I felt like this, a manager recommended I start journaling my energy. I kept a Google doc with sections for each week. In each section, there’s a bulleted list of things I did that gave me energy and a list of things I did that took energy.

Once you have a few lists some trends become clear and you can work with your manager to shift where you spend time.

m_herrlich · on Feb 19, 2022

I love building and developing software, and despite the fun and interesting challenges presented at my last job I quit because of the operations component. We adopted DevOps and it felt like "building" got replaced with "configuring" and managing complex configurations does not tickle my brain at all. Week-long on-call shifts are like being under house arrest 24/7.

I understand the value that developers bring to operational roles, and to some extent making developers feel the pain of their screwups is appropriate. But when DevOps is 80% Ops, you need a fundamentally different kind of developer.

throwhauser · on Feb 19, 2022

After-hours on-call is a thing that needs to be destroyed. A company that is sufficiently large that the CEO doesn't get woken up for emergencies needs to have shifts in other timezones to handle them. I don't know why people put up with it.

bckr · on Feb 19, 2022

Part of it is a culture that discourages complaining about after hours work.

There's an expectation that everyone is a night owl and that night time emergency work is fun, and that these fires are to be expected.

Finally, engineers seem to get this feeling of being important because they wake up and work at night. It's really a form of insanity.

jmyeet · on Feb 19, 2022

It's hard to answer this because you don't specify what exactly you find exhausting. Is it oncall? Deployment? Performance issues? Dealing with different teams? Failures and recovery? The right hand not knowing what the left hand is doing? Too many services? Something else?

It's not even clear how big your service is. You mention billions of requests per month. Every 1B requests/month translates to ~400 QPS, which isn't even that large. Like, that's single server territory. Obviously spikiness matters. I'd also be curious what you mean by "large amount of data".

wreath · on Feb 19, 2022

> Every 1B requests/month translates to ~400 QPS, which isn't even that large

I said billions not one billion.

I guess what I find exhausting is the long feedback cycle. For example, Writing a simple script that makes two calls to different APIs requires tons of wiring for telemetry, monitoring, logging, error handling, integrating w/ two APIs, setting up the proper kubernetes manifests, setting up the required permissions to run this thing and have them available to k8s. I find all this to be exhausting. We're not even talking about operating this thing yet (on call, running in issues with the APIs owned by other teams etc)

sangnoir · on Feb 19, 2022

This sounds like your team/organization needs to invest in tooling. Processes that take long should ideally be automated and done async, notification of the result is generated some time later, freeing up some of your time.

bfung · on Feb 19, 2022

Automate that process that you find tedious; if you find it tedious, ask your coworkers if they do as well. Make the right time/automation trade offs. https://xkcd.com/1205/

Yes, work is tedious.

jedberg · on Feb 19, 2022

I find it exhilarating, but you have to have a well architected distributed system. Some key points:

- Your micro service should be able to run independently. No shared data storage, no direct access into other microservices' storage.

- Your service should protect itself from other services, rejecting requests before it becomes overloaded.

- Your service should be lenient on the data it accepts from other services, but strict about what it sends.

- Your service should be a good citizen, employing good backoffs when other services it is calling appear overloaded.

- The API should be the contract and fully describe your service's relationship to the other services. You should absolutely collaborate with the engineers who make other services, but at the end of the day anything you agree on should be built into the API.

Generally if you follow these best practices, you shouldn't have to maintain a huge working knowledge of the system, only detailed knowledge of your part, which should be small enough to fit into your mental model.

There will be a small team of people responsible for the entire system and how it fits together, but ideally if everyone is following these practices, they won't need to know details of any system, only how to read the APIs and the call graph and how the pieces fit together.

sillysaurusx · on Feb 19, 2022

Jobs aren’t exhausting. Teams are. If you find yourself feeling this way, consider that the higher ups may be mismanaging.

There’s often not a lot of organizational pressure to change anything. So the status quo stays static. But the services change over time, so the status quo needs to change with them.

softwarebeware · on Feb 19, 2022

Agree with this. Conway's Law will always hold. If a company does not organize it's teams into units that actually hold full responsibility and full control/agency over that responsibility, those teams will burn out.

When getting anything done requires constant meetings, placing tickets, playing politics, and doing anything and everything to get other teams to accept that they need to work with you and prioritize your tasks so that you can get them done, you will burn out.

angarg12 · on Feb 19, 2022

I don't find it exhausting, I find it *exhilarating*.

After years of proving myself, earning trust and strategical positioning I am finally leading a system that will support millions of requests per second. I love my job and this is the most intellectually stimulating activity I have done in a long while.

I think this is far from the expectation of the average engineer. You can find many random companies with very menial and low stake work. However if you work at certain companies you sign up for this.

BTW I don't think this is unreasonable. This is precisely why programmers get paid big bucks, definitely in the US. We have have a set of skills that require a lot of talent and effort, and we are rewarded for it.

Bottom line this isn't for everyone, so if you feel you are done with it that's fair. Shop around for jobs and be deliberate about where you choose to work, and you will be fine.

bob1029 · on Feb 19, 2022

> I am finally leading a system that will support millions of requests per second.

This is the difference. Millions of things per second is a super hard problem to get right in any reality. Pulling this off with any technology at all is rewarding.

Most distributed systems are not facing this degree of realistic challenge. In most shops, the challenge is synthetic and self-inflicted. For whatever reason, people seem to think saying things like "we do billions of x per month" somehow justifies their perverse architectures.

artiscode · on Feb 19, 2022

Your story is close to home. I was part of a team that integrated our newly-acquired startup with a massive, complex and needlessly distributed enterprise system that burned me out.

Being forced to do things that absolutely did not make sense(CS wise) was what I found to be most exhausting. Having no other way than writing shitty code or copying functionality into our app led me to an eventual burnout. My whole career felt pointless as I was unable to apply any of my skills and expertise that I learned over all these years, because everything was designed in a complex way. Getting a single property into an internal API is not a trivial task and requires coordination from different teams as there are a plethora of processes in place. However I helped to build a monstrous integration layer and everything wrong with it is partly my doing. Hindsight is 20/20 and I now see there really was no other, better way to do it, which feels nice in a schadenfreude kind of way.

I sympathise with your point about not understanding what is expected of an average engineer nowadays. Should you take initiative and help manage things, are you allowed to simply write code and what should you expect from others were amongst my pain points. I certainly did not feel rewarded for going the extra mile, but somehow felt obliged because of my "senior" title.

I took therapy, worked on side projects and I'm now trying out a manager role. My responsibilities are pretty much the same, but I don't have to write code anymore. It feels empowering to close my laptop after my last Zoom meeting and not think about bugs, code, CI or merging tomorrow morning because it's release day tomorrow.

But hey, grass is always greener on the other side! I think taking therapy was one of my life's best decisions after being put through the ringer. Perhaps it will help you as well!

throwaway984393 · on Feb 19, 2022

It's exhausting when the business does not give you the support you need and leans on you to do too much work. Find another place to work where they do things without stress (ask them in the interview about their stress levels and workload). Make sure leadership are actively prioritizing work that shores up fundamental reliability and continuously improves response to failure.

When things aren't a tire fire, people will still ask you to do too much work. The only way to deal with it without stress is to create a funnel.

Require all new requests come as a ticket. Keep a meticulously refined backlog of requests, weighted by priorities, deadlines and blockers. Plan out work to remove tech debt and reduce toil. Dedicate time every quarter to automation that reduces toil and enables development teams to do their own operations. Get used to saying "no" intelligently; your backlog is explanation enough for anyone who gets huffy that you won't do something out of the blue immediately.

bob1029 · on Feb 19, 2022

> We run into very interesting problems due to scale (billions of requests per month for our main public apis) and the large amount of data we deal with.

So, if you are handling 10 billion requests per month, that would average out to about 4k per second.

Are these API calls data/compute intensive, or is this more pedestrian data like logging or telemetry?

Any time I see someone having a rough time with a distributed system, I ask myself if that system had to be distributed in the first place. There is usually a valuable lesson to be learned by probing this question.

petters · on Feb 19, 2022

Yes! A single machine can handle tons of traffic in many cases.

jacquesm · on Feb 19, 2022

That question probably needs more information.

But your 'average engineer' is probably better served by asking themselves the question whether the system really needed to be that large and distributed rather than if working on them is exhausting. The vast bulk of the websites out there doesn't need that kind of overkill architecture, typically the non-scalable parts of the business preclude needing such a thing to begin with. If the work is exhausting that sounds like a mismatch between architecture choice and size of the workforce responsible for it.

If you're an average (or even sub average) engineer in a mid sized company stick to what you know best and how to make that work to your advantage, KISS. A well tuned non-distributed system with sane platform choices will outperform a distributed system put together by average engineers any day of the week, and will be easier to maintain and operate.

softwarebeware · on Feb 19, 2022

I find it "exhilirating," not "exhausting." But I also don't think that "...your average engineer should now be able to handle all this." That is where we went completely wrong as an industry. It used to be said that what we work on is complex, and you can either improve your tools or you can improve your people. I've always held that you will have to improve your people. But clever marketing of "the cloud" has held out the false promise that anyone can do it.

Lies, lies, and damn lies, I say!

Unless you have bright and experienced people at the top of a large distributed systems company, who have actually studied and built distributed systems at scale, your experience of working in such a company is going to suck, plain and simple. The only cure is a strong continuous learning culture, with experienced people around to guide and improve the others.

ublaze · on Feb 19, 2022

Yeah, large-scale systems are often boring in my experience, because the scale limits what features you can add to make things better. Each and every decision has to take scale into account, and it's tricky to try experimenting.

I think it has to do with the kind of engineer you are. Some engineers love iterating and improving such systems to be more efficient, more scalable, etc. But it can be limiting due to the slower release cycles, hyper focus on availability, and other necessary constraints.

harshaw · on Feb 19, 2022

I don't think they are boring, but very important on the kind of engineer you are. At AWS I try to encourage people who like the problem space and at the very least appreciate it, but can totally understand that you don't want to do your entire career on it. Many of our younger folks have never felt the speed and joy you can get with hammering out a simple app (web, python, ML) that doesn't have to work at scale.

guilhas · on Feb 19, 2022

Recently I was asked to work on a older project for enterprise customers. And we are always weary of working on old unmaintained code

But it just felt like a breath of fresh air

All code in same repository, UI, back-end, SQL, MVC style Fast from feature request to deliver in production. Changes, test, fix bugs, deploy. We were happy and the customers too

No cloud apps, buckets, secrets, no oauth, little configuration, no docker, no micro services, no proxies, no CICD. It does look somewhere along the way we overcomplicated things

BatteryMountain · on Feb 19, 2022

100% agree with you. OAuth + Docker/Kubernetes + massive configs to make things to build sucks the life out of every project for me that has them. And when it uses a non-git version control system.

benlivengood · on Feb 19, 2022

Google's SRE books cover a lot of the things that large teams managing large distributed systems encounter and how to tackle it in a way that doesn't burn out engineers. Depending on organization size/spread, follow-the-sun oncall schedules drastically reduce burnout and apprehension about outages. Incident management procedures give confidence when outages do happen. Blameless postmortems provide a pathway to understanding and fixing the root causes of troublesome outages. Automation reduces manual toil. Google SRE has been keeping a lot of things running for a decade or more and has learned a lot of lessons. I did that from 2014 to 2018 and it seemed like a pretty mature organizational approach, and the books document essentially that era.

chubot · on Feb 19, 2022

My take is that it's exhausting because everything is so damn SLOW.

"Back to the 70's with Serverless" is a good read:

https://news.ycombinator.com/item?id=25482410

The cloud basically has the productivity of a mainframe, not a workstation or PC. It's big and clunky.

----

I quote it in my own blog post on distributed systems

http://www.oilshell.org/blog/2021/07/blog-backlog-2.html

https://news.ycombinator.com/item?id=27903720 - Kubernetes is Our Generation's Multics

Basically I want basic shell-like productivity -- not even an IDE, just reasonable iteration times.

At Google I saw the issue where teams would build more and more abstraction and concepts without GUARANTEES. So basically you still have to debug the system with shell. It's a big tower of leaky abstractions. (One example is that I had to turn up a service in every data center at Google, and I did it with shell invoking low level tools, not the abstractions provided)

Compare that with the abstraction of a C compiler or Python, where you rarely have to dip under the hood.

IMO Borg is not a great abstraction, and Kubernetes is even worse. And that doesn't mean I think something better exists right now! We don't have many design data points, and we're still learning from our mistakes.

----

Maybe a bigger issue is incoherent software architectures. In particular, disagreements on where authoritative state is, and a lot of incorrect caches that paper over issues. If everything works 99.9% of the time, well multiple those probabilities together, and you end up with a system that requires A LOT of manual work to keep running.

So I think the cloud has to be more principled about state and correctness in order not to be so exhausting.

If you ask engineers working on a big distributed system where the authoritative state in their system is stored, then I think you will get a lot of different answers...

seanwilson · on Feb 19, 2022

It's okay to prefer working on small single server systems with small teams for example. I do this while contracting quite often and enjoy how much control you get to make big changes with minimal bureaucracy.

Sometimes it feels like everyone is focused on eventually working with Google scale systems and following best practices that are more relevant towards that scale but you can pick your own path.

rhacker · on Feb 19, 2022

Humans GET simplicity from extreme hyper complexity.

Take a gas generator. Easy, add oil and gas and get electricity and these days they even come in a smoothed over plastic shell that makes it look like a toy. Inside, very complex, spark plugs, engine, coils, inverter. A hundred years of inventions packed into a 1.5' x 1.5' box.

It's the same thing for complicated systems. Front end to back. No matter how ugly or how much you wish it was refactored - some exec knows it as a box where you put something in and magical inference comes out. Maybe that box actually causes real change in the physical world - like billions of packages being sent out all over the world.

In the days of castles you would have similar systems managed by people. People that drag wooden carts of shit out of a castle. Carrying water around. Manually husking corn and wheat and what have you.

No matter how far into the future we go, we will continue to get simple out of monstrous complexity.

That's not the answer to your question - but it's just that the world will always lean towards going that way.

lumost · on Feb 19, 2022

Handling scale is a technically challenging problem, if you enjoy it - then take advantage! however sometimes taking a break to work on something else can be more satisfying.

Typically on a "High scale" service spanning hundreds or thousands of servers you'll have to deal with problems like. "How much memory does this object consume?", "how many ms will adding this regex/class into the critical path use?", "We need to add new integ/load/unit tests for X to prevent outage Y from recurring", and "I wish I could try new technique Y, but I have 90% of my time occupied on upkeep".

It can be immensely satisfying to flip to a low/scale, low/ops problem space and find that you can actually bang out 10x the features/impact when you're not held back by scale.

Source: Worked on stateful services handling 10 Million TPS, took a break to work on internal analytics tools and production ML modeling, transitioning back to high scale services shortly.

karmakaze · on Feb 19, 2022

I'm trying to relate this to my experiences. The best I can make of it is that burnout comes from dealing with either the same types of problems, or new problems at a rate that's higher than old problems get resolved.

I've been in those situations. My solution was to ensure that there was enough effort into systematically resolving long-known issues in a way that not only solves them but also reduces the number of new similar issues. If the strategy is instead to perform predominantly firefighting with 'no capacity' available for working on longer term solutions there is no end in sight unless/until you lose users or requests.

I am curious what the split is of problems being related to:

1. error rates, how many 9s per end-user-action, and per service endpoint

2. performance, request (and per-user-action) latency

3. incorrect responses, bugs/bad-data

4. incorrect responses, stale-data

5. any other categories

Another strategy that worked well was not to fix the problems reported but instead fix the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time consuming. This of course needs to continue, but if sufficient effort it put to resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect reducing problems in general.

A specific example was how effort into performance was toward average latency for the most frequently used endpoints. I changed the effort instead to reduce the p99 latency of the worst offenders. This made the system more reliable in general and paid off in a trend to fewer problem reports, though it's not easy/possible to directly relate one to the other.

smoyer · on Feb 19, 2022

Using micro-services instead of monoliths is a great way for software engineers to reduce the complexities of their code. Unfortunately, it moves the complexity to operations. In an organization with a DevOps culture, the software engineers still share responsibility for resolving issues that occur between their micro-service and others.

In other organizations, individual teams have ICDs and SLAs for one or more micro-services and can therefore state they're meeting their interface requirements as well as capacity/uptime requirements. In these organizations, when a system problem occurs, someone who's less familiar with the internals of these services will have to debug complex interactions. In my experience, once the root-cause is identified, there will be one or more teams who get updated requirements - why not make them stakeholders at the system-level and expedite the process?

ickyforce · on Feb 19, 2022

> Using micro-services instead of monoliths is a great way for software engineers to reduce the complexities of their code

Could you share why you think that's true?

IMO that it's exactly the opposite - microservices have potential to simplify operations and processes (smaller artifacts, independent development/deployments, isolation, architectural boundaries easier to enforce) but when it comes to code and their internal architecture - they are always more complex.

If you take microservices and merge them into a monolith - it will still work, you don't need to add code or increase complexity. You actually can remove code - anything related to network calls, data replication between components if they share a DB, etc.

gbtw · on Feb 19, 2022

In all the situations i have had to work on microservices it generally means the team just works on all the different services, now spread out over more applications. Doing more integration work vs actual business logic. Because the fancy microservices the architect wanted doesn't mean there's actually money to do it properly or even have an ops team.

Also for junior team members a lot of this stuff works via magic because they can't yet oversee where the boundaries are or do not understand all the automagically configuration stuff.

Also the amount of works on my machine with docker is staggering even if the developers laptop's are the same batch / imaged machine.

Too · on Feb 20, 2022

One problem I frequently see with distributed systems is not the amount of services and the distributed nature per se.

Rather that it allows, and tempts, you to use the perfect tool for each job. Leading to a lot of variations in your stack.

Suddenly you have 5 different databases, 3 RPC protocols, 4 programming languages and 2 operating systems spinning around in your cluster. Only half of them connected to your single sign on. And don’t forget about all the cloud dependencies.

If any one of them starts misbehaving you have to read up “how did I attach debugger to Java process again”. How do I even log in to a mongodb shell? I installed pgadmin last week.

Standardize your stack and accept that some times it might mean using something slightly inefficient in the small scheme. In the big scheme it will make things more homogenous, unified and simpler for operators.

glintik · on Feb 19, 2022

The most undervalued thing that forgot even highly skilled engineers - KISS principle. That’s why you are burning out supporting such systems.

jeffrallen · on Feb 19, 2022

Yes, it's amazing how much one modern high spec system running good code can do. Turn off all the distributed crap and just use a pair in leader/follower config with short ttl DNS to choose the leader and manual failover scripts. If your app/company/industry cannot accept the compromises from such a simple config, quit and work in one which can.

BatteryMountain · on Feb 19, 2022

Good code? Where?

This whole thread feels like therapy since I face the same monsters on the systems I work on. Partly due to bad platform & code, partly due to bad organization structure (Conway's 100% for us).

My pet projects at home is the only thing keeping me sane, mostly because they are simple.

avensec · on Feb 19, 2022

Yes, but in a different way. I work in Quality Engineering, and the scope of maturity in testing distributed systems has been exhausting.

Reading other comments from the thread, I see similar frustrations from teams I partner with. How to employ patterns like contact, hypothesis, doubles, or shape/data systems (etc.) typically gets conflated with System testing. Teams often disagree on the boundaries of the system start leaning towards System Testing, and end up adding additional complexity in tests that could be avoided.

My thought is that I see the desire to control more scope presenting itself in test. I typically find myself doing some bounded context exercises to try to hone in on scope early.

yodon · on Feb 20, 2022

I so wish there were in-person meetups and conferences going on so I might have been nearby and overheard you saying that so I could try to join in the conversation. Sounds fascinating and just the sort of insight that doesn't come up in the entirely planned and scheduled zooms I'm usually in (and HN, for all its virtues, isn't really a substitute for a great conversation).

asim · on Feb 19, 2022

Yup. Spent more than a decade doing it. Got so frustrated that I started a company to try abstract it all away for everyone else. It's called M3O https://m3o.com. Everyone ends up building the same thing over and over. A platform with APIs either built in house or an integration to external public APIs. If we reuse code, why not APIs.

I should say, I've been a sysadmin, SRE, software engineer, open source creator, maintainer, founder and CEO. Worked at Google, bootstrapped startups, VC funded companies, etc. My general feeling, the cloud is too complex and I'm tired on waiting for others to fix it.

randomsilence · on Feb 19, 2022

>Consume public APIs as simpler programmable building blocks

Is the 'r' in simpler there intentionally? In which way are the building blocks more simple than simple blocks?

asim · on Feb 21, 2022

Simpler than the public APIs

macksd · on Feb 19, 2022

Mental / emotional burnout is certainly not uncommon in tech (probably in most other careers, I'd bet). Most people in Silicon Valley are changing jobs more often than 4-5 years. I don't like to constantly be the new guy, but there is a refreshing feeling to starting on something new and not carrying years of technical debt on your emotions. Maybe it's time to try something new, take a bigger vacation than usual, or talk to someone about new approaches you can try in your professional or personal life. But certainly don't let the fact that you feel like this add to the load - you're not alone, and it's not permanent.

eez0 · on Feb 19, 2022

I find it actually the other way around.

As you said, a benefit of large distributed systems is that usually its a shared responsibility, with different teams owning different services.

The exhaustion comes into place when those services are not really independent, or when the responsibility is not really shared, which in turn is just a worse version of a typical system maintained by sysadmins.

One thing that helps is bring the DevOps culture into the company, but the right way. It's not just about "oh cool we are now agile and deploy a few times a day", it's all down to shared responsibility.

kortex · on Feb 19, 2022

It definitely can be. I'm constantly trying to push our stack away from anti-patterns and towards patterns that work well, are robust, and reduce cognitive load.

It starts by watching Simple Made Easy by Rich Hickey. And then making every member of your team watch it. Seriously, it is the most important talk in software engineering.

https://www.infoq.com/presentations/Simple-Made-Easy/

Exhausting patterns:

- Mutable shared state

- distributed state

- distributed, mutable, shared state ;)

- opaque state

- nebulosity, soft boundaries

- dynamicism

- deep inheritance, big objects, wide interfaces

- objects/functions which mix IO/state with complex logic

- code than needs creds/secrets/config/state/AWS just to run tests

- CI/CD deploy systems that don't actually tell you if they successfully deployed or not. I've had AWS task deploys that time out but actually worked, and ones that seemingly take, but destabilize the system.

---

Things that help me stay sane(r):

- pure functions

- declarative APIs/datatypes

- "hexagonal architecture" - stateful shell, functional core

- type systems, linting, autoformatting, autocomplete, a good IDE

- code does primarily either IO, state management, or logic, but minimal of the other ops

- push for unit tests over integration/system tests wherever possible

- dependency injection

- ability to run as much of the stack locally (in docker-compose) as possible

- infrastructure-as-code (terraform as much as possible)

- observability, telemetry, tracing, metrics, structured logs

- immutable event streams and reducers (vs mutable tables)

- make sure your team takes time periodically to refactor, design deliberately, and pay down tech debt.

solididiot · on Feb 19, 2022

Only read the transcript but I'm not getting most of it. I mean it starts with a bunch of aphorisms we all agree with but when it should be getting more concrete it goes on with statements that are kind of vague.

E.g. what exactly does it mean to: >> Don’t use an object to handle information. That’s not what objects were meant for. We need to create generic constructs that manipulate information. You build them once and reuse them. Objects raise complexity in that area.

What kind of generic constructs?

islandert · on Feb 19, 2022

I agree with most of you points, but the one that stands out is "push for unit tests over integration/system tests wherever possible".

By integration/system tests, do you mean tests that you cannot run locally?

LoveGracePeace · on Feb 19, 2022

Most of that I agree with, I'm curious why you'd recommend unit tests over integration tests? It seems at odds with the direction of overall software engineering best practices.

unnouinceput · on Feb 19, 2022

I wrote such a system. 6+ years, between end of '07 to beginning of '14. It grew organically, with more and more end points as time went by, and when I exited the project it had over 250 end points, each having hundreds of thousand of users requests per day. By your measurement, that would mean the system I wrote would've handled in a month a total of 250 (end points) x 30 (days) x ~400k (requests per day) == 3B user requests in a month.

To my knowledge the system is still used to this day and I think it grew 10x meanwhile, so I think it's serving over 30B requests each month.

That being said, to answer your question - Yes! I got tired of it, started to plateau and felt I was lagging behind in terms of keeping up with technology around me. So I exited but at the same time I also started to get involved in other projects as well. So in the end I was overworked and I ditched the biggest project of my entire career as freelancer because payment was not worth anymore. I wanted to feel excited and the additional projects eventually made up in terms of money, but boy oh boy! the variation is what made me not feeling burnout. Nowadays if I feel another project is going that route I discuss with client to replace me with a team once I deliver the project in a stable state and for horizontal scaling.

bofaGuy · on Feb 19, 2022

Worked on a team at BofA, our application would handle 800 million events per day. The logic we had for retry and failure was solid. We also had redundancy across multiple DCs. I think we processed like 99.9999999% of all events successfully. (Basically all of them, last year we lost about 2,000 events total) I didn’t find it very stressful at all. We build in JMX Utica for our production support teams be able to handle practically anything they would need to.

bofaGuy · on Feb 19, 2022

Utils*

ChrisMarshallNY · on Feb 19, 2022

TLDR; Yes, it is exhausting, but I have found ways to mitigate it.

I don't develop stuff that runs billions of queries. More like thousands.

It is, however, important infrastructure, on which thousands of people around the world, rely, and, in some cases, it's not hyperbole to say that lives depend on its integrity and uptime.

One fairly unique feature of my work, is that it's almost all "hand-crafted." I generally avoid relying on dependencies out of my direct control. I tend to be the dependency, on which other people rely. This has earned me quite a few sneers.

I have issues...

These days, I like to confine myself to frontend work, and avoid working on my server code, as monkeying with it is always stressful.

My general posture is to do the highest Quality work possible; way beyond "good enough," so that I don't have to go back and clean up my mess. That seems to have worked fairly well for me, in at least the last fifteen years, or so. Also, I document the living bejeezus[0] out of my work, so, when I inevitably have to go back and tweak or fix, in six months, I can find my way around.

[0] https://littlegreenviper.com/miscellany/leaving-a-legacy/

zaphirplane · on Feb 19, 2022

Front end and no dependencies, tell us more

ChrisMarshallNY · on Feb 19, 2022

Feel free to see for yourself. I have quite a few OS projects out there. My GH ID is the same as my HN one.

My frontend work is native Swift work, using the built-in Apple frameworks (I ship classic AppKit/UIKit/WatchKit, using storyboards and MVC, but I will be moving onto the newer stuff, as it matures).

My backend work has chiefly been PHP. It works quite well, but is not where I like to spend most of my time.