Ask HN: Do you find working on large distributed systems exhausting?
316 points by wreath on Feb 19, 2022 | 249 comments
I've been working on large distributed systems for the last 4-5 years, with teams owning a few services or having different responsibilities to keep the system up and running. We run into very interesting problems due to scale (billions of requests per month for our main public APIs) and the large amount of data we deal with.

I think it has progressed my career and expanded my skills, but I feel it's pretty damn exhausting to manage all this even when following a lot of the best practices and working with other highly skilled engineers.

I've been wondering recently if others feel this kind of burnout (for lack of a better word). Is the expectation that your average engineer should now be able to handle all this?



Yes, I used to,

but No, I fixed it :)

Among other things, I am team lead for a private search engine whose partner-accessible API handles roughly 500 mio requests per month.

I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

But over time, management's "stand on the shoulders of giants" brainwashing wore off so that they actually started to read all the "AWS outage XY" information that we forwarded to them. They started to actually believe us when we said "Nothing we can do, call Amazon!". And then, I found a struggling hosting company with almost compatible tooling and we purchased them. And I moved all of our systems off the public cloud and onto our private cloud hosting service.

Nowadays, people still hold me (at least emotionally) accountable for any issue or downtime, but I feel much better about it :) Because now it actually is within my circle of power. I have root on all relevant servers, so if shit hits the fan, I can fix things or delegate to my team.

Your situation sounds like you will constantly take the blame for other people's faults. I would imagine that to be disheartening and extremely exhausting.


I feel that your problems aren't even remotely related to my problems with large distributed systems.

My problems are all about convincing the company that I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user, it is usually some internal system related to data storage or processing which can't cope anymore.

Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog. Immediately you have a million problems like data migration, reporting, data ingestion, making it work with all the related systems like search, recommendations, reviews and so on.

And even if you get the ball rolling you have to work across dozens of different teams which can be hard because naturally people resist change.

Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.


> Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.

I used to work at a unicorn a few years ago, and this hits close to home. From 2016 to 2020, the pages didn't change a single pixel, yet we had 400 more engineers working on the code and three stack iterations: full-stack PHP, PHP backend + React SSR frontend, and Java backend + [redacted] SSR frontend (redacted because only two popular companies use this framework). All were rewrites, and those rewrites were justified because none of them was ever stable; the site was constantly going offline. However, each rewrite just added more bloat and failure points. At some point all three of them were running in tandem: PHP for legacy customers, another as the main site, and another in an A/B test. (Yeah, it was a dysfunctional environment and I obviously quit).


> Yeah, it was a dysfunctional environment and I obviously quit

What do you think management could have done better to make it not dysfunctional and keep people from quitting?


I think just common sense and less bullshit rationalisation would have been enough.

They had a billion dollars in cash to burn, so they hired more than they needed. They should have hired as needed, not as requested by Masayoshi Son.

They shouldn't be so dogmatic. Some teams were too overworked, most were underworked (which means over-engineering will ensue), but no mobility was allowed because "ideally teams have N people".

They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject. So yeah, our internal tool for putting balloons and clowns into images lived together with the authentication micro-service, because it's the same team.

Rewriting everything twice without analysis was wrong. The rewrites happened because the previous versions were "too complex" and too custom-made, yet the newer ones had an even more complex architecture, but "this time it's right, software sometimes needs complexity".

Acknowledging that some things were terrible would have gone a long way. The main node.js server would take 10 to 20 minutes to launch locally, while something of the same complexity would often take about 2 or 3 seconds. Of course it would blow up in production! Maybe try to fix that instead of ordering another rewrite.

They were good people, I miss the company and still use the product, but it didn't need to be like this.


> They shouldn't be so dogmatic pt 2. Services were one-per-team, instead of one-per-subject.

Where the heck did this come from? AIUI, the ideal is supposed to be one-team-per-service, not one-service-per-team.


It comes from a dogmatic reaction against microservices. Microservices were problematic in certain ways, but instead of analysing what went wrong and why, they just went the opposite direction and started doing "big services only". It was a misguided approach, plain and simple.

Interestingly due to internal bureaucracy and understaffing in some teams, there was a lot of "multiple-teams-per-service", which yeah, is another issue in itself.



I don't know your specifics, but I have worked on some large scale architecture changes, and 200 engineers + a 2 year feature freeze is generally not a reasonable ask. In practice you need to find an incremental path with validation and course correction along the way to limit the amount of concurrent change in flight at any moment. If you don't do this, you run a very high risk of the entire initiative collapsing under its own weight.

Assuming your estimation is more or less correct and it really is a 400 eng-year project, then you also need political capital as well as technical leadership to make it happen. There are lots of companies where a smart engineer can see a potential path out of a local maximum, but the org structure and lack of technical leadership in the highest ranks means that the problem is effectively intractable.


>I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future

Sounds like a typical massive rewrite project. They almost never succeed: many fail outright, and most hardly even reach the functionality/performance/etc. level of the stuff the rewrite was supposed to replace. 2-4 years is typical for such a glorious attempt before it is closed or folded into something else. Management in general likes such projects, and they usually declare victory around the 2-year mark and move on on the wave of the supposed success before reality hits the fan.

>to convince anyone to take engineers out of product development.

That means raiding someone's budget. Not happening :) A new glorious effort needs a new glorious budget - that is what management likes, not doing much more on the same budget, as you're basically suggesting (i.e. I'm sure you'll get much more traction if you restate your proposal as "to hire 200 more engineers ...", because that way you'll be laying a serious technical foundation for some mid-managers to grow on :). You're approaching this as an engineer and thus losing at what is a management game (or, as Sun Tzu pointed out, one has to understand the enemy).


My impression has always been that FAANG need lots of engineers because the 10xers refuse to work there. I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing. FAANG instead seem to be more into chasing trends, inventing new frameworks, rewriting to another more hip language, etc.

I would have no idea how to coordinate 200 engineers. But then again, I have never worked on a project that truly needed 50+ engineers.

"Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog." Probably that's 4 friends in a basement, similar to the core Android team ;)


Your impression comes from the fact that you have not worked on larger teams, as you yourself said. It's relatively easy to build something scalable from the beginning if you know what you need to build and if you are not already handling large amounts of traffic and data.

It's a whole different ballgame to build on top of an existing complex system already in production: one that was made to satisfy the needs at the time it was built, but now has to support new features, bug fixes, and the existing features at scale, all while 50+ engineers avoid stepping on each other and breaking each other's code in the process. 4 friends in a basement will not achieve more than 50+ engineers in this scenario, even accounting for the communication inefficiencies that come with so many minds working on the same thing.


GP said they have never worked on something that truly needed 50+ engineers. "Truly" being the keyword here, IMO.

I have worked on a 1000+ engineer project and another that was 500+, but I'm in the same boat as GP. Neither of those needed 50+, and the presence of the extra 950/450 caused several communication, organisational and architectural issues that became impossible to fix in the long term.

So I can definitely see where they're coming from.


I've long wondered what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance.

I'm also admittedly extremely curious what (broadly) had 1000 (and 500) engineers dedicated to it, when arguably only 50 were needed. Abstractly speaking that sounds a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process :/

(Am I describing the patently impossible? Not yet had experience in these types of environments)


> a lot like coordinational/planning micromanagement, where the manglement had final say on how much effort needed to be expended where instead of allowing engineering to own the resource allocation process

Yep, that's a fair assessment!

The 1000+ one was an ERP for mid-large businesses. They had 10 or so flagship products (all acquired) and wanted to consolidate them all into a single one. The failure was more in trying to join the 10 teams together (and including lots of field-only implementation consultants in the bunch), rather than picking a solid foundation that they already owned and handpicking what was needed.

The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors. People ended up owning parts of a screen, so something that was a "two-man in a sprint" ended up being a whole team. It was demoralising but I still like the company.

I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis. Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.


Yay, I'm learning :D

> trying to join the 10 teams together

oh no

(insert https://webcomicname.com/ here)

> rather than picking a solid foundation that they already owned and handpicking what needed.

Mmmm.

I wonder if a close alternative (notwithstanding lack of context to optimally calibrate ideas off of) might have involved leaving all the engineers alone to compare notes for 6-12 months with the singular top-down goal of "decide what components and teams do what best." That could be interesting... but it leans very heavily on preexisting competence, initiative and proactivity (not to mention conflict resolution >:D), and is probably a bit spherical-cow...

> The 500+ was an online marketplace. They had that many people because that was a condition imposed by investors.

*Constructs getaway vehicle in spare time* AAAAAaaaaaa

Sad engineering face :<

> I don't think it's impossible to notice, but it's hard... you can ask during interviews about numbers of employees, what each one does, ask for examples of what each team does on a daily basis.

Noted. Thanks.

> Honestly 100, 500, 1000 people for a company is not really a lot, but 100, 500, 1000 for a single project is definitely a red flag for me now, and anyone trying to pull the "but think of the scale!!!" card is a bullshit artist.

That makes a lot of sense, and also filed away.

Also, I recently read this which resonates quite strongly with the economy-of-efficiency scale problem (which I totally agree with): https://rachelbythebay.com/w/2022/01/26/swcbbs/, and the update, https://rachelbythebay.com/w/2022/01/27/scale/


> what I might be able to keep an eye out for during onboarding/transfer that would help me tell overstuffed kitchens apart from optimally-calibrated engineering caves from a distance

The biggest thing I've been able to correlate is command style: imperative vs declarative.

I.e. is management used to telling engineering how to do the work? Or communicating a desired end result and letting engineering figure it out?

I think fundamentally this is correlated with bloat vs lean because the kind of organizations that hire headcount thoughtlessly inevitably attempt to manage the chaos by pulling back more control into the PM role. Which consequently leads to imperative command styles: my boss tells me what to do, I tell you, you do it.

The quintessential quote from a call at a bad job was a manager saying "We definitely don't want to deliver anything they didn't ask for." This after having to cobble together 3/4 of the spec during the project, because so much functionality was missed.

Or in interview question form posed to the interviewer: "Describe how you're told what to build for a new project." and "Describe the process if you identify a new feature during implementation and want to pitch it for inclusion."


Of course. Wow, I never thought about management like that before. But particularly in software development it makes so much sense for people to jump toward this sort of mindset.

There really is an art to scaling problems to humans so the individual work (across management and engineering) falls within the sweet spot of cognitive saturation. TIL yet another dimension that can go sideways.

The signal to noise ratio is very appreciated.


Yeah, exactly. There is overhead simply because of the (necessary) cross-communication at that scale, and there's overhead from legacy support, but here's a thought experiment. Imagine that you've built the most perfect system from scratch that you can think of. Fast forward five years, and the business has pivoted so many times that the system is doing all sorts of stuff it just wasn't designed for, and it's creaky and old. It just doesn't fit right anymore and even you want to throw it away and build a new one. So you form a tiger team full of the smartest people you know to greenfield build a new one, from scratch, but that's gonna take two years to write. (You think, hey, maybe we could just take this open source thing and adapt it to our purposes. To which I say, where do you think large open source projects come from‽)

How do you bridge the two systems? You build an interim system. But customers want new features, so those features need to be done twice (bridge+new) if you're lucky, three times (existing+interim+new) if not. Could a smaller team of 10x engineers come in and do better? First off, thanks for insulting all of us, as if none of us are 10x-ers. But no. There's simply not enough hours in the day.

We've all heard of large IT projects that failed to land and said "of course". But we don't hear about the huge ones that do. And plenty of them do land, quite successfully, with these 200+ person teams where I, as an SRE, don't know the code for the system I'm supporting.

None of this is visible from the outside.


> I've seen plenty of really scalable systems being built by a small core team of people who know what they are doing.

There is a huge difference between building a system that could theoretically be scaled up and actually scaling it up efficiently.

At small scales, it's really easy to build on the work of others and take things for granted without even knowing where the scaling limits are. For example, if I suddenly find I need to double my data storage capacity, I can drive to a store and come back with a trunk full of hard drives the same day. I can only do that because someone already built the hard drives, and someone stocked the nearby stores with them. If a hyperscaler needs to double their capacity, they need to plan it well in advance, allocating a substantial fraction of global hard drive manufacturing capacity. They can't just assume someone would have already built the hardware, much less have it in stock near where it's needed.


Which FAANG is rewriting to another hip language and chasing trends (especially when it comes to infra services??)? I don't mean to be rude, but it doesn't sound like you are talking about any of the FAANGs, this sounds completely made up.



FAANG is an acronym for Facebook, Amazon, Apple, Netflix, Google. Uber isn't in the same ballpark as those companies (arguably Netflix isn't really in the same ballpark as the other four either...).


Heh, I wish they still looked the same. They added an order of magnitude of HTML and JS bloat while removing functionality.


Had that issue in my previous job.

Higher management decided to migrate our proprietary, vendor-locked platform from one cloud provider to another. The majority of the migration fell on a single platform team that was constantly struggling with attrition.

Unfortunately, neither I nor our architects were able to explain to the higher-ups that we needed a bigger team and overall way more resources to pull that off.

I hope that whoever comes after me will be able to make the miracle happen.


I usually move on to a different project/team/company when it gets to this. E.g. my new team builds a new product that grows like crazy and has its own set of challenges. I prefer to deliver immediate customer value vs. long-term work that is hard to sell and whose value is hard to project.


"That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user"

It seems to be the same story in the fields of infrastructure maintenance, aircraft design (Boeing MAX), and mortgage CDOs (2008). Was it always like this, or does the new management not care until something explodes?


A manufacturing company is designed from the ground up to work with machines, but it isn't the same with software. It's hard to understand that triple the data isn't just triple the servers but a totally different software stack, and that exponentially more complexity isn't just adding more factories, like in textiles.


There are still order-of-magnitude-change analogies to real-world processes, if people are willing to listen (which is the hard part). Use something that everybody can understand, like making pancakes or waffles or an omelet. Going from making 1 by hand, every 4 minutes at home for your family, to 1,000 pancakes per minute at a factory is obviously going to take a better system. You can scale horizontally, and do the equivalent of putting more VMs behind the load balancer, and hire 4,000+ people to cook, but you still need to have/make that load balancer in the first place for even that to work.

That's the tip of the iceberg when going from 1 per 4 minutes to 1,000 per minute though. How do you make and distribute enough batter for that system, and plating and serving all that is going to take a pub/sub bus, err, conveyor belt to support the cooks' output. Again though, you still gotta make that kafka queue, err, conveyor belt, plus the maintenance for that is going to take a team of people if you need the conveyor belt to operate 24/7/52. If your standards are so high that the system can never go down for more than 52.6 minutes per year or 13.15 minutes per quarter, then that team needs to consist of highly-trained and smart (read: expensive) people to call when the system breaks in the middle of the night.


You had problems with management of a cloud based api and executive visibility… so you bought a set of data centers to handle 500mio req per month?

The visibility you will get after the capex when there’s a truly disastrous outage will be interesting.


Hmm that’s only 190Hz on average, but we don’t know what kind of search engine it is. For example if he’s doing ML inference for every query, it would make perfect sense to get a few cabinets at a data center. I’ve done so for a much smaller project that only needs 4 GPUs and saved a ton of money.
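(A rough check of that figure, assuming a 30-day month: 500,000,000 requests / (30 x 24 x 3,600 s) ≈ 193 requests per second, i.e. roughly 190 "Hz" on average.)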


Nah, it's text-only requests returning JSON arrays of which newspaper article URLs mention which influencer or brand name keyword.

The biggest hardware cost driver is that you need insane amounts of RAM so that you can mmap the bloom hash for the mapping from word_id to document_ids.


You could have used a sharded database like Mongo. Just throw up 10 shards, use "source" (influencer or brand name) as shard key?


Yes, I could have used Mongo, but it would have been 100x to 1000x slower than an mmap-ed look up table.


Why ever use mmap instead of sharded inverted indices of word-doc here, a la elasticsearch?


Yeah the question is what level of performance you need I guess... was hoping you could clarify :)


But you don't actually need that level of performance? You've made this system more complex and expensive to achieve a requirement that doesn't matter?


you seem to have a deeper knowledge of the business & organisational context that dictate the true requirements than someone working there. please share these details so we can all learn!


Sure: the network request time of a person making a request over the open internet is going to be an order of magnitude longer than a DB lookup (in the right style, with a reverse-index) on the scale of data this person is describing. So making the lookup 10x faster saves you...1% of the request latency.

And at the qps they've described, it's not a throughput issue either. So I'm pretty confident in saying that this is a case of premature optimization.

And at some point the increase in parallelization of scans dominates mmap speed, unless you're redundantly sharding your mmaped hash table across multiple machines. And there are cases where network bandwidth is the bottleneck before disk bandwidth, though probably not this case. But yeah basically, the answer is something like "if this is the optimal choice, it probably didn't matter that much".


This reads to me as if you have never really used mmap in a dedicated C/C++ application. Just to give you a data point, looking up one word_id in the LUT and reading 20 document_ids from it takes on average 0.0000015 ms.

So if that alternative database takes on average 0.1ms per index read, then it's starting out roughly 65000x slower.

"than a DB lookup (in the right style, with a reverse-index)"

Unless, of course, you're managing petabytes of data ;)

"at the qps they've described, it's not a throughput issue either"

It's mostly a cost thing. If a single request takes 2x the time, that's also a 2x on the hosting bill.

"parallelization of scans dominates mmap speed"

Yes, eventually that might happen. Roughly when you have 100000 servers. But before that your 10gbit/s node-to-node link will saturate. Oops.


> Unless, of course, you're managing petabytes of data ;)

Are...are you saying that you've purchased petabyte(s) of RAM, and that that multi-million dollar investment is somehow cheaper than...well really anything else?

> But before that your 10gbit/s node-to-node link will saturate. Oops.

Only if you're returning dense results, which it sounds like you aren't (and there are ways to address this anyhow), which is why I said the issue of saturating network before disk probably wasn't an issue for you ;)


No, of course I have a tiered architecture. HDDs + SSDs + RAM. By mmap-ing the file, the Linux kernel will make sure that whatever data I access is in RAM and it'll do best-effort pre-reading and caching, which works very well.

BTW, this is precisely how "real databases" also handle their storage IO internally. So all of the performance cost I have to pay here, they have to pay, too.

But the key difference is that with a regular database and indices, the database needs to be able to handle read and write loads, which leads to all sorts of undesirable trade-offs for their indices. I can use a mathematically perfect index if I split dataset generation off of dataset hosting.

It's really quite difficult to explain, so I'll just redirect you to the algorithms. A regular database will typically use a B-tree index, which is O(log(N)). I'm using a direct hash bucket look-up, which is O(1).

For a mental model, you can think of "mmap" as "all the results are already in RAM, you just need to read the correct variable". There is no network connection, no SQL parsing, no query planning, no index scan, no data retrieval. All those steps would just consume unnecessary RAM bandwidth and CPU usage. So where a proper DB needs 1000+ CPU cycles, I might get away with just 1.
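If it helps to see it concretely, here is a minimal C++ sketch of that kind of mmap-ed lookup. The file layout (a flat array of fixed-size buckets of document_ids, indexed by word_id modulo the bucket count) and all names are simplified assumptions for illustration, not my actual on-disk format:

    // Sketch: O(1) lookup of document_ids for a word_id in a read-only,
    // mmap-ed lookup table. Assumed layout: a flat array of fixed-size buckets.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    constexpr std::size_t kDocsPerBucket = 20;
    struct Bucket { std::uint64_t doc_ids[kDocsPerBucket]; };

    int main() {
        const char* path = "word_to_docs.lut";  // hypothetical pre-built index file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        std::size_t num_buckets = st.st_size / sizeof(Bucket);
        if (num_buckets == 0) { std::fprintf(stderr, "empty index\n"); return 1; }

        // Map the whole file; the kernel pages the hot parts into RAM and
        // does best-effort read-ahead and caching.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        const Bucket* table = static_cast<const Bucket*>(base);

        // The lookup itself: no connection, no query parsing, no B-tree
        // descent, just one array index and a read.
        std::uint64_t word_id = 123456;
        const Bucket& b = table[word_id % num_buckets];
        for (std::size_t i = 0; i < kDocsPerBucket && b.doc_ids[i] != 0; ++i)
            std::printf("doc %llu\n", (unsigned long long)b.doc_ids[i]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

A real version needs to deal with collisions and variable-length buckets, but the hot path really is just "hash, index, read".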


No modern DB uses mmap because it's unreliable and hard to tune for performance.

A custom cache manager will always perform better than mmap provided by the kernel.

The problem is you haven't explained how the overhead of a DB is too much. Sure, it sounds like a lot of work for your servers and the DB compared to reading from a hashmap.

Where I work right now we fire around 1.5B queries a day... to Mongo.


And you have your unreliable, inconsistent, unscalable system. That apparently goes down all the time.

Not using ES here is actually nuts.


Are you managing petabytes of data though?

What kind of servers are you running? What's your max QPS?

The fact is, with your mmap impl. you probably use RAM + virtual memory, and have more RAM than needed to compensate for the fact that you don't keep just the most-used keys in memory, which a DB will do for you.

Point is if you have petabytes of data and access patterns only mean you access a subset of it, even Mongo might be cheaper to run.


Just FYI, MongoDB storage also uses mmap internally.

So we are comparing here "just mmap" with "mmap + all that connection handling, query parsing, JSON formatting, buffering, indexing, whatever stuff that MongoDb does".

And no, MongoDB is effectively never a cheap solution. They are used because they are super convenient to work with, with all things being JSON documents. But all that conversion to and from JSON comes at a price. It'll eat up 1000s of CPU cycles just to read a single document. With raw mmap, you could read 1000s of documents instead.


MongoDB uses the Wired Tiger storage engine internally. The MMAP storage engine was removed from MongoDB in V4.2 which was released in March 2020. The MMAP engine was deprecated two years previously.

In MongoDB, conversion between raw JSON and BSON (Binary JSON) is done on the client (aka the driver), so server cycles are not consumed.


And 2, you're looking past the point. Any DB would work fine for this use case. If you wanted sharding, there's Vitess for MySQL, for example.


As another already said, Mongo doesn't use mmap anymore.

Mongo doesn't convert to and from JSON. The driver uses a binary protocol.


As a security guy I HATE the loss of visibility in going to the cloud. Can you duplicate it? Sure. Still not as easily as spanning a trunk and you still have to trust what you’re seeing to an extent.


The visibility I was mentioning in the parent comment was visibility from executives in your business, but I can see how it would be confusing.

There are tradeoffs — cloud removes much of the physical security risks and gives you tools to help automate incident detection. Things like serverless functions let you build out security scaffolding pretty easily.

But in exchange you do have to give some trust. And I totally understand resistance there.


> cloud removes much of the physical security risks

Doesn't cloud increase the physical security risks, rather than decrease/remove?


You might be surprised. The performance equivalent of $100k monthly in EC2 spend fits into a 16m2 cage with 52HU racks.


Which costs you more than $100k monthly to operate with the same level of manageability and reliability.

We don't use AWS because our use cases don't require that level of reliability and we simply cannot afford it, but if I were running a company that depends on IT and generates enough revenue... I probably wouldn't argue about the AWS bill. For now, prepaid Hetzner + in-house works well enough, but I know what I cannot offer my users at the click of a button!


This is a religious debate among many. The IT/engineering nerd stuff doesn’t matter at all. Cloud migration decisions are always made by accounting and tax factors.

I run two critical apps, one on-prem and one cloud. There is no difference in people cost, and the cloud service costs about 20% more on the infrastructure side. We went cloud because customer uptake was unknown and making capital investments didn’t make sense.

I’ve had a few scenarios where we’ve moved workloads from cloud to on-prem and reverse. These things are tools and it doesn’t pay to be dogmatic.


> These things are tools and it doesn’t pay to be dogmatic.

I wish I would hear this line more often.

So many things today are (pseudo-)religious now. The right framework/language, cloud or on-prem, x vs not x.

It's especially bad, imho, when somebody tries to tell you how you could do better with 'not x' instead of the x you are currently using, without even trying to understand the context the decision resides in.

[Edit] typo


> So many things today are (pseudo-)religious now. The right framework/language, cloud or on-prem, x vs not x.

Might have always been that way? We just have so many more tools to argue over now.


that cage is a liability, not an asset. How is the networking in that rack? What's its connection to large-scale storage (IE, petabytes, since that's what I work with). What happens if a meteor hits the cage? Etc.


That depends on what contracts you have. You could have multiple of these cages in different locations. Also, 1 PB is only 56 large enterprise HDDs. So you just put storage into the cage, too.

But my point wasn't about how precisely the hardware is managed. My point was that with a large cloud, a mid-sized company has effectively NO SUPPORT. So anything that gives you more control is an improvement.


"1 pb is only 56 large enterprise hdds".

umm, what happens when one fails?

With large cloud my startup had excellent support. We negotiated a contract. That's how it works.


Typically people use RAID or ZFS to prevent data loss when a few hdds fail.


OK, so basically you're in a completely different class of expectations than me about how systems perform under disk loss and heavy load. A drive array is very different from large-scale cloud storage.


Hard to say. My impression is:

- A large ZFS pool of SSDs is much faster than any cloud storage.

- Cloud storage failed much more often than the SSDs in our pool.

- "Noisy neighbor" is an issue on the cloud


This cracked me up. Thanks fxtentacle :D.


Of course, the reason that's wrong is that if a drive can fail, you don't get a full 1 PB storage system out of 56 drives; you get something smaller because of redundancy.

That redundancy, and the performance that scales due to it, place cloud services in an entirely different class from on prem servers.


>I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.

If the business can't afford to have downtime then they should be paying for enterprise support. You'll be able to connect to someone in < 10 mins and have dedicated individuals you can reach out to.


You never hosted on AWS, did you?


In the two years I worked on serverless AWS I filed four support tickets. Three out of those four I came up with the solution or fix on my own before support could find a solution. The other ticket was still open when I left the company. But the best part is when support wanted to know how I resolved the issues. I always asked how much they were going to pay me for that information.


>You never hosted on AWS, did you?

Previously 2k employee company, with the entire advertising back office on AWS.

Currently >$1M/yr at AWS; you can get an idea of the scale & what is running here: https://www.youtube.com/playlist?list=PLf-67McbxkT6iduMWoUsh...


Enterprise Support never disappointed me so far. Maybe not <10 minute response time, but we never felt left alone during an outage. But I guess this is also highly region/geo dependent.


>"they should be paying for enterprise support"

This sounds a bit arrogant. I think they found a better and overall cheaper solution.


>This sounds a bit arrogant.

The parent thread talks about how the business could not go down even with a triple AZ outage for S3, and I don't think it is arrogant to state they should be paying for enterprise support if that level of expectation is set.

>I think they found a better and overall cheaper solution.

A cheaper solution does not just include the cost but also the time. For the time, we need to look at what they spent, regardless of department, to acquire the hosting company, migrate off of AWS, modify the code to work on their private cloud, etc. I'd believe it if they're willing to say they did this, have been running for three years, and have compiled the numbers in Excel. It is common, if you ask internally whether it was worth it, to get a yes, because people put their careers on it and want to have a "successful" project.

The math doesn't work out in my experience with past clients. The scenarios that do work out are: top 30 in the entire tech industry, significant GPU training, heavy egress bandwidth (CDN, video, assets), or businesses that are basically selling the infrastructure itself (think Dropbox, Backblaze, etc.).

I'm sure someone will throw down some post where their cost $x is less than $y at AWS, but that is such a tiny portion of the picture that if the saving is not >50% it isn't even worth looking at the rest of the math. The absolute total cost of ownership is much harder to work out than most clickbait articles are willing to go into. I have not seen any developers talk about how it changes the income statement & balance sheet, which can affect total net income and how much the company will lose just to taxes. One argument assumes that it evens out after the full amortization period in the end.

Here are just a handful of factors that get overlooked: supply chain delays, migration time, access to expertise, retaining staff, churn increase due to pager/on-call rotation, the opportunity cost of capital sitting in idle/spare inventory, and plenty more.


Back then, it was enough to saturate the S3 metadata node for your bucket and then all AZs would be unable to service GET requests.

And yes, this won't be financially useful in every situation. But if the goal is to gain operational control, it's worthwhile nonetheless. That said, for a high-traffic API, you're paying through the nose for AWS egress bandwidth, so it is one of those cases where it also very much makes financial sense.


Same fxtentacle as the CTO of ImageRights? If that is the case, my follow-up question is: did you actually move everything out of AWS? Or did you just take the same approach as Netflix with Open Connect, i.e. 95th-percentile billing + unmetered connections & peering with ISPs to reduce costs?


So you're basically saying that no matter what, one should always stick to Amazon. I have my own experience that tells exactly the opposite. To each their own. We do not have to agree.


>So you're basically saying that no matter what, one should always stick to Amazon.

What I am saying is: given the list of exceptions I gave, the business should run/colocate their own gear if they fall into one of those exceptions, or the components that fall into them should be moved out.

>I have my own experience that tells exactly the opposite.

You begin using AWS for your first day ever and on that day it has a tri AZ outage for S3. In this example the experience with AWS has been terrible. Zooming out though over 5 years it wouldn't look like a terrible experience at all considering outages are limited and honestly not that frequent.


>"You begin using AWS for your first day ever"

I am not talking about outages here. Bad things can happen. It's more about the price.


I don't read that as arrogant. The full statement is:

> If the business can't afford to have downtime then they should be paying for enterprise support.

It's simply stating that it's either cheaper for the business to have downtime, or it's cheaper to pay for premium support. Each business owner evaluates which it is for them.

If you absolutely can't afford downtime, chances are premium support will be cheaper.


@fxtentacle, I was curious which private search engine this is for. Is the system you are describing ImageRights.com?


No, ImageRights handles far more requests, and it's mostly images. Also, at ImageRights I don't have management above me that I would need to convince :)

This one is text-only and used by influencers and brands to check which newspapers report about their events. As I said, it's internally used by a few partner companies who buy the API from my client and sell news alerts to their clients.

BTW, I'm hoping to one day build something similar as an open source search engine where people pay for the data generation and then effectively run their own ad-free Google clone, but so far interest has been _very_ low:

https://news.ycombinator.com/item?id=30374611 (1 upvote)

https://news.ycombinator.com/item?id=30361385 (5 upvotes)

EDIT: Out of curiosity I just checked and found my intuition wrong. The ImageRights API averages 316rps = 819mio requests per month. So it's not that much bigger.


If you rely on public cloud infrastructure, you should understand both the advantages and disadvantages. Seems like your company forgot about the disadvantages.


What I read here was "Cloud is hard, so I took on even more responsibility"


What you should read is: At the monthly spend of a mid-sized company, it is impossible to get phone support from any public cloud provider.


What are you using for aws alternatives? Example for S3?


>What are you using for aws alternatives? Example for S3?

Not OP but they're probably using Rook/Minio


docker + self-developed image management + CEPH


Care to share uptime metrics on AWS vs your own servers?


That wouldn't be much help because the AWS and Heroku metrics are always green, no matter what. If you can't push updates to production, they count that as a developer-only outage and do not deduct it from their reported uptime.

For me, the most important metric would be the time that my team and I spent fixing issues. And that went down significantly. After a year of everyone feeling burned out, now people can take extended vacations again.

One big issue for example was the connectivity between EC2 servers degrading, so that instead of the usual 1gbit/s they would only get 10mbit/s. It's not quite an outage, but it makes things painfully slow and that sluggishness is visible for end users. Getting reliable network speeds is much easier if all the servers are in the same physical room.


What do you find exhausting?

One anti-pattern I've found is that most orgs ask a single team to handle on-call around the clock for their service. This rarely scales well, from a human standpoint. If you're getting paged at 2:00 in the morning on a regular basis you will start to resent it. There's not much you can do about that so long as only one team is responsible for uptime 24/7.

The solution is to hire operations teams globally, and then set up follow-the-sun operations whereby the people being paged are always naturally awake at that hour, allowing them to work normal eight hour shifts. But this requires companies to, gasp, have specialized developers and specialized operators collaborate before allowing new feature work into production, to ensure that the operations teams understand what the services are supposed to do and keep it all online. It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.

So naturally, many orgs would prefer to burn out their engineers instead.


I would respectfully say that you are wrong. I speak from experience. At Netflix we tried to hire for around the clock coverage. But what ended up working much better was taking that same team and having each person on call for a week at a time, all based in Pacific Time.

Yes, you would get calls at 2am, sometimes multiple days in a row. But you were only on call once every six to eight weeks, and we scheduled out well in advance so you could plan your life accordingly.

As a bonus, for the five weeks you weren't on call, you were highly incentivized (and had the time) to build tools or submit patches to fix the problems that woke you at 2am.

> It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.

I disagree with this too. Documentation and runbooks are useless in an outage. Instead of runbooks, write code to do the thing. Instead of documentation, comment the code and build automation to make the documentation unnecessary, or at least surface the right information automatically if you can't automate it.


This is the same approach as night shifts for nurses.

There's a lot of evidence to suggest that this infrequent but consistent disturbance to their circadian rhythms causes all kinds of physiological damage. One example: [1]. We have to do better. I think the original suggestion of finding specialised night workers or those in other timezones is more humane.

[1] https://blogs.cdc.gov/niosh-science-blog/2021/04/27/nightshi...


That article is about night shift work, not day shift work that occasionally makes you work an hour or two at night every six weeks.


Here is a reference that is a bit more applicable to the on-call experience. There is a tangible human cost to after-hours responses during an on-call rotation. I personally do not recommend on-call roles to any technology professional who can avoid them, due to these health consequences of an on-call requirement.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5449130/

> Sleep plays a vital role in brain function and systemic physiology across many body systems. Problems with sleep are widely prevalent and include deficits in quantity and quality of sleep; sleep problems that impact the continuity of sleep are collectively referred to as sleep disruptions. Numerous factors contribute to sleep disruption, ranging from lifestyle and environmental factors to sleep disorders and other medical conditions. Sleep disruptions have substantial adverse short- and long-term health consequences. A literature search was conducted to provide a nonsystematic review of these health consequences (this review was designed to be nonsystematic to better focus on the topics of interest due to the myriad parameters affected by sleep). Sleep disruption is associated with increased activity of the sympathetic nervous system and hypothalamic–pituitary–adrenal axis, metabolic effects, changes in circadian rhythms, and proinflammatory responses. In otherwise healthy adults, short-term consequences of sleep disruption include increased stress responsivity, somatic pain, reduced quality of life, emotional distress and mood disorders, and cognitive, memory, and performance deficits. For adolescents, psychosocial health, school performance, and risk-taking behaviors are impacted by sleep disruption. Behavioral problems and cognitive functioning are associated with sleep disruption in children. Long-term consequences of sleep disruption in otherwise healthy individuals include hypertension, dyslipidemia, cardiovascular disease, weight-related issues, metabolic syndrome, type 2 diabetes mellitus, and colorectal cancer. All-cause mortality is also increased in men with sleep disturbances. For those with underlying medical conditions, sleep disruption may diminish the health-related quality of life of children and adolescents and may worsen the severity of common gastrointestinal disorders. As a result of the potential consequences of sleep disruption, health care professionals should be cognizant of how managing underlying medical conditions may help to optimize sleep continuity and consider prescribing interventions that minimize sleep disruption.


> But what ended up working much better was taking that same team and having each person on call for a week at a time, all based in Pacific Time.

Our support team does the same, and they seem to be quite happy with it. They also get the following Friday off (in addition to compensation).

They do their best to shield us developers from after-hour calls, usually one can get things moving enough that it can be handled properly in the morning.


Even as a dedicated operations team for a product, we did this too. On call person worked tickets and took calls for one week at a time, the rest of the team worked on ways to make on-call suck less. For an eight person team it worked well for about three years until bigger stuff happened in the org and we all parted ways.


I agree with you completely, especially on the last paragraph. No pain - no gain.


> you were highly incentivized (and had the time) to build tools or submit patches to fix the problems that woke you at 2am.

Ah, so you worked on a team where the SRE needs were prioritized over the feature requests? Because in most companies where I've worked, Product + Customer Service + Sales + Marketing + Executives don't really have time or patience for the engineers to get their diamond polishing cloths out. They want to see feature development. They're willing to be forced to prioritize exactly which feature they'll get soonest, and they understand that engineering needs time to keep the systems running, but in most businesses I've worked, the business comes first.

> Documentation and runbooks are useless in an outage. Instead of runbooks, write code to do the thing. Instead of documentation, comment the code and build automation to make the documentation unnecessary

We do that too. If you could write code to Solve All The Problems then you'd never need to page a human in the first place ;)

I'll give you a simple example of where you can't write code to solve this sort of thing. Let's say that you have an autoscaler that will scale your server group up to X servers. You define an alert to page you if the autoscaler hits the maximum. The page goes off. Do you really want to write code to arbitrarily increase the autoscaler maximum whenever it hits the maximum? Why do you have the maximum in the first place? The entire reason why the autoscaler maximum exists is to prevent cost overruns from autoscaling run amok. You want a human being, not code, to look at the autoscaler and make the decision. Do you have steady-slow growth up to the maximum? Maybe it should be raised, if it represents natural growth. Maybe it shouldn't, if you just raised it last week and it shouldn't be anywhere near this busy. Do you have hockey-stick growth? Maybe the maximum is working as expected, looks like a resource leak hit production. Or maybe you have a massive traffic hit and you actually do want to increase the maximum. Maybe you'd prefer to take the outage from the traffic hit, let the 429s cool everyone off. But good luck trying to write code to handle that automatically, and correctly for you!

> or at least surface the right information automatically if you can't automate it.

Ah, well, that's exactly what the dedicated operations staff are doing, because when you have three follow-the-sun teams, you need standards, not three sets of people who each somehow telepathically share the same tribal knowledge?

Don't get me wrong, I'm not anti-automation or something. If your operations folks are click-clicking in consoles all day long, the same click-clicking every day, probably something's wrong. But the SRE model asks for operations automation to stick within operations teams, not development teams.


> Ah, so you worked on a team where the SRE needs were prioritized over the feature requests?

Yes, it was an SRE team. All we do is write tools to make operations better, but more importantly we write tools to make it easier for the dev teams to operate their own systems better. But yes, we had product teams that would push back on our requests because they had product to deliver, and that was fine. We'd either figure out how to do the work for them, or figure out a workaround.

> We do that too. If you could write code to Solve All The Problems then you'd never need to page a human in the first place ;)

Well yes, that's the idea. You can't get to 5 9s of reliability unless it's all automated. :)

> I'll give you a simple example of where you can't write code to solve this sort of thing.

I could easily write code to solve the thing. Step one, double the limit to alleviate immediate customer pain. Step two, page someone to wake up and look at the graphs and figure out what the better medium term solution is to get us through until the morning, including links to said relevant graphs.

You're not gonna have a cost overrun doubling the limit for one night. And if there is a big problem, the person will get paged again a few hours later and have more information to make a better decision.
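To make those two steps concrete, here's a rough sketch of what such a handler could look like. Every function in it is a hypothetical stub standing in for whatever autoscaler and paging APIs you actually use (no real SDK calls here); the point is just that "double the limit, then page a human with context" is a small amount of code:

    #include <cstdio>
    #include <string>

    // Hypothetical stubs; in reality these would wrap your cloud's autoscaler
    // API and your paging provider. The "state" is faked for the demo.
    static int g_current_max = 50;
    int get_autoscaler_max(const std::string&) { return g_current_max; }
    void set_autoscaler_max(const std::string&, int value) { g_current_max = value; }
    void page_oncall(const std::string& msg) { std::printf("PAGE: %s\n", msg.c_str()); }

    // Handler for an "autoscaler for <group> is at its configured maximum" alert.
    void on_autoscaler_max_reached(const std::string& group) {
        // Step one: relieve immediate customer pain by doubling the ceiling.
        int old_max = get_autoscaler_max(group);
        set_autoscaler_max(group, old_max * 2);

        // Step two: still page a human, with time already bought and context
        // attached, so they can make the medium-term call.
        page_oncall("Autoscaler for " + group + " hit its max of " +
                    std::to_string(old_max) + "; temporarily raised to " +
                    std::to_string(old_max * 2) + ". Check the growth graphs.");
    }

    int main() { on_autoscaler_max_reached("api-frontend"); }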

> But the SRE model asks for operations automation to stick within operations teams, not development teams.

Yes, but I'm not sure I see why that's bad. I don't see any purpose for a dedicated operations team, especially a follow the sun team. If you're Google and you already have offices all around the world, sure, it will be better. But it makes no sense to hire an around the world team just for operations if the rest of your company is in one time zone.


> Yes, it was an SRE team. All we do is write tools to make operations better

Go back to my original comment. If you're an SRE team, then basically, you're the operations team for the developers. I'm talking about where developers are responsible for their own operations and there is no team that gets paged instead of them - "most orgs ask a single team to handle on-call around the clock for their service."

> Step one, double the limit to alleviate immediate customer pain. Step two, page someone to wake up

See, what I read from this is: a) violate my system efficiency KPIs while b) paging someone in the middle of the night anyway. So, lose-lose.

> But it makes no sense to hire an around the world team just for operations if the rest of your company is in one time zone.

Why does it make any more sense to hire developers remotely who are in your time zone ± three hours? Because that's what most companies are doing these days. If you're already hiring people remotely then you can hire Operations/SRE staff a little further afield and see that as a benefit (follow the sun) rather than a problem.

> the rest of your company is in one time zone.

For what it's worth, we also hire salespeople around the globe :) Fact of the matter is, it would be so, so nice for Slack to turn off the ability to @channel in the #random channel so that people who are asleep don't get pinged ...


We were an SRE team building tools for the development teams who got paged in the middle of the night. The devs writing the services were operating their own services and were getting paged. We would sometimes also get paged for a serious incident so we could coordinate if multiple development teams were involved.

Each team managed their own rotation schedules, we just made sure they had one.

> See, what I read from this is: a) violate my system efficiency KPI

If you're being graded on your system efficiency and not customer satisfaction, well then sure, your way might make sense (but I'd still say it doesn't). But your business will suffer if you optimize for efficiency over customer satisfaction.

> Why does it make any more sense to hire developers remotely who are in your time zone ± three hours?

Because it's a lot easier to run a team where everyone on the team can meet at the same time. If you have an around the world team, there is no time of day where you can have a meeting and everyone gets to attend during their workday. Realistically you can maybe get away with a nine hour time difference. Any more than that and you have people excluded.

Especially if the bulk of your devs are in one or two time zones, your operators will be even more disconnected from them since they will never be able to interact with the devs, and the devs will have no empathy for the operators who they also never interact with.

> For what it's worth, we also hire salespeople around the globe

Sure, but they aren't writing code that your operators have to run. :)

I think we both agree that it's better for devs to get paged for their services instead of operators, and if that's the case, its far better for all the devs to work together and know each other and be in the same or nearly same time zone.

A follow the sun model breaks that completely.


> But your business will suffer if you optimize for efficiency over customer satisfaction.

But who are the customers? Business, engineering, or finance? :)

> it's a lot easier to run a team where everyone on the team can meet at the same time.

Of course it's easier. It's also easier not to maintain documentation or standards, just be a five person startup and have everyone be in the same room. Enterprise communication is hard! Even when you're in the same time zone. The question isn't "how do I get my life to be a utopia?" but "which challenges should I choose?". If you run an organization, you need to put your employees first, even ahead of your customers. Employees and customers both come and go, but 80% of the time the effect of a valued employee leaving is far worse than a customer leaving, and you have far more control over whether employees leave than whether customers do. So you can either put your employees first (build a calm workplace) or you can put your customers first (prioritize feature development velocity in organizational design).

> I think we both agree that it's better for devs to get paged for their services instead of operators

No! Dev should never be paged! If I "buy" Jenkins off-the-shelf, and it breaks down in production, guess what, I don't get to page the Jenkins developers! Why should internally developed services be any different? If Ops needs to page someone from Dev instead of waiting for a response at normal business cadence, then this is an Ops failure, not a Dev failure!


> But who are the customers? Business, engineering, or finance? :)

The business's customers. The ones who pay your company so they can pay you, and your reason for having a job at all.

> Why should internally developed services be any different?

Because they're your core competency and you have control over it. If you could page the Jenkins developers you probably wouldn't hesitate to do it, because you'll get better results. Why not get the best results you can from an internal service?

> If Ops needs to page someone from Dev instead of waiting for a response at normal business cadence, then this is an Ops failure, not a Dev failure!

I couldn't disagree more. That is absolutely a dev failure -- they wrote a service that couldn't operate under the conditions given. It's either a bug or an architecture issue, but no matter what, it's a dev issue and the dev should be responsible for building a service that can actually run in production.

You and I have very different ideas of a successful engineering organization. I would never want to work for your org as an operator or a dev. As an operator the last thing I want is devs to throw whatever they write over the wall and then say "not my problem anymore!", and have to rely on getting retrained every time the code changes. And as a dev I wouldn't want to be in an organization that accepts sloppy developers who aren't responsible for building solid code that can run under adverse conditions and who don't get to experience the issues in production for themselves.

Facebook makes their devs get paged, Netflix does, Amazon pages their devs, Dropbox pages devs, Stripe pages devs, and Google pages their devs too until they have demonstrated multiple quarters of success, and only then does an operator take over. And if the service has too many failures, support falls back on the devs until they can make the service stable again.

Making devs responsible for creating code that actually works well in production is a good thing.


> As an operator the last thing I want is devs to throw whatever they write over the wall and then say "not my problem anymore!", and have to rely on getting retrained every time the code changes. And as a dev I wouldn't want to be in an organization that accepts sloppy developers who aren't responsible for building solid code that can run under adverse conditions and who don't get to experience the issues in production for themselves.

How can you classify anybody who writes on-prem software as being a "sloppy developer"? Jira, Jenkins, GitLab, pretty much any database you can imagine (MySQL, PostgreSQL, Redis, Elasticsearch, Kafka...), Grafana, any Linux distribution, they're all written by "sloppy developers"?

Where did I say that Dev gets to "throw code over the wall"? How would you feel if I unilaterally decided for you, as a developer, which tools you get to use? If I came up with some policy that the whole organization can only run Windows machines and I "threw that policy over the wall" at you?

You're arguing against a strawman that's completely inconsistent with how harmonious follow-the-sun Ops actually works.


I would not call follow-the-sun ops harmonious. If anything I'd call it adversarial. Ops is always trying to blame dev for outages and dev is always trying to blame ops. Each accuses the other of not sharing all the necessary information.

Look at all that on-prem software you just mentioned. The developers of every one of those complain that they need better bug reports, and the people who operate them complain they need better documentation. Things would be much better if those devs worked directly for every company that uses them, and in fact in a lot of cases one of the contributors is an operator at a company. Why do you think companies like to hire open source devs? To get better access to someone who knows the codebase!

It's far better if the operator is the developer. Sometimes we live with that not being the case because the software is made by others. But when given the choice, I will always opt for the dev running the software themselves.


> Step one, double the limit to alleviate immediate customer pain.

I've been oncall for systems where that would not work.

Doubling the memory means you need twice as many machines. Depending on the service, that could require significantly increased network bandwidth. Now the network is saturated and every node needs to queue more data. Now latency and throughput are even worse, and even more requests are being dropped, so you automatically double the limit again...


While that may all be true (though those are indications of a poorly architected system), my code would still work. It would double the limit and then page someone. If they logged in and saw all those failures, then they could address those issues.
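
For concreteness, a rough sketch of that "double the limit, then page" behavior (Python; the service and pager objects here are hypothetical, not anyone's actual tooling):

    # Sketch of "auto-remediate once, then hand the decision to a human".
    def on_limit_exceeded(service, pager, hard_cap):
        old_limit = service.current_limit
        new_limit = min(old_limit * 2, hard_cap)   # cap so automation can't double forever
        service.set_limit(new_limit)
        pager.page(f"{service.name}: limit auto-raised {old_limit} -> {new_limit}; "
                   "please investigate the underlying failures")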

The whole point is that having an around the world follow the sun team would not alleviate those issues or make anything better.


> You want a human being, not code, to look at the autoscaler and make the decision.

Should this decision happen at 2am? Can it wait until 10am?


This. Absolutely this. Working on large distributed systems can be both exhilarating and exhausting. The two often go hand in hand. However, working on such systems without diligence tips the scales toward exhausting. If your testing and your documentation and your communication (both internal and with consumers) suck, you're in for a world of pain.

"But writing documentation is a waste of time because the code evolves so fast."

Yeah, I hear that, but there's also a lot of time lost to people harried during their on-call and still exhausted for a week afterward, to training new people because the old ones burned out or just left for greener pastures, to maintaining old failed experiments because customers (perhaps at your insistence) still rely on them and backing them out would be almost as much work as adding them was, and so on.

That's not really moving fast. That's just flailing. You can actually go further faster if you maintain a bit of discipline. Yes, there will still be some "wasted" time, but it'll be a bounded, controlled waste like the ablative tiles on a re-entry vehicle - not the uncontrolled explosion of complexity and effort that seems common in many of the younger orgs building/maintaining such systems nowadays.


> That's not really moving fast. That's just flailing.

Yes, a million times yes. This is moving me. Where do I find a team that understands this wisdom?


The solution to getting paged a lot at off hours is rarely to hire additional teams to cover those times for you, at least not long term. For things you can control, you should fix the root causes of those issues. For things you can't control, you should spend effort on bringing them within your control (e.g. architecture improvements). This takes time, so a follow-the-sun rotation might be a stopgap solution, but you need to make sure it doesn't paper over the real problems without them ever getting better.


From experience, it's really hard to fix the root causes of issues when you were woken up three times the night before and had two more of the same incident occur during the workday. In my case I struggled along for a couple years but the best thing to do was just leave and let it be someone else's problem.


Best thing for what? Surely not software quality and customer satisfaction.


If they cared about that they would either pay me so much money I'd be insane to walk away or they would hire people in other time zones to cover the load. Instead they chose to pay for their customer satisfaction with my burnout. The thing about that strategy is... eventually the thing holding their customer satisfaction together gets burnt out. So I leave. And even then they're still getting the better half of the bargain.


Sorry, I accidentally implied you did the wrong thing by leaving. That wasn't my intention. Of course, leaving was the right choice for you.

What I meant was the company you were working for does not get the best quality or customer satisfaction by overworking you to the point where you have to leave. It would have been better for their software quality to handle things differently.


I don’t think this is a stable long term solution. The “on call” teams end up frustrated with the engineers who ship bugs and this results in added process that delays deploys, arbitrary demands for test coverage, capricious error budgets, etc. It’s much better to have the engineers who wrote the code be responsible for running it, and if their operational burden becomes too high, to staff up the dev team to empower them to go after root causes. Plus the engineers who wrote the code always have better context than reliability people who tend to be systems experts but lack the business logic intuition to spot errors at a glance.


I don't think the parent was implying you're never on call for your code, just only on call during working hours.

One of the challenges for larger companies in trying to make teams on-call 24/7 is that your most senior engineers often have enough money that they don't have to take on-call. Some variation of the following conversation happens in Big Tech more than most people seem to anticipate:

"hey, so I have 7 mil in the bank, a house, and kids; so I'm not taking on-call anymore"

"I understand on-call is a burden, but the practice is a big part of how we maintain operational excellence"

"Alright, I quit"

"Woah woah woah, uh, ok, what about we work on transitioning you out of on call over the next 6 months?"

"Nah, I'm done"

"This is going to be really disruptive to the team!"

"Yeah man it sucks, I really feel for you"

My understanding is a few famous outages at large cloud providers are a direct result of management not anticipating these conversations and assuming 24/7 on-call from a single geographically centered team of high powered engineers was sustainable.


> The “on call” teams end up frustrated with the engineers who ship bugs and this results in added process that delays deploys, arbitrary demands for test coverage, capricious error budgets, etc.

This is poor operations culture. Software is no different from industrial manufacturing. You QA before you ship product to customers and you QA your raw materials before you start to process them. Operations is responsible for catching show-stopper bugs before they hit production. This means that operations is responsible for pushing to staging, not developers; operations stakeholders need to be looped into feature planning to ensure that feature work will easily integrate into the operations culture (somebody's got to tell the developers they can't adopt MySQL if it's a PostgreSQL shop, etc.). Fundamentally, Ops needs to be able to say No to Dev. The SRE take on it is to "hand the pager back to Dev", but the actual method of saying No is different from Ops culture to Ops culture.

> reliability people who tend to be systems experts but lack the business logic intuition to spot errors at a glance

If Dev didn't build the monitoring, the observability, put proper logging in place, etc., then honestly, Dev isn't going to spot the errors at a glance. Customer Service will when customers complain. @jedberg seems to think that Developers should write code to auto-solve their operations issues. If Developers can write code to auto-solve their operations issues, and Developers obviously anyway need to add telemetry etc., then why, pray tell, should it be so unreasonable to expect Developers to be able to succinctly add the kind of telemetry and documentation that explains the business logic, according to an Operations standard, such that Operations can thus keep the system running?


Correct. Throwing software over the wall to "other people" and letting them deal with the problems of running the software is guaranteed to lead to low quality, inefficient processes, or usually both.


I'd argue that timezone is just part of the problem. If you're responsible for a high oncall load, you are subjected to a steady, unpredictable stream of interrupts requiring you to act to minimize downtime or degradation. Obviously it's worse if you get these at night, but it's still bad during the day.

I think the anti-pattern is having one team responsible for another's burden. You want teams to both be responsible for fixing their own systems when they break, AND be empowered to build/fix their broken systems to minimize oncall incidents.


At the end of the day, there's a human cost to responding to pages, and there's a human cost to collaboration.

Both of those can drive burnout. Personally, I find all that collaboration work very hard and stressful, so I work better in a situation where I get pages for the services I control; but that would change if pages were frequent and mostly related to dependencies outside of my control. It also helps to have been working in organizations that prioritize a working service over features. Getting frequent overnight issues that can't be resolved without third-party effort that isn't going to happen anytime soon is a major problem, and one I see reported in threads like this.

I can also get behind a team that can manage the base operations issues like ram/storage/cpu faults on nodes and networking. The runbooks for handling those issues are usually pretty short and don't need much collaboration.


My experience is that the expectations of what your average engineer should be able to handle have grown enormously during the last 10 years or so. Working with both large distributed systems and medium-sized monolithic systems, I have seen the expectations become a lot higher in both.

When I started my career the engineers at our company were assigned a very specific part of the product that they were experts on. Usually there were 1 or 2 engineers assigned to a specific area and they knew it really well. Then we went Agile(tm) and the engineers were grouped into 6 to 9 person teams that were assigned features that spanned several areas of the product. The teams also got involved in customer interaction, planning, testing and documentation. The days when you could focus on a single part of the system and become really good at it were gone.

Next big change came when the teams moved from being feature teams to devops teams. None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software.

In some ways I agree that these changes have empowered us. But it is also, as you say, exhausting. Once I was simply a programmer; now I'm a domain expert, project manager, programmer, tester, technical writer, database admin, operations engineer, and so on.


It sounds like whoever shaped your teams & responsibilities didn't take into account the team's cognitive load. I find it's often overlooked, especially by those who think agile means "everyone does everything". The trick is to become agile whilst maintaining a division of responsibilities between teams.

If you look up articles about Team Topologies by Matthew Skelton and Manuel Pais, they outline a team structure that works for large, distributed systems.


I'll have a look at the book. Thanks!


> None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software

On the flipside, in the olden days when one set of people were churning out features and another set of people were given a black box to run and be responsible for keeping running, it was very hard to get the damn thing to work reliably, and the only recourse you often had was to "just be more careful", which often meant release aversion and multi-year release cycles.

Hence, some companies explored alternatives, found ways to make them work, wrote about their success but a lot of people copied only half of the picture and then complained that it didn't work.


> only half of the picture

Can you please share some details about what you think is missing from most "agile"/devops teams?


Proper staffing


Ah excellent. Yes. In my experience there's this idea of "scale at all costs"--a better way would probably be to limit scaling until the headcount is scaled. Although then you probably need more VC money.


Might I add that you are also now underpaid. I had a sweet gig at a very small company where I had to manage contractors in addition to FTE staff. The good contractors billed $300 an hour for BA and project management services alone. The story munchers billed $150 an hour.

I had to leave a contracting gig recently because we were tasked with everything...literally everything. Everyone got so burnt out--FTEs included. I also might add that the developers could have spoken up and gotten relief but their misguided work ethic prevented that.


In these large scale systems the boundaries are usually not well defined (there are APIs but data flowing through the APIs is another matter as are operational and non functional requirements).

Stress is often caused by a mismatch between what you feel responsible and accountable for and what you really control. The more you know, the more you feel responsible for, but you are rarely able to expand control as much or as fast as your knowledge. It helps to be very clear about where you have ultimate say (accountability), where you have control within some framework (responsibility), and where you simply know and contribute. Clear in your mind, to others, and to your boss. Look at areas outside your responsibility with curiosity and willingness to offer support, but know that you are not responsible and others need to worry.


This is spot on. Feeling frustrated working on large distributed systems could be generalized as “feeling frustrated working in a large organization” because the same limitations apply. You learn about things you cannot control, and it is important to see the difference between what you can control and contribute and what you can’t.


The first ten years of my career, I worked with distributed systems built on this stack: C++, Oracle, Unix (and to some extent, MFC and Qt). There were hundreds of instances of dozens of different types of processes (we would now call these microservices) connected via TCP links, running on hundreds of servers. I seldom found this exhausting.

The second ten years of my career, I worked with (and continue to work on) much simpler systems, but the stack looks like this: React/Angular/Vue.js, Node.js/SpringBoot, MongoDB/MySQL/PostgreSQL, ElasticSearch, Redis, AWS (about a dozen services right here), Docker, Kubernetes. _This_ is exhausting.

When you spend so much time wrangling a zoo of commercial products, each with its own API and its own vocabulary for what should be industry standards (think TCP/IP, ANSI, ECMA, SQL), each constantly obsoleted by competing "latest" products, that you don't have enough time to focus on code, then yes, it can be exhausting.


You know what? This is a really great point. When I reflect back on my career experience (at companies like Expedia, eBay, Zillow, etc.) the best distributed systems experience I had was at companies that standardized on languages and frameworks and drew a pretty strong boundary around those choices.

It wasn't that you technically couldn't choose another stack for a project, but to do so you had to justify the cost/benefit with hard data, and the data almost never bore out more benefit than cost.



Modern day embarrassing spaghetti cloud.


Absolutely right.


I've found that external tech requirements are horrible to work with, especially when the underlying stack simply doesn't support it. Normally these are pushed by certified cloud consultants or by an intrepid architect who found another "best practice blog."

It begins with small requirements, such as coming up with a disaster recovery plan, only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.

Blink and you're implementing redundant networking (cross AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per service health checks, EFS/FSX network mounts for persistent data that expensive enterprise app insists storing on-disk and some kind of HA database/multi-master SQL cluster.

... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).

If only the requirement was for a DR run-book + regular fire drills.


I think this highlights the importance of actually analyzing your RPO/RTO (recovery point / recovery time objective) requirements through the lens of business value, and being honest about the ROI of buying that extra 9 in uptime.

It may be the case that 2 hours of downtime is completely unacceptable for the business, and paying $Xmm extra per year to maintain it is the right call. Or it may be that the business would be horrified to learn how many dollars are being spent to avert a level of downtime that no customer would notice or care about.

If the requirement is just being set by engineering, then it's more about finding the equilibrium where the resource spent on automation balances the cost of the manual toil and the associated morale impact on the team. Nobody wants to work on a team where everything is on fire all the time, and it's time/money well spent to avert that situation.
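
To make that trade-off concrete, a back-of-envelope sketch (Python; every number here is invented):

    # Compare the yearly cost of expected downtime against the yearly cost of avoiding it.
    downtime_hours_per_year = 4           # expected outage hours with only a manual runbook
    revenue_lost_per_hour = 50_000        # what an hour of downtime costs the business
    cost_of_downtime = downtime_hours_per_year * revenue_lost_per_hour      # 200,000

    engineering_cost = 6 * 30_000         # ~6 engineer-months of HA/automation work
    extra_infra_cost = 120_000            # standby replicas, multi-AZ data stores, etc.
    cost_of_prevention = engineering_cost + extra_infra_cost                # 300,000

    print("buy the extra nine" if cost_of_prevention < cost_of_downtime
          else "write the runbook and drill it")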


...how is the JIRA server mission critical? is it tied to CI/CD somehow?


In the enterprise you'll find that Jira is used for general workflow management, not just CI/CD. I've encountered teams of analysts who spend their working days moving and editing work items. It's the Quicken of workflow management solutions.

Jira Server is deliberately hobbled by the sequence table + no Aurora support, and it's now EOL (no security updates 1 year after purchase!). The DC edition scales horizontally if you have 100k.

Jira in general is a poorly thought out product (looking at you customfield_3726!) but it's held in such a high regard by users it's impossible to avoid.


Pre-COVID I would have laughed at this. But now, no one knows what a user story should be unless you can read it off Jira, and there are no backups, of course.


Gives me a fun idea: a program that randomly deletes items out of your backlog.
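
Something like this, tongue firmly in cheek (a Python sketch assuming Jira's REST search/delete endpoints and an API token; obviously don't point it at a real project):

    import random
    import requests

    BASE = "https://example.atlassian.net"
    AUTH = ("bot@example.com", "api-token")   # placeholder credentials

    def backlog_chaos(project="PROJ", victims=1, dry_run=True):
        # Pick from the stalest non-done issues and delete a random few of them.
        jql = f"project = {project} AND statusCategory != Done ORDER BY updated ASC"
        resp = requests.get(f"{BASE}/rest/api/2/search",
                            params={"jql": jql, "maxResults": 200}, auth=AUTH)
        issues = resp.json().get("issues", [])
        for issue in random.sample(issues, min(victims, len(issues))):
            print("deleting", issue["key"])
            if not dry_run:
                requests.delete(f"{BASE}/rest/api/2/issue/{issue['key']}", auth=AUTH)

    backlog_chaos(dry_run=True)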


"Chaos engineering for your backlog"


I've done that. I deleted items from the backlog that I thought made no sense (anymore); nobody cared or asked any questions. If you didn't work on it in the last 18 months, it's probably not important and nobody cares.


I used to lead teams that owned a message bus, a stream processing framework, and a distributed scheduler (like k8s) at Facebook.

The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.

Another thing is, we had to cut down on the complexity: reduce the number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.

Now, Facebook being Facebook, it valued speed and complexity over stability and simplicity, especially when it came to career growth discussions. So it's hard to build good infra in the company.


I like that the mantra went from "move fast and break things" to (paraphrased) "move fast and don't break things".


It's been a pretty poor mantra from the beginning anyway. How about we move at a realistic pace and deliver good features, without burning out, and without leaving a trail of half-baked code behind us?


I think it's probably less fun to gradually replace things with better things than to - say - write your own alternative PHP backend.


Without more info it’s hard to say. When I felt like this, a manager recommended I start journaling my energy. I kept a Google doc with sections for each week. In each section, there’s a bulleted list of things I did that gave me energy and a list of things I did that took energy.

Once you have a few lists some trends become clear and you can work with your manager to shift where you spend time.


I love building and developing software, and despite the fun and interesting challenges presented at my last job I quit because of the operations component. We adopted DevOps and it felt like "building" got replaced with "configuring" and managing complex configurations does not tickle my brain at all. Week-long on-call shifts are like being under house arrest 24/7.

I understand the value that developers bring to operational roles, and to some extent making developers feel the pain of their screwups is appropriate. But when DevOps is 80% Ops, you need a fundamentally different kind of developer.


After-hours on-call is a thing that needs to be destroyed. A company that is sufficiently large that the CEO doesn't get woken up for emergencies needs to have shifts in other timezones to handle them. I don't know why people put up with it.


Part of it is a culture that discourages complaining about after hours work.

There's an expectation that everyone is a night owl and that night time emergency work is fun, and that these fires are to be expected.

Finally, engineers seem to get this feeling of being important because they wake up and work at night. It's really a form of insanity.


It's hard to answer this because you don't specify what exactly you find exhausting. Is it oncall? Deployment? Performance issues? Dealing with different teams? Failures and recovery? The right hand not knowing what the left hand is doing? Too many services? Something else?

It's not even clear how big your service is. You mention billions of requests per month. Every 1B requests/month translates to ~400 QPS, which isn't even that large. Like, that's single server territory. Obviously spikiness matters. I'd also be curious what you mean by "large amount of data".
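
For reference, the arithmetic behind that estimate (Python):

    # requests per month -> average queries per second (30-day month)
    def avg_qps(requests_per_month, days=30):
        return requests_per_month / (days * 24 * 3600)

    print(avg_qps(1e9))    # ~386 QPS for 1B/month
    print(avg_qps(10e9))   # ~3,858 QPS for 10B/month; peaks will be several times that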


> Every 1B requests/month translates to ~400 QPS, which isn't even that large

I said billions not one billion.

I guess what I find exhausting is the long feedback cycle. For example, writing a simple script that makes two calls to different APIs requires tons of wiring for telemetry, monitoring, logging, error handling, integrating with two APIs, setting up the proper Kubernetes manifests, and setting up the required permissions to run this thing and have them available to k8s. I find all this to be exhausting. We're not even talking about operating this thing yet (on-call, running into issues with the APIs owned by other teams, etc.).
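
To illustrate the ratio (a hedged Python sketch with made-up endpoints): the "business logic" is two GET calls, and nearly everything else is wiring, before you even get to the manifests and permissions.

    import logging
    import time
    import requests

    log = logging.getLogger("glue-script")

    def call_with_retry(url, attempts=3):
        """Call a JSON API with timeouts, retries, logging and crude latency telemetry."""
        for attempt in range(attempts):
            try:
                start = time.monotonic()
                resp = requests.get(url, timeout=5)
                resp.raise_for_status()
                log.info("%s ok in %.0f ms", url, (time.monotonic() - start) * 1000)
                return resp.json()
            except requests.RequestException as exc:
                log.warning("%s failed (attempt %d/%d): %s", url, attempt + 1, attempts, exc)
                time.sleep(2 ** attempt)
        raise RuntimeError(f"{url} unavailable after {attempts} attempts")

    orders = call_with_retry("https://api-a.internal/orders")    # hypothetical team A API
    prices = call_with_retry("https://api-b.internal/prices")    # hypothetical team B API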


This sounds like your team/organization needs to invest in tooling. Processes that take long should ideally be automated and run async, with a notification of the result generated some time later, freeing up some of your time.


Automate that process that you find tedious; if you find it tedious, ask your coworkers if they do as well. Make the right time/automation trade offs. https://xkcd.com/1205/

Yes, work is tedious.


I find it exhilarating, but you have to have a well architected distributed system. Some key points:

- Your micro service should be able to run independently. No shared data storage, no direct access into other microservices' storage.

- Your service should protect itself from other services, rejecting requests before it becomes overloaded.

- Your service should be lenient on the data it accepts from other services, but strict about what it sends.

- Your service should be a good citizen, employing good backoffs when other services it is calling appear overloaded (a minimal sketch follows this list).

- The API should be the contract and fully describe your service's relationship to the other services. You should absolutely collaborate with the engineers who make other services, but at the end of the day anything you agree on should be built into the API.
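
A minimal sketch of the backoff bullet (Python, exponential backoff with full jitter; call_service and TransientError stand in for your client and whatever "overloaded" looks like for it):

    import random
    import time

    class TransientError(Exception):
        """Stand-in for 'the downstream service looks overloaded or unavailable'."""

    def call_with_backoff(call_service, retries=5, base=0.1, cap=10.0):
        for attempt in range(retries):
            try:
                return call_service()
            except TransientError:
                if attempt == retries - 1:
                    raise
                # exponential backoff with full jitter, capped
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))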

Generally if you follow these best practices, you shouldn't have to maintain a huge working knowledge of the system, only detailed knowledge of your part, which should be small enough to fit into your mental model.

There will be a small team of people responsible for the entire system and how it fits together, but ideally if everyone is following these practices, they won't need to know details of any system, only how to read the APIs and the call graph and how the pieces fit together.


Jobs aren’t exhausting. Teams are. If you find yourself feeling this way, consider that the higher ups may be mismanaging.

There’s often not a lot of organizational pressure to change anything. So the status quo stays static. But the services change over time, so the status quo needs to change with them.


Agree with this. Conway's Law will always hold. If a company does not organize its teams into units that actually hold full responsibility and full control/agency over that responsibility, those teams will burn out.

When getting anything done requires constant meetings, placing tickets, playing politics, and doing anything and everything to get other teams to accept that they need to work with you and prioritize your tasks so that you can get them done, you will burn out.


I don't find it exhausting, I find it *exhilarating*.

After years of proving myself, earning trust, and strategic positioning, I am finally leading a system that will support millions of requests per second. I love my job and this is the most intellectually stimulating activity I have done in a long while.

I think this is far from the expectation of the average engineer. You can find many random companies with very menial and low stake work. However if you work at certain companies you sign up for this.

BTW I don't think this is unreasonable. This is precisely why programmers get paid big bucks, definitely in the US. We have a set of skills that require a lot of talent and effort, and we are rewarded for it.

Bottom line this isn't for everyone, so if you feel you are done with it that's fair. Shop around for jobs and be deliberate about where you choose to work, and you will be fine.


> I am finally leading a system that will support millions of requests per second.

This is the difference. Millions of things per second is a super hard problem to get right in any reality. Pulling this off with any technology at all is rewarding.

Most distributed systems are not facing this degree of realistic challenge. In most shops, the challenge is synthetic and self-inflicted. For whatever reason, people seem to think saying things like "we do billions of x per month" somehow justifies their perverse architectures.


Your story is close to home. I was part of a team that integrated our newly-acquired startup with a massive, complex and needlessly distributed enterprise system that burned me out.

Being forced to do things that absolutely did not make sense (CS-wise) was what I found to be most exhausting. Having no other way than writing shitty code or copying functionality into our app led me to an eventual burnout. My whole career felt pointless, as I was unable to apply any of the skills and expertise that I learned over all these years, because everything was designed in a complex way. Getting a single property into an internal API is not a trivial task and requires coordination from different teams, as there are a plethora of processes in place. However, I helped to build a monstrous integration layer, and everything wrong with it is partly my doing. Hindsight is 20/20 and I now see there really was no other, better way to do it, which feels nice in a schadenfreude kind of way.

I sympathise with your point about not understanding what is expected of an average engineer nowadays. Should you take initiative and help manage things, are you allowed to simply write code and what should you expect from others were amongst my pain points. I certainly did not feel rewarded for going the extra mile, but somehow felt obliged because of my "senior" title.

I took therapy, worked on side projects, and I'm now trying out a manager role. My responsibilities are pretty much the same, but I don't have to write code anymore. It feels empowering to close my laptop after my last Zoom meeting and not think about bugs, code, CI, or merging the night before a release day.

But hey, the grass is always greener on the other side! I think taking therapy was one of my life's best decisions after being put through the wringer. Perhaps it will help you as well!


It's exhausting when the business does not give you the support you need and leans on you to do too much work. Find another place to work where they do things without stress (ask them in the interview about their stress levels and workload). Make sure leadership are actively prioritizing work that shores up fundamental reliability and continuously improves response to failure.

When things aren't a tire fire, people will still ask you to do too much work. The only way to deal with it without stress is to create a funnel.

Require all new requests come as a ticket. Keep a meticulously refined backlog of requests, weighted by priorities, deadlines and blockers. Plan out work to remove tech debt and reduce toil. Dedicate time every quarter to automation that reduces toil and enables development teams to do their own operations. Get used to saying "no" intelligently; your backlog is explanation enough for anyone who gets huffy that you won't do something out of the blue immediately.


> We run into very interesting problems due to scale (billions of requests per month for our main public apis) and the large amount of data we deal with.

So, if you are handling 10 billion requests per month, that would average out to about 4k per second.

Are these API calls data/compute intensive, or is this more pedestrian data like logging or telemetry?

Any time I see someone having a rough time with a distributed system, I ask myself if that system had to be distributed in the first place. There is usually a valuable lesson to be learned by probing this question.


Yes! A single machine can handle tons of traffic in many cases.


That question probably needs more information.

But your 'average engineer' is probably better served by asking whether the system really needed to be that large and distributed rather than whether working on it is exhausting. The vast bulk of the websites out there doesn't need that kind of overkill architecture; typically the non-scalable parts of the business preclude needing such a thing to begin with. If the work is exhausting, that sounds like a mismatch between the architecture choice and the size of the workforce responsible for it.

If you're an average (or even sub average) engineer in a mid sized company stick to what you know best and how to make that work to your advantage, KISS. A well tuned non-distributed system with sane platform choices will outperform a distributed system put together by average engineers any day of the week, and will be easier to maintain and operate.


I find it "exhilirating," not "exhausting." But I also don't think that "...your average engineer should now be able to handle all this." That is where we went completely wrong as an industry. It used to be said that what we work on is complex, and you can either improve your tools or you can improve your people. I've always held that you will have to improve your people. But clever marketing of "the cloud" has held out the false promise that anyone can do it.

Lies, lies, and damn lies, I say!

Unless you have bright and experienced people at the top of a large distributed systems company, who have actually studied and built distributed systems at scale, your experience of working in such a company is going to suck, plain and simple. The only cure is a strong continuous learning culture, with experienced people around to guide and improve the others.


Yeah, large-scale systems are often boring in my experience, because the scale limits what features you can add to make things better. Each and every decision has to take scale into account, and it's tricky to try experimenting.

I think it has to do with the kind of engineer you are. Some engineers love iterating and improving such systems to be more efficient, more scalable, etc. But it can be limiting due to the slower release cycles, hyper focus on availability, and other necessary constraints.


I don't think they are boring, but it depends a lot on the kind of engineer you are. At AWS I try to encourage people who like the problem space, or at the very least appreciate it, but I can totally understand not wanting to spend your entire career on it. Many of our younger folks have never felt the speed and joy you can get from hammering out a simple app (web, Python, ML) that doesn't have to work at scale.


Recently I was asked to work on an older project for enterprise customers. And we are always wary of working on old, unmaintained code.

But it just felt like a breath of fresh air

All code in the same repository: UI, back-end, SQL, MVC style. Fast from feature request to delivery in production. Change, test, fix bugs, deploy. We were happy and the customers were too.

No cloud apps, buckets, secrets, no OAuth, little configuration, no Docker, no microservices, no proxies, no CI/CD. It does look like somewhere along the way we overcomplicated things.


100% agree with you. OAuth + Docker/Kubernetes + massive configs just to make things build suck the life out of every project for me that has them. And it's even worse when the project uses a non-git version control system.


Google's SRE books cover a lot of the things that large teams managing large distributed systems encounter and how to tackle it in a way that doesn't burn out engineers. Depending on organization size/spread, follow-the-sun oncall schedules drastically reduce burnout and apprehension about outages. Incident management procedures give confidence when outages do happen. Blameless postmortems provide a pathway to understanding and fixing the root causes of troublesome outages. Automation reduces manual toil. Google SRE has been keeping a lot of things running for a decade or more and has learned a lot of lessons. I did that from 2014 to 2018 and it seemed like a pretty mature organizational approach, and the books document essentially that era.


My take is that it's exhausting because everything is so damn SLOW.

"Back to the 70's with Serverless" is a good read:

https://news.ycombinator.com/item?id=25482410

The cloud basically has the productivity of a mainframe, not a workstation or PC. It's big and clunky.

----

I quote it in my own blog post on distributed systems

http://www.oilshell.org/blog/2021/07/blog-backlog-2.html

https://news.ycombinator.com/item?id=27903720 - Kubernetes is Our Generation's Multics

Basically I want basic shell-like productivity -- not even an IDE, just reasonable iteration times.

At Google I saw the issue where teams would build more and more abstraction and concepts without GUARANTEES. So basically you still have to debug the system with shell. It's a big tower of leaky abstractions. (One example is that I had to turn up a service in every data center at Google, and I did it with shell invoking low level tools, not the abstractions provided)

Compare that with the abstraction of a C compiler or Python, where you rarely have to dip under the hood.

IMO Borg is not a great abstraction, and Kubernetes is even worse. And that doesn't mean I think something better exists right now! We don't have many design data points, and we're still learning from our mistakes.

----

Maybe a bigger issue is incoherent software architectures. In particular, disagreements on where authoritative state is, and a lot of incorrect caches that paper over issues. If everything works 99.9% of the time, well, multiply those probabilities together, and you end up with a system that requires A LOT of manual work to keep running.
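
The arithmetic behind that, with made-up numbers (Python):

    deps_on_critical_path = 30    # services, caches, queues a request touches (invented)
    availability_each = 0.999
    combined = availability_each ** deps_on_critical_path
    print(combined)                      # ~0.970
    print((1 - combined) * 30 * 24)      # ~21 hours of degradation per 30-day month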

So I think the cloud has to be more principled about state and correctness in order not to be so exhausting.

If you ask engineers working on a big distributed system where the authoritative state in their system is stored, then I think you will get a lot of different answers...


It's okay to prefer working on small single server systems with small teams for example. I do this while contracting quite often and enjoy how much control you get to make big changes with minimal bureaucracy.

Sometimes it feels like everyone is focused on eventually working with Google scale systems and following best practices that are more relevant towards that scale but you can pick your own path.


Humans GET simplicity from extreme hyper complexity.

Take a gas generator. Easy, add oil and gas and get electricity and these days they even come in a smoothed over plastic shell that makes it look like a toy. Inside, very complex, spark plugs, engine, coils, inverter. A hundred years of inventions packed into a 1.5' x 1.5' box.

It's the same thing for complicated systems. Front end to back. No matter how ugly or how much you wish it was refactored - some exec knows it as a box where you put something in and magical inference comes out. Maybe that box actually causes real change in the physical world - like billions of packages being sent out all over the world.

In the days of castles you would have similar systems managed by people. People that drag wooden carts of shit out of a castle. Carrying water around. Manually husking corn and wheat and what have you.

No matter how far into the future we go, we will continue to get simple out of monstrous complexity.

That's not the answer to your question - but it's just that the world will always lean towards going that way.


Handling scale is a technically challenging problem; if you enjoy it, then take advantage! However, sometimes taking a break to work on something else can be more satisfying.

Typically on a "High scale" service spanning hundreds or thousands of servers you'll have to deal with problems like. "How much memory does this object consume?", "how many ms will adding this regex/class into the critical path use?", "We need to add new integ/load/unit tests for X to prevent outage Y from recurring", and "I wish I could try new technique Y, but I have 90% of my time occupied on upkeep".

It can be immensely satisfying to flip to a low-scale, low-ops problem space and find that you can actually bang out 10x the features/impact when you're not held back by scale.

Source: Worked on stateful services handling 10 Million TPS, took a break to work on internal analytics tools and production ML modeling, transitioning back to high scale services shortly.


I'm trying to relate this to my experiences. The best I can make of it is that burnout comes from dealing with either the same types of problems, or new problems at a rate that's higher than old problems get resolved.

I've been in those situations. My solution was to ensure that there was enough effort into systematically resolving long-known issues in a way that not only solves them but also reduces the number of new similar issues. If the strategy is instead to perform predominantly firefighting with 'no capacity' available for working on longer term solutions there is no end in sight unless/until you lose users or requests.

I am curious what the split is of problems being related to:

1. error rates, how many 9s per end-user-action, and per service endpoint

2. performance, request (and per-user-action) latency

3. incorrect responses, bugs/bad-data

4. incorrect responses, stale-data

5. any other categories

Another strategy that worked well was not to fix the problems reported but instead fix the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time consuming. This of course needs to continue, but if sufficient effort is put into resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect, reducing problems in general.

A specific example was how effort into performance was toward average latency for the most frequently used endpoints. I changed the effort instead to reduce the p99 latency of the worst offenders. This made the system more reliable in general and paid off in a trend to fewer problem reports, though it's not easy/possible to directly relate one to the other.
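
A toy illustration of the p99-vs-average distinction (Python, nearest-rank percentile, made-up latencies):

    import math
    from statistics import mean

    latencies_ms = sorted([12, 14, 15, 15, 16, 18, 20, 22, 25, 900])  # one slow outlier
    p99 = latencies_ms[math.ceil(0.99 * len(latencies_ms)) - 1]       # nearest-rank p99
    print(mean(latencies_ms))   # 105.7 -- the average hides how bad the tail is
    print(p99)                  # 900 -- what retries, timeouts and angry users actually feel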


Using micro-services instead of monoliths is a great way for software engineers to reduce the complexities of their code. Unfortunately, it moves the complexity to operations. In an organization with a DevOps culture, the software engineers still share responsibility for resolving issues that occur between their micro-service and others.

In other organizations, individual teams have ICDs and SLAs for one or more micro-services and can therefore state they're meeting their interface requirements as well as capacity/uptime requirements. In these organizations, when a system problem occurs, someone who's less familiar with the internals of these services will have to debug complex interactions. In my experience, once the root-cause is identified, there will be one or more teams who get updated requirements - why not make them stakeholders at the system-level and expedite the process?


> Using micro-services instead of monoliths is a great way for software engineers to reduce the complexities of their code

Could you share why you think that's true?

IMO that it's exactly the opposite - microservices have potential to simplify operations and processes (smaller artifacts, independent development/deployments, isolation, architectural boundaries easier to enforce) but when it comes to code and their internal architecture - they are always more complex.

If you take microservices and merge them into a monolith - it will still work, you don't need to add code or increase complexity. You actually can remove code - anything related to network calls, data replication between components if they share a DB, etc.


In all the situations where I have had to work on microservices, it generally means the team just works on all the different services, now spread out over more applications, doing more integration work vs. actual business logic. Because the fancy microservices the architect wanted don't mean there's actually money to do it properly or even to have an ops team.

Also, for junior team members a lot of this stuff works via magic, because they can't yet see where the boundaries are or don't understand all the automagical configuration stuff.

Also, the amount of "works on my machine" with Docker is staggering, even if the developers' laptops are from the same batch of imaged machines.


One problem I frequently see with distributed systems is not the amount of services and the distributed nature per se.

Rather that it allows, and tempts, you to use the perfect tool for each job. Leading to a lot of variations in your stack.

Suddenly you have 5 different databases, 3 RPC protocols, 4 programming languages and 2 operating systems spinning around in your cluster. Only half of them connected to your single sign on. And don’t forget about all the cloud dependencies.

If any one of them starts misbehaving, you have to read up on "how did I attach a debugger to a Java process again?" How do I even log in to a MongoDB shell? I installed pgAdmin last week.

Standardize your stack and accept that sometimes it might mean using something slightly inefficient in the small scheme of things. In the big scheme, it will make things more homogeneous, unified, and simpler for operators.


The most undervalued thing, forgotten even by highly skilled engineers: the KISS principle. That's why you are burning out supporting such systems.


Yes, it's amazing how much one modern high-spec system running good code can do. Turn off all the distributed crap and just use a pair in a leader/follower config with short-TTL DNS to choose the leader and manual failover scripts. If your app/company/industry cannot accept the compromises of such a simple config, quit and work in one that can.


Good code? Where?

This whole thread feels like therapy since I face the same monsters on the systems I work on. Partly due to bad platform & code, partly due to bad organization structure (Conway's 100% for us).

My pet projects at home are the only thing keeping me sane, mostly because they are simple.


Yes, but in a different way. I work in Quality Engineering, and the scope of maturity in testing distributed systems has been exhausting.

Reading other comments from the thread, I see similar frustrations from teams I partner with. How to employ patterns like contract tests, hypothesis tests, doubles, or shape/data systems (etc.) typically gets conflated with system testing. Teams often disagree on the boundaries of the system, start leaning towards system testing, and end up adding additional complexity in tests that could be avoided.

My thought is that I see the desire to control more scope presenting itself in test. I typically find myself doing some bounded-context exercises to try to home in on scope early.


I so wish there were in-person meetups and conferences going on so I might have been nearby and overheard you saying that so I could try to join in the conversation. Sounds fascinating and just the sort of insight that doesn't come up in the entirely planned and scheduled zooms I'm usually in (and HN, for all its virtues, isn't really a substitute for a great conversation).


Yup. Spent more than a decade doing it. Got so frustrated that I started a company to try to abstract it all away for everyone else. It's called M3O https://m3o.com. Everyone ends up building the same thing over and over: a platform with APIs either built in-house or an integration to external public APIs. If we reuse code, why not APIs?

I should say, I've been a sysadmin, SRE, software engineer, open source creator, maintainer, founder and CEO. Worked at Google, bootstrapped startups, VC-funded companies, etc. My general feeling: the cloud is too complex and I'm tired of waiting for others to fix it.


>Consume public APIs as simpler programmable building blocks

Is the 'r' in 'simpler' there intentionally? In what way are the building blocks simpler than simple blocks?


Simpler than the public APIs


Mental / emotional burnout is certainly not uncommon in tech (probably in most other careers, I'd bet). Most people in Silicon Valley are changing jobs more often than 4-5 years. I don't like to constantly be the new guy, but there is a refreshing feeling to starting on something new and not carrying years of technical debt on your emotions. Maybe it's time to try something new, take a bigger vacation than usual, or talk to someone about new approaches you can try in your professional or personal life. But certainly don't let the fact that you feel like this add to the load - you're not alone, and it's not permanent.


I find it actually the other way around.

As you said, a benefit of large distributed systems is that usually it's a shared responsibility, with different teams owning different services.

The exhaustion comes into place when those services are not really independent, or when the responsibility is not really shared, which in turn is just a worse version of a typical system maintained by sysadmins.

One thing that helps is bringing the DevOps culture into the company, but the right way. It's not just about "oh cool, we are now agile and deploy a few times a day"; it all comes down to shared responsibility.


It definitely can be. I'm constantly trying to push our stack away from anti-patterns and towards patterns that work well, are robust, and reduce cognitive load.

It starts by watching Simple Made Easy by Rich Hickey. And then making every member of your team watch it. Seriously, it is the most important talk in software engineering.

https://www.infoq.com/presentations/Simple-Made-Easy/

Exhausting patterns:

- Mutable shared state

- distributed state

- distributed, mutable, shared state ;)

- opaque state

- nebulosity, soft boundaries

- dynamism

- deep inheritance, big objects, wide interfaces

- objects/functions which mix IO/state with complex logic

- code that needs creds/secrets/config/state/AWS just to run tests

- CI/CD deploy systems that don't actually tell you if they successfully deployed or not. I've had AWS task deploys that time out but actually worked, and ones that seemingly take, but destabilize the system.

---

Things that help me stay sane(r):

- pure functions

- declarative APIs/datatypes

- "hexagonal architecture" - stateful shell, functional core

- type systems, linting, autoformatting, autocomplete, a good IDE

- code does primarily either IO, state management, or logic, but as little as possible of the others

- push for unit tests over integration/system tests wherever possible

- dependency injection

- ability to run as much of the stack locally (in docker-compose) as possible

- infrastructure-as-code (terraform as much as possible)

- observability, telemetry, tracing, metrics, structured logs

- immutable event streams and reducers (vs mutable tables)

- make sure your team takes time periodically to refactor, design deliberately, and pay down tech debt.
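
A tiny sketch of the "hexagonal architecture" bullet (Python; names are illustrative), which also covers the pure-function and dependency-injection points:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Order:
        subtotal_cents: int
        country: str

    # Functional core: pure, no IO, trivially unit-testable.
    def total_with_tax(order: Order, tax_rates: dict) -> int:
        rate = tax_rates.get(order.country, 0.0)
        return round(order.subtotal_cents * (1 + rate))

    # Stateful shell: all IO lives here and stays thin; dependencies are injected.
    def handle_request(order_id, fetch_order, load_tax_rates, respond):
        order = fetch_order(order_id)                 # DB or API call
        respond(total_with_tax(order, load_tax_rates()))

The core function can be tested with no mocks at all; the shell gets a handful of thin integration tests with faked fetch_order/load_tax_rates/respond.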


Only read the transcript but I'm not getting most of it. I mean it starts with a bunch of aphorisms we all agree with but when it should be getting more concrete it goes on with statements that are kind of vague.

E.g. what exactly does it mean to: >> Don’t use an object to handle information. That’s not what objects were meant for. We need to create generic constructs that manipulate information. You build them once and reuse them. Objects raise complexity in that area.

What kind of generic constructs?


I agree with most of your points, but the one that stands out is "push for unit tests over integration/system tests wherever possible".

By integration/system tests, do you mean tests that you cannot run locally?


Most of that I agree with, I'm curious why you'd recommend unit tests over integration tests? It seems at odds with the direction of overall software engineering best practices.


I wrote such a system. 6+ years, between the end of '07 and the beginning of '14. It grew organically, with more and more endpoints as time went by, and when I exited the project it had over 250 endpoints, each handling hundreds of thousands of user requests per day. By your measurement, the system I wrote would've handled 250 (endpoints) x 30 (days) x ~400k (requests per day) == 3B user requests in a month.

To my knowledge the system is still used to this day and I think it grew 10x meanwhile, so I think it's serving over 30B requests each month.

That being said, to answer your question: Yes! I got tired of it, started to plateau, and felt I was lagging behind in terms of keeping up with the technology around me. So I exited, but at the same time I also started to get involved in other projects. So in the end I was overworked, and I ditched the biggest project of my entire career as a freelancer because the payment wasn't worth it anymore. I wanted to feel excited, and the additional projects eventually made up for it in terms of money, but boy oh boy! The variation is what kept me from feeling burnout. Nowadays, if I feel another project is going that route, I discuss with the client replacing me with a team once I deliver the project in a stable state, ready for horizontal scaling.


Worked on a team at BofA; our application would handle 800 million events per day. The logic we had for retry and failure was solid. We also had redundancy across multiple DCs. I think we processed like 99.9999999% of all events successfully. (Basically all of them; last year we lost about 2,000 events total.) I didn't find it very stressful at all. We built in JMX utils for our production support teams to be able to handle practically anything they would need to.


Utils*


TL;DR: Yes, it is exhausting, but I have found ways to mitigate it.

I don't develop stuff that runs billions of queries. More like thousands.

It is, however, important infrastructure on which thousands of people around the world rely, and in some cases it's not hyperbole to say that lives depend on its integrity and uptime.

One fairly unique feature of my work, is that it's almost all "hand-crafted." I generally avoid relying on dependencies out of my direct control. I tend to be the dependency, on which other people rely. This has earned me quite a few sneers.

I have issues...

These days, I like to confine myself to frontend work, and avoid working on my server code, as monkeying with it is always stressful.

My general posture is to do the highest Quality work possible; way beyond "good enough," so that I don't have to go back and clean up my mess. That seems to have worked fairly well for me, in at least the last fifteen years, or so. Also, I document the living bejeezus[0] out of my work, so, when I inevitably have to go back and tweak or fix, in six months, I can find my way around.

[0] https://littlegreenviper.com/miscellany/leaving-a-legacy/


Front end and no dependencies, tell us more


Feel free to see for yourself. I have quite a few OS projects out there. My GH ID is the same as my HN one.

My frontend work is native Swift work, using the built-in Apple frameworks (I ship classic AppKit/UIKit/WatchKit, using storyboards and MVC, but I will be moving onto the newer stuff, as it matures).

My backend work has chiefly been PHP. It works quite well, but is not where I like to spend most of my time.


I think there are a lot of strategies for dealing with the kinds of issues you're working with, but a lot of them involve building a good engineering culture and building a disciplined engineering practice that can adapt and find best scalability practices at that level.

We do billions of requests a day on one of the teams that I manage at work, and that team alone has sole operational and development responsibility for a large number of subsystems to be able to manage the complexity that a sustained QPS of that level requires. But those subsystems are in turn dependent on a whole suite of other subsystems which other teams own and maintain.

It requires a lot of coordination, with a spirit of goodwill and trust among the parties, in order to develop the organizational discipline and rigor needed to handle those kinds of loads without things falling over terribly all the time and everybody pointing fingers at each other.

But! There are lots of great people out there who have spent a lot of time figuring out how to do these things properly and who have come up with general principles that can be applied in your specific circumstances (whatever they may be). And when executed properly, I would argue that these principles can be used to mitigate the burnout you're talking about. It's possible to make it through those rough spots in an organization (which frequently, though not always, come from quick business scaling -- i.e. we grew from 1,000 customers to 10,000 last year), etc.

If you're feeling this kind of feeling and the organization isn't taking steps to work on it, then there are things you can do as an IC to help, too. But this is all a much longer conversation :)


Yes it’s horrible. I actually miss the early 00’s when I did infra and code for small web design agencies. I actually could complete work back then.


Quite the opposite, interestingly. I'm usually in "Platform"-ish roles which touch or influence all aspects of the business, including building and operating services which do a couple orders of magnitude more than OP's referenced scale (in the $job[current] case, O(100B - 1T) requests per day). While I agree with the "Upside" (career progression, intellectual interest, caliber of people you work with), I haven't experienced the burnout, and in 2022 I'm actually the most energized I've been in a few years.

I expect you can hit burnout building services and systems at any scale; it's more reflective of the local environment: the job and the day-to-day, the people you work with, formalized progression and career development conversations, the attitude to taking time off and decompressing, attitudes to oncall, compensation, and other facets.

That said, mental health and well-being are real and IMO need to be taken very seriously. If you're feeling burnout, figuring out why and fixing that is critical. There have been too many tragedies, both during COVID and before :-(


My number one requirement for a distributed system is that the code all be in one place.

There are good reasons for wanting multiple services talking through APIs. Perhaps you have a Linux scheduler that is marshalling test suites running on Android, Windows, macOS and iOS?

If all these systems originate from a single repository, preferably with the top level written in a dynamic language that runs from its own source code, then life can be much easier. Being able to change multiple parts of the infrastructure in a single commit is a powerful proposition.

You also stand a chance of being able to model your distributed system locally, maybe even in a single Python process, which can help when you want to test new infrastructure ideas without needing the whole distributed environment.
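
As a rough sketch of that idea (every name here is hypothetical, just to show the shape), the scheduler-and-test-runners example above could be modelled in a single Python process like this:

    import queue

    class FakeWorker:
        """Stands in for a remote test runner (Android, Windows, macOS, iOS)."""
        def __init__(self, platform):
            self.platform = platform

        def run_suite(self, suite):
            # In production this would be an RPC or HTTP call; here it's a plain call.
            return {"platform": self.platform, "suite": suite, "status": "passed"}

    class Scheduler:
        """The top-level piece that marshals test suites across platforms."""
        def __init__(self, workers):
            self.workers = workers
            self.pending = queue.Queue()

        def submit(self, suite):
            self.pending.put(suite)

        def drain(self):
            results = []
            while not self.pending.empty():
                suite = self.pending.get()
                results.extend(w.run_suite(suite) for w in self.workers)
            return results

    if __name__ == "__main__":
        sched = Scheduler([FakeWorker(p) for p in ("android", "windows", "macos", "ios")])
        sched.submit("login-tests")
        sched.submit("payment-tests")
        for result in sched.drain():
            print(result)

Swapping the in-process calls for real transports then becomes a deployment concern rather than a logic change.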

Your development velocity will be higher and changes less painful. Slow, painful changes are what burn people out and grind progress to a halt.


> My number one requirement for a distributed system is that the code all be in one place.

This is a major source of frustration. Having to touch multiple repositories, sync them, and wait for their deployment/release (if it's a library) just to add a small feature easily wastes a few hours of the day and, most importantly, drains cognitive capacity through context switching.


I find it very draining and vexing to work on systems that have all of its components distributed left and right without clear boundaries, instead of being more coalesced. Distribution in the typical sense - identical spares working in parallel for the sake of redundancy - doesn't faze me very much.


It’d be interesting to know - what are the expectations made of you? In this environment, I’d expect there to be dedicated support for teams operating their services - i.e. SRE/DevOps/Platform teams who should be looking to abstract away some of the raw edges of operating at scale.

That said, I do think there’s a psychological overhead when working on something that serves high levels of production traffic. The stakes are higher (or at least, they feel that way), which can affect different people in different ways. I definitely recognise your feeling of exhaustion, but I wonder if it maybe comes from a lack of feeling “safe” when you deploy - either from insufficient automated testing or something else.

(For context - I’m an SRE who has worked in quite a few places exactly like this)


Let's set aside the "distributed" aspect. To effectively scale a team and a code base you need some concept of "modularization" and "ownership". It is unrealistic to expect engineers to know everything about the entire system.

The problem is that this division of the code base is really hard. It's difficult to find the time and energy to properly section your code base into well-defined domains and APIs, especially with the constantly moving target of what needs to be delivered next. Even in a monorepo it is exhausting.

Now, put on top of that the added burden brought by a distributed system (deployment, protocol, network issues, etc) and you have something that becomes even more taxing on your energy.


Depends. Not the systems themselves, but more the scope of the work and how it is being done. If the field is boring or the design itself is bad (with no ability to make it better, whether by design, code quality, or whatever), my motivation, will, and desire to work teleport to a different dimension; it's a fine line between exhaustion and frustration, I guess. If it is something interesting, I can work on it for days straight without sleeping. Lately I've been working on a personal project, and every time I have to do anything else I feel depressed for having to set it aside.


Can you say more? What specifically is exhausting?

Exhaustion/burnout isn't uncommon but without more context it's hard to say if it's a product of the type of work or your specific work environment.


This is on point... You also give no actual numerical context. Are you saying you are working 40 hours a week and leave work exhausted? Are you saying you work 40 at work, and are on call/email/remote terminals for 40 more hours coordinating teams, putting out fires, designing architecture?

Even then, I would ask you to be more specific. I have a normal 40-hour-a-week uni job as a sysadmin, but it typically takes somewhat more or less (hey, sometimes I can get it done in 35, sometimes it's 50 hours). However, for the last several years we have been so shorthanded, faculty-wise, that I teach (at a minimum) two senior-level computer science classes every semester (I was a professor at another uni). About mid-semester, things will break, professors will make unreasonable demands of building out new systems/software/architecture, and I find myself doing (again, at a minimum) 80 hours a week. On the other hand, I am not exhausted, as I enjoy teaching quite a bit, and I have been a sysadmin for many years and also enjoy that work.


As you imply towards the end, I think things like numbers of hours worked are generally not relevant for stuff like this. I've been incredibly engaged working 12+ hour days and I've been burnt out barely getting 2-3 hours of real work in a day. It has more to do with the nature of the work.


Even though you only did 2-3 hours of "real work", how much actual time investment was in your job? I don't see how somebody can burn out working just 2-3 hours in a day. Maybe emotionally burnt out if you're a therapist or something, but not as a software engineer.


I guess it was more that I got burnt out from other things and ended up only being able to get myself to work 2-3 hours.

That said, I wasn't working especially long hours before, either. Maybe not 2-3 hours but still sub-8. The burnout definitely wasn't caused by long hours.

'Burnout' is a pretty ambiguous word IME but in its most commonly used sense it's pretty unrelated to hours worked. My favorite definition is that burnout is a "felt loss of impact and/or control".


Yes the complexity and scale of these systems is far beyond what companies understand. The salaries of engineers on these systems need to double asap or they risk collapse.


This post resonates with me. I recently joined a big organisation and a team owning such a system. The oncalls are very stressful to me. Our systems aren't that robust, and we don't have control over all the dependencies, so things fail all the time. At the same time, management is consistently pushing for new features. As a consequence, work-life balance is bad and turnover is high.

My hope is that I'll learn to manage the stress and gain more expertise.


Is it really the distributed aspect? Or "just" working on an above-average-complexity project for many years?

The consequences of bugs in many distributed systems (and several other types of systems) are IME often harder to bear than e.g. UI or frontend workflow bugs. It's hard to have caused data loss. And at some point you probably will, even if you're quite careful.

Maybe I'm just projecting...


Yes, it's part of why I'm a stay-at-home dad who does a little bash-scripting sysadmin work as a side job.

Everything has gotten too complicated and slow.


If you're working on distributed systems scheduling and orchestration, then yeah, it's exhausting. I did it for six years as an SRE-SE and am now back to being a SWE on a product team. If you like infrastructure work without having responsibility for the whole system the way scheduling and orchestration forces on you, then look at working on an infrastructure product.


I think our field is so broad that it is somewhat nebulous to talk about the average engineer. But from my experience, taking care of such a large system, with that amount of requests and complexity, is outside of what is expected of an average engineer. I think there is an eventual limit to how much complexity a single engineer can handle over several years.


Relevant comedy video:

https://www.youtube.com/watch?v=y8OnoxKotPQ

This recent video they put out is pretty good, too:

https://www.youtube.com/watch?v=kHW58D-_O64


I have 15 years of experience in dev, but all of that was on smaller projects in a small team. I recently took a gig at a bigger org with a distributed system, oncall, etc. It's exhausting and an information overload. I'll give myself more time to acclimate, but if I still feel like this after a year, I'm out.


I can see how it'd be exhausting to have to deal with the responsibility for the entirety of a few services.

A key part of scaling at an org-level is continuously simplifying systems.

At a certain level of maturity, it's common for companies to introduce a horizontal infra team (that may or may not be embedded in each vertical team).


It's not so much the systems, but the organizations which create systems in their own image so to speak. If making changes is hard, either in the organization or within teams, you better believe any changes to a distributed system will be equally tough to implement.


I did at first, but then learning config management and taking smaller bites helped.

I started out as a systems administrator and it's evolved into doing that more and faster. The tooling helps me get there, but I did have to learn how to give better estimates.


I actually love it, and the more complex the system the better. I have been doing it for more than 10 years now, and every day I learn something new from both the legacy system and the replacement we're working on.


I don't really work on distributed systems, but I do often worry about performance and reliability, and even when I get some wins, the anxiety of not performing well is stressful...


Yes. But remember, with tools and automation getting better, this is a major source of value add that you bring as a software engineer which is likely to have long term career viability.


I think I understand what you mean, but it's hard for me to contextualize, because I'm still working through some of my own past to identify where some of my burnout began.

For my part, I love working at global scale on highly distributed systems, and find deep enjoyment in diving into the complexity that brings with it. What I didn't enjoy was dealing with unrealistic expectations from management, mostly management outside my chain, for what the operations team I led should be responsible for. This culminated in an incident I won't detail, but suffice to say I hadn't left the office in more than 72 continuous hours, and the aftermath was that I stopped giving a shit about what anyone other than my direct supervisor and my team thought about my work.

It’s not limited to operations or large systems, but every /job/ dissatisfaction I’ve had has been in retrospect caused by a disconnect between what I’m being held accountable for vs what I have control over. As long as I have control over what I’m responsible for, the complexity of the technology is a cakewalk in comparison to dealing with the people in the organization.

Now I’ve since switched careers to PM and I’ve literally taken on the role of doing things and being held responsible for things I have no control over and getting them done through influencing people rather than via direct effort. Pretty much the exact thing that made my life hell as an engineer is now my primary job.

Making that change made me realize a few things that helped actually ease my burn out and excite me again. Firstly, the system mostly reflects the organization rather than the organization reflecting the system. Secondly, the entire cultural balance in an organization is different for engineers vs managers, which has far-reaching consequences for WLB, QoL, and generally the quality of work. Finally, I realized that if you express yourself well you can set boundaries in any healthy organization which allows you to exert a sliding scale of control vs responsibility which is reasonable.

My #1 recommendation for you OP is to take all of your PTO yearly, and if you find work intruding into your time off realize you’re not part of a healthy organization and leave for greener pastures. Along the way, start taking therapy because it’s important to talk through this stuff and it’s really hard to find people who can understand your emotional context who aren’t mired in the same situation. Most engineers working on large scale systems I know are borderline alcoholics (myself too back then), and that’s not a healthy or sustainable coping strategy. Therapy can be massively helpful, including in empowering you to quit your job and go elsewhere.


Often when I hear stories of billions of requests per second, it's self-inflicted because of an over-complicated architecture where all those requests are generated by only a few thousand customers... So it's usually a question of how the company operates: do you constantly fight fires, or do you spend your time implementing stuff that has high value for the company and its customers? Fighting fires can get you burned out (no pun intended), while feeling that you deliver a lot of value will make you feel great.


> billions of requests per second

Op said "billions of requests per month".

That's ~thousands of qps.


That's nothing.



Yes. That's why you avoid building them unless you absolutely need to, and build libraries instead.


Yes, a bit. But it's fun. And that kind of motivating fun is hard to find in a big monolithic system.


it's exhausting but can be fun if you have a competent team to support you. I like nothing more than being told "one TPU chip in this data center is bad. Find it efficiently at priority 0."


I find any work exhausting


I find working on single services / components more exhausting.


You are right, I work for a FAANG on one such system and it’s hard.


If you're burnt out, you're most likely being suckered.


Not at all? Stuff is usually fixable.

Orgs and people are not.


Let's say, for argument's sake, it's 50 billion per month; that's roughly 20k requests/sec (50e9 / (30 days x 86,400 s) ≈ 19,300/s). There is zero need for a fancy setup at this scale.


I am not sure you are aware that server load is never evenly distributed over time. And that's the exact problem OP is talking about.

If everybody got a ticket number and made their requests only when they were supposed to, we wouldn't need load balancers.


This is orthogonal to what causes the pain. All the pain comes from distributed state, and at these load levels (even if you peak at 800K requests per second) you don't need distributed state. So most of this pain is self-inflicted.


True, I agree.

In most systems I've seen, the caching layer is invalidated more often than necessary, and most of the traffic could've been avoided with a better URL scheme that's more expressive with regard to its content (and mutations of it).
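
One way to get that expressiveness (a minimal sketch with a made-up helper, not a prescription) is to make URLs content-addressed, so cached entries never need invalidating; a mutation simply yields a new URL:

    import hashlib

    def asset_url(path: str, content: bytes) -> str:
        # Embed a hash of the content in the URL so it changes whenever the content does.
        digest = hashlib.sha256(content).hexdigest()[:12]
        return f"/static/{digest}/{path}"

    # Any change to the file yields a different URL, so the old cached copy can be
    # served forever (e.g. Cache-Control: immutable) and the new one fetched once.
    print(asset_url("app.js", b"console.log('v1')"))
    print(asset_url("app.js", b"console.log('v2')"))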


It's only exhausting when you know deep in your heart that this could run on one t2.large box.


I think it's more likely the Zeitgeist. You see, someone else finds working in data science frustrating, another person nearing his 40s says he's anxious about his career, another guy says he's worried that it's too late to do something about big tech messing up the field, etc.

I've had similar issues recently working at a demanding position I didn't really like even though my achievements may look impressive in my resume. I tried working in a shop somewhere in between aerospace and academia but just didn't fit at all. I ended up joining a small team that I enjoy working with so far and feel much better now.

At a higher level, we're hitting the limits of the current paradigm in many ways, including the monetary system (debt), the environment (pollution) and natural resources, ideology (creativity and innovation), and technology (complexity).

The good news is that this year the current monetary system will cease to exist. This will eventually restructure the economy into a healthier balance. Unfortunately, it will have severe social consequences, as the standard of living will change dramatically (down to somewhere around the 1960s level). This will basically destroy the middle class and thus change the structure of consumption. Obviously, this will mostly affect services and the other non-essential stuff we got used to. On the other hand, it will blow away all the bloat, like the insane market caps of big tech, etc. That is, working in IT may become fun again, like 20 years back :)


The end is near?


Looks like it. I mean, has anything fundamentally changed since 2008? No. The 'Reaganomics' approach has been eating future demand for 40 years. And what is the dollar issuance volume in 2020-2021 compared to the preexisting volume? And what's been happening to the PPI[1] since 2020? I guess we should ask the Fed about that.

[1] https://fondmx.pro/wp-content/uploads/2022/02/image-174.png



