Many commenters here are rightly pointing out Google’s hypocrisy in not actually following the principles in this article. Fair enough. But others are throwing the baby out with the bathwater: it’s a little silly to read comment after comment saying that the advice in TFA must be bad because Google does dumb/bad stuff on the regular. Companies aren’t homogeneous. Even misguided companies may employ people who can teach others important things.
Boeing is a perfect example of this. I would absolutely read an article proposing principles of engineering reliability from a Boeing eng/QA greybeard. Even as the rest of the company spiraled due to horrible leadership and management practices, many people in engineering and quality control did their damnedest to keep those failures from causing even more harm and loss of life. Those people probably have very valuable lessons to share about how to maintain what quality you can in a deeply hostile environment.
Also Google has this problem after they outsourced a bunch of work to third world countries where original thinking is quite limited and management through bureaucracy is the norm.
Indeed, allegations of hypocrisy are a class of ad hominem. They don't necessarily bear on the validity of the argument. It just... feels good? I guess? People LOVE to feel like they caught a hypocrite. It's probably in the Top 5 most sought-after dopamine kicks.
While the text touches on many points I would immediately endorse, the paragraph starting with "Because engineers are human beings who often form an emotional attachment to their creations, ..." is really out of place.
The cause of complexity is not emotional attachment; these are decisions being made. The decision to add feature after feature and punt on maintenance, for example, has little to do with emotions. Engineers, SWE and SRE alike, have a lot of agency in shaping how things are. There can, however, be good reasons to abandon simplicity. The real trouble here is not psychology but that, as a profession, we are really bad at measuring and estimating the effective cost of maintenance. Part of that is treating measures to improve simplicity and maintainability as cost without gain, somehow less important than features, and then just accepting a giant rewrite a few years later. A continuous portion of upkeep would likely be more economical, and real engineering has always included an aspect of economy: cost vs. benefit.
IMHO the loaded accusation of emotional attachment might be rooted in an "us vs them" attitude (SRE vs software engineering) that should have no place in a sober discussion on the value of simplicity and it diminishes an otherwise great text.
>> Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?"
> the paragraph starting with "Because engineers are human beings who often form an emotional attachment to their creations, ..." is really out of place.
FWIW I’ve definitely encountered developers clinging to things when the business context has completely changed. I totally recognise the scenario in the original text.
Sure, but if we argue that these values and principles should be applicable, then it should also be possible to make an argument why and not blame the irrationality on emotions.
It seems more likely that bounded rationality is at play here, where different parties only know part of the picture (and fail to bring these together and find out what would be best globally.)
I don’t follow why we shouldn’t blame the irrationality on emotions. Emotions are massively important, and people do irrational things because of them all the time. Why pretend that’s not true?
Fear, I think. It's unpleasant to know someone is harming the common good because they're being selfish. Some people can handle that knowledge and admit "well someone clearly owed their promotion to a legacy module/abandoned code/unused system". Whereas others are that someone or are afraid of that someone, and they make up a bunch of excuses.
This dynamic comes up often in engineering management.
The question is not whether emotions can cause people to be irrational. They can!
Not every case of irrational behavior is caused by emotions though. And when we are making an argument that people are acting against their own interests, it may help to ponder what makes them do so. All the more when we are claiming principles and values that should be accepted by everyone.
"If you don't believe me you are acting irrational / it's because you are emotionally attached" does not seem to be an attitude that gets closer to real causes in a discussion on how to best seek simplicity, but rather a recipe for avoiding discussion or a "thought-terminating cliche."
There must be a better argument for convincing people to let go of code / clean up etc.
But, if it is the truth, then it seems to me like it does get us closer to the truth. I don’t think irrational means there is no explanation for the behavior, it’s closer to saying that the behavior is not in anyone’s best interests, it’s not thought through.
Clinging to code you wrote is a very natural thing to do, there are many reasons we do it, and most of the time it’s irrational.
I'm an SRE and I disagree too, though I think you're giving SREs too much credit in the category of our hegemony for an "us vs them" debate. Maybe at Google, SWEs having relationships with their code base is a well-studied thing. It could also just be someone's opinion that made its way unchallenged into the book. That is to say, Google SRE wasn't the best or last iteration of SRE.
I personally think systems evolve the way you describe because of a system of incentives. There are more incentives for features than there are for refactoring and fixing non-top-priority defects. This comes from the people who hold the power to shape incentives, and they often do so with conflicting priorities and superficial understandings of the existing incentive structure.
I'd also like to say that it's my own personal theory that systemic issues can only be caused by systemic forces. Individual mindsets cannot be to blame then; if a mindset has become systemic (example: SWEs overly attached to code and features) then your next question should be "why?". There's a system that enforces that, and if you don't look beyond personal obsession then you'll never find it.
I like this way of saying it. I don't think anything here is well studied at all. It is not that we are all fishing in the dark, but the organizational structures that determine the conditions in which software development and operations happen are not well understood. I found Herb Simon's writings and his concept of bounded rationality very lucid.
When we shift from "reliability" to "safety" we also need to shift from the individual to the system.
But people do get attached to their creations, they don't want their things deprecated/removed, since to them it may feel like their thing is thrown away or wasted work down the drain.
While they may not obviously state it as such, it can be the underlying reason driving their arguments (e.g. sunk cost fallacy).
Maybe this is also about the desire to create, which of course is also common in engineering. It does not contradict my argument that the cost of maintenance and operations is being ignored, e.g. when one creates things all the time and never removes anything. And it should be possible to measure or estimate that cost.
I think the examples the paragraph gives more than backs up the statement. I’ve met people who comment out code instead of deleting it (luckily not in a long time!) and I feel the authors speak from experience here.
Curious what examples do you see there. I don't doubt the experience.
When I draw analogies from my past experiences to present situations, that does not mean my past experiences are the best way to convince people of what the right thing to do is. I still need to do the hard work of pointing out what is in the common interest and why, e.g., deleting stuff and simplifying is good.
In such a discussion it won't help me to say that people who disagree with me are generally just being emotional, will it? Even if I may have encountered people with such emotional reactions.
Unfortunately, the complex solutions we have accumulated over time usually exist because the business did not want to spend a bit more up front to come up with a cleaner solution.
In the same way, the business is also very reluctant to spend time/money on cleaning things up.
I never ever had to make up complex stuff on my own. It always happens on its own.
I think that is being transparent about what actually happens in the real world (engineers, at least in part, being human and emotional in their decisions), rather than just talking about impossible ideals (engineers weighing tradeoffs in a purely objective manner).
NIH, CV based development, preference for shiny/new things and a myriad of other "engineer/organizational diseases" exist, you know. And there are even SaaS/PaaS/XaaS marketing teams exploiting such human qualities when making software sales.
No, that had nothing to do with emotional attachment. It’s a short phrase to remind people that they can’t make each device special with one-offs, because it needs to be repeated/destroyed all of the time.
Separately, cattle vs pets is much older than containers. It got popular with ephemeral EC2 instances when people were first forced to grapple with lifetimes of VMs measured in hours and the ability to scale massively as needed.
A little of both, I think. I remember having decommissioning ceremonies for long-lived, specifically-named servers, and I remember the era when people were proud of astoundingly long uptimes. Both of those things are aspects of pet-hood that treating servers like cattle changed.
I never took that as dealing with emotional attachment, it was just a shorthand to express that at any moment you would kill cattle so don't do things you can't easily replicate.
Just remember that what Google writes in these kinds of things is not universal. It's written from their very unusual circumstances. You can certainly pick nuggets that are more universal than others but, as in many other instances, too much unnecessary work is spent trying to imitate Google and others when it's not really needed. And no, you won't turn into Google overnight; you will have time to adapt if fortune hits you. Some things are not even necessarily good advice at all, but rather a product of incentives within Google (and perhaps most tech corps) rewarding the aesthetics of "innovation".
"If you are only one developer" suggests zero interests in being nuanced.
To be clear, this linked specifically to simplicity, which I'm certainly in favor of emphasizing the importance of. But IME the exact opposite happens when people try to imitate Google overall in a smaller setting, where too many resources are instead spent on meta-issues rather than on the product being developed.
I think you’re arguing with someone who isn’t here.
Nobody is endorsing the practices in TFA “because it’s Google”/in order to be like Google. Sure, people elsewhere make those claims all the time, and they’re wrong, but that’s not in evidence here that I can see.
The article does seem to come pretty close to universally applicable good ideas. Not because of where its author works, but because of the content.
> Nobody is endorsing the practices in TFA “because it’s Google”/in order to be like Google. Sure, people elsewhere make those claims all the time, and they’re wrong, but that’s not in evidence here that I can see.
I disagree, I think we can see this time and time again. YMMV I guess. It's an encouragement to be vigilant against over-engineering when you don't need it, because you're not Google. I'm not saying that the content is bad; it's a worthy read. Just don't get overeager like during the OOP craze, when people would attempt to bend everything into a maze of design patterns because they took whatever books they read way too far. Most of the chapters have YAGNI parts for smaller settings, but it's still worth knowing what the next steps are.
Even within Google, this is not universal. I doubt the majority of SREs at Google have even read the "Google SRE book".
On the other hand, the book has some nuggets that make it worth reading. But it should be treated as a collection of essays from some very senior SREs rather than a manual.
> I doubt the majority of SREs at Google have even read the "Google SRE book".
That's absolutely true, but by design. SRE already had exceptional horizontal knowledge transfer before any book. The book was published (specifically, published) to extend that knowledge transfer outside of Google's own walls so the rest of the industry could also benefit.
I was a SWE-SRE for several years and absorbed a lot just through talks and postmortems. I left Google and joined another big org that was a ... bit ... less far along in its SRE ambitions. I couldn't convince a single person to read the book, even after linking to specific sections to explain why e.g. jitter is important to have in tandem with backoff. Nobody cares, they have boxes to tick and janky bullshit to ship.
Most people are not interested in learning from books these days, inside or outside Google, but at least inside Google you can learn from an unbroken lineage of experienced SREs.
If you're trying to build out an SRE org in your own company, you're better off hiring one ex-Google Senior SRE than you would be if you bought everyone a copy of the book and two weeks of formal training. Actually embedding real-world experience into your teams is the most effective form of knowledge transfer available.
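For anyone who hasn't run into the jitter-plus-backoff point above, the whole idea fits in a few lines. A minimal sketch of the "full jitter" variant (names are mine, not from the book):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Retry delay with exponential backoff plus 'full jitter'.

    Randomizing over the whole interval spreads retries out, so clients
    that failed at the same moment don't all hammer the server again in
    lockstep; plain exponential backoff alone keeps them synchronized.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

In a real retry loop you would `time.sleep(backoff_with_jitter(attempt))` between attempts; the cap keeps delays bounded once the exponent gets large.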
At the last “real” job I tried to help implement this, as a member of and later the manager of the ops team. It’s a great start, but in that case management wanted the idea of devops/SRE but didn’t actually support it, and it really was a shit show. If you have a bad CTO and leadership at the board level, no amount of re-tooling will paper over their lack of support for the real principles.
Glad to see those valuable principles written down, even if it seems we are heading in the complete opposite direction. At least we can try to apply them in our side businesses.
These were also true in the early days of aviation:
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” — Antoine de Saint-Exupéry
> Source control systems make it easy to reverse changes
I have not observed this to be the case. After a few revisions there are so many changes that the code cannot be reversed without losing a lot. A mechanism that cuts out the soon-to-be-dead code, like a flag, is better. But perhaps I’m doing something wrong.
A lot of preaching that bears little resemblance to what Google is actually doing in reality. IMHO the only people who actually understand what "simplicity" means in software are those who have tried to build anything in highly resource-constrained environments.
A taxonomy of what we mean when we talk about "resource-constrained" might be helpful for those seeking to gain this knowledge. Limited CPU, RAM, etc are the obvious contenders - but then there's also "resource-constrained" as in "I'm the solo dev of this project and have 5 hours in a good week to work on it", or "this runs in a weird place without Internet that I only get access to twice a year". I've been in all of these situations, sometimes multiple at the same time, and they've been great forcing functions to find new paths towards simplicity.
You also have to keep in mind the scope and timeline of where these principles apply. I'm sure someone would be able to apply them to their own work most of the time but if you look at a company as a whole, unless someone at the top is really pushing for global simplicity, things are pretty messy most of the time.
I'm just saying this because Google might be doing this in little islands, not as a company strategy. I don't really know and can only guess from the outside.
> "Why don’t we gate the code with a flag instead of deleting it?" These are all terrible suggestions. Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion.
In most cases deleting code is a good idea, but it's a stretch to say that source control systems make reverting easy. After a few months most developers will have forgotten about those lines, and at times leaving the code in (commented out) and explaining it explicitly might be a better way to preserve knowledge than relying on digging through Git.
I've seen it be a company culture thing where every discussion was resolved with we'll put it behind a config/flag. It's an easy way to avoid hard choices. It's probably something like that the author refers to.
This reads for me as a reflection of Google politics/org structure. The SRE org positioning itself as the guardian of system design vs. the SWEs who are agents of complexity. Doesn't feel healthy to me. The principles are fine but it's the SWEs that should be talking and applying them because they are "closer" to the decisions.
Maybe an unpopular opinion, but this type of content is useless and serves no other purpose than feeding the already bloated Google cargo-culting machine.
Not OP but I'll give my take because I mostly agree. Because Google lives in a world few of us do. I'm SRE/DevOps and our lives are nothing like Google SREs. We have almost zero control over software that is chucked our way. Any attempt to try and control them fails with management telling us "Just fucking ship it". Finally, something I realized after working with various FAANG SRE types, they don't understand what bad development practices look like, they can't imagine it.
Google invented the term SRE. And by your own words… “our lives are nothing like Google SREs”.
The whole point of Google inventing a new title and team, from Ben Treynor’s mouth, was that ops should be superseded by a specialization of SWE called SRE.
If your company doesn’t support that, it’s not SRE.
Because it's just useless. I mean seriously, what valuable insight does anyone get from that? It's some sort of truism wrapped in a word sandwich, ready for LinkedIn lunatics to pat themselves on the back while sharing it. Do you feel you've gained something by reading it? Is this a valuable piece of intelligence that would guide your future decisions? Will you bring this to the team during an argument to push your agenda? This feels like the same type of 'feel good' content that people read and then feel like they did something productive. But I would argue that every piece of insight coming from a mega corp, valuable inside that mega corp, is actually dangerous outside it when people take it as dogma and try to apply it. SRE in general is something that, IMHO after decades of working in the industry, has poisoned the industry with half-assed cargo-cult implementations. But it has Google branding, so it must be valuable, hits hard for the fanboys, and obviously can and should be applied in every company and every context.
I also find it ironic to see 'Simplicity' touted by the same people who let Kubernetes loose in the wild, but that's a different story for a different time.
That's circular reasoning ("it's useless because it's useless").
If you haven't gained any insights from reading that content, maybe it doesn't apply to you or you don't know what you don't know.
> valuable inside the mega corp is actually dangerous outside when people take it as dogma and try to apply it.
mega corp or not, dogmatic principles are usually bad coming from anywhere. The SRE book contains insights that apply to startups, medium-sized companies, and mega corps. It's not prescriptive for a reason.
I’ve honestly never worked on software in an environment where the advice in the article wasn’t important to keep in mind. Personal projects, single digit employee count startups, growth stage, ancient and slow moving Perl monolith shops … they all needed to keep the principles of simplicity, boringness (boring.tech is a great reiteration of this) and continually self-auditing to reduce inherited complexity in mind.
Whether or not Google interprets this advice in a sane way, or whether they actually follow it, are separate issues, but I think the advice is timely and (at least in my experience) important for many people to hear, regardless of where its author works.
Does this include instructions on accidentally deleting a customer's account? Because that's what Google does. I don't think I want to take any advice from Google on anything.
Your argument would be stronger if you could list a few cases like that latest high profile one where GCP deleted some enterprise customer's account. A single one won't cut it for "that's what Google does".
With Google, the deletions almost always are intentional, not accidental, and this is a huge problem with it. Google (not GCP) deleted ten years of my data without warning or notification or remorse or recourse even though I was doing nothing illegal. Amazon would never do something like it. To Google, once a customer or service becomes just 1% inconvenient, it's time to get rid of the customer or service. It's a very valid concern.
I challenge you to find an organization that has never made a mistake. Truth is the uptime and reliability of Google services is very good, while operating at huge scale. And I have no association with Google whatsoever.
The cloud is just someone else's computer. Amazon and Microsoft engineers can make the same mistakes too. Take backups and test them regularly and you'll be OK.
SRE has got to be one of the organisations that have done the most damage in the big G. They were given a license to mandate things based on philosophical musings backed with no science, and they can decide what's best and should be done without any data, just based on feels. They also have a culture of misanthropy, patronization and contempt towards devs. From what I can tell anyway.
> they can decide what's best and should be done without any data, just based on feels.
The book is exactly the opposite of this. The Principles chapter alone talks about many things that involve actually dealing with numbers (SLOs, measuring complexity, etc.).
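To make that concrete, the SLO material reduces to very simple arithmetic: the error budget is just the complement of the target. A minimal sketch (function name is mine, not from the book):

```python
def error_budget_minutes(slo: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of allowed unavailability in a window for a given SLO.

    The budget is simply (1 - SLO) of the window: a 99.9% target over a
    30-day month leaves ~43.2 minutes in which downtime is acceptable,
    and can be deliberately 'spent' on risky rollouts rather than
    treated as a defect.
    """
    return (1.0 - slo) * window_minutes
```

The point is that decisions like "can we ship this risky change this week?" become a numbers question against the remaining budget, not a feeling.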
Be google SRE. Elite software engineer. Cool under pressure.
Pager goes off! Grab Pixel. Press fingerprint reader until it lets me enter my passcode. Ack page. Put down whisky. Shake self. 5 minutes to be logged in and dealing with the problem.
Check alert, see playbook, ignore playbook. Check which cell the problem is in. Correlate with rollouts. See a match. Roll back poorly tested dev promo project. Charts recover. Alert not firing.
Google’s “best practices” lead them to deleting an entire customer’s $135 billion pension account [1]. I’m surprised anyone is still reading anything Google writes.
You’re assuming that those systems were all implemented to the letter of that guide. That’s never the case. Often these type of guidelines are written to address recurring problems found in an organization.
> If we should only read things written by organizations that make no mistakes, then we will never read anything.
That was a “mistake” that should not have even been possible. If the pension fund had not used a multi-cloud strategy, the entire business would have been lost. A mistake is misconfiguring Kafka and losing some data; deleting an entire account should not be given a pass.
> The recent postmortem says they were able to recover from backups on gcp, so I don't think this is true.
“UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service. UniSuper thankfully had some backups with a different provider and was able to recover its data, but according to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15.”
Google didn’t recover the data, the customer recovered their data from a different cloud provider.
> Any other customer using GCVE or any other Google Cloud service.
> The customer’s other GCVE Private Clouds, Google Account, Orgs, Folders, or Projects.
> The customer’s data backups stored in Google Cloud Storage (GCS) in the same region.
...
> Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and ... were instrumental in aiding the rapid restoration.
Emphasis mine.
You're quoting, as far as I can tell, an ArsTechnica article that makes unsourced claims about backups being deleted, neither UniSuper's nor Google's previous statements ever mentioned anything about backups being deleted.
> were instrumental in aiding the rapid restoration.
I don’t call 13 days a rapid restoration. I also don’t trust Google’s post-mortem documentation more than an independent news organization to be honest about what really happened. Especially while Google is actively gaslighting their users about the errors in its AI search [1].
It is, we'll go with, weird, to presume a random news article making baseless claims is correct over the, like, actual people who addressed the problem.
I'll reiterate: no one involved in the restoration (UniSuper or Google) ever said anything about Google's backups being deleted; in fact, basically everything Google and UniSuper have said specifies that it was only the VM config that was removed. Ars made up the thing about backups being deleted, which makes an exciting headline, but it doesn't appear at all reliable or based in reporting, just conjecture.
> Ars made up the thing about backups being deleted, which makes an exciting headline, but it doesn't appear at all reliable or based in reporting, just conjecture.
So you just label a reputable news outlet as fake news and then move on..?
“This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”
“UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.”
Those quotes were pulled directly from UniSuper’s website. Google deleted an account, lost the data, and then took 13 days to recover the pension fund from data stored on another data provider. Maybe you should consider that your employment at Google is damaging your objectivity.
Can you please quote in particular where unisuper mentions that the backups at Google were deleted?
Like I keep saying, nothing in the primary sources supports the claims either that backups or that the account were deleted. You've jumped to a particular conclusion, and seem unwilling to adjust that conclusion in light of new evidence.
You quoted two paragraphs, neither of which mentioned either Google losing backups or UniSuper's account being deleted. In fact, what you quoted aligns perfectly with what I've been suggesting the whole time.
You're making a strong claim, I'm asking you to source it specifically. Instead you're taking a statement from which you can draw multiple conclusions, and picking one (that has been contradicted repeatedly) and telling me I'm unwilling to accept the facts. But they aren't facts, they're your interpretation of vague statements.
I'm happy to accept facts. Facts like "Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion" are very easy to understand and difficult to misinterpret. Do you disagree?
“UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration.”
Again, the same quote I already quoted above, direct from UniSuper’s website. They needed to use their backups at a different cloud provider, as GCP’s data wasn’t recoverable. I don’t know why you’re arguing so strongly against this.
That they used external backups doesn't actually imply anything about the GCS backups being unavailable. And Google's press release explicitly notes that unisuper used both. (And there's all kinds of reasons to have used both, both good and bad)
Put formally, we have statements that
- 1. A and B exist
- 2. A was used
- 3. A and B were used
Your conclusion from these statements is that, because (2) A was used, therefore B does not exist. Hopefully putting it like this makes it clear why I'm so confused.
> That was seven years later. Maybe the problem is that Google stopped reading what Google wrote.
The problem is that it was never that good. Anyone who has used K8s at scale will tell you at length how it doesn’t scale. People should stop focusing on tech companies like celebrities and focus instead on domain problems related to their business.
The funny thing with k8s is that Google doesn't use it (except GKE, and there's a reason it's one cluster per customer).
Their internal tooling scales just fine, but all it shares with k8s is some of the underlying concepts. Unlike, say, Bazel, gVisor or Gerrit, which are the real thing (minus some secret sauce tied to internal infra). k8s is good software, and best-in-class when it comes to open source options, but the idea that it is "open source Borg" is silly.
> k8s is good software, and best-in-class when it comes to open source options
No it isn’t, it’s a solution in search of a problem that is needlessly complex, wastes engineering cycles on what could have been product development, and has violated every principle of orthogonal design.
Oh, completely ignoring anything anyone from Google ever writes again? This is akin to the cancel culture which we all know is how society should work. /s
> Oh, completely ignoring anything anyone from Google ever writes again? This is akin to the cancel culture which we all know is how society should work. /s
Maybe if Google focused on doing actual work instead of writing feel good engineering pieces, they wouldn’t have the Google graveyard and an unstable cloud offering that may spontaneously delete multi-billion dollar accounts.