As quite a few Googlers appear to read and write on HN, I'd really welcome some insider info on what's been going on over the last few days.
Sure, there will be some internal turmoil going on right now, but isn't there some non-confidential info to share? I can't imagine this would hurt Google's image in either the short or the long run, quite the opposite.
I don’t work at Google, I’m at a different big tech that’s in the news frequently. Sharing inside info on an ongoing incident is a great way to get fired. Big tech companies are way different than startups where everyone can do a bit of anything. There are people whose job it is to handle that communication. You make their job a lot harder if you disclose information. The company is so big that as an engineer you may not know all the factors involved in what would hurt the company long term - undisclosed relevant litigation, compliance commitments, partner obligations, etc.
How much do you hate it as an engineer when sales people make tech promises to customers without asking you? For comms people, engineers leaking info publicly feels the same way.
I am very pleased to see this response, genuinely. Our technical curiosity aside, there are literally people and teams at such big firms dedicated to this.
What you're saying makes sense but I don't think it really applies to anything the OP said. The "non-confidential" qualifier indicates to me that they only want people to share what they can responsibly.
And the parent post’s point is that there are people whose job it is to specifically share that information, and so we should let them do their job. They are the domain expert in this particular task.
In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.
Googler but nowhere near Gmail, so just educated speculation:
* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)
* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).
> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.
As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)
You can build rollbacks out of roll-forwards, although it certainly isn't particularly fun: you take version N's code, patch its version code so that it's higher than N+1's, and roll it out as an "N+2" that is really just N relabelled.
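A minimal sketch of that relabelling trick, assuming a release system keyed on monotonically increasing version codes (the Release type and artifact names here are hypothetical, not any real tooling):

```python
# Hypothetical "rollback via roll-forward" sketch: the release system only
# moves to strictly higher version codes, so to get back to N's behavior we
# re-release N's artifact under a new, higher version code.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Release:
    version_code: int   # monotonically increasing; rollouts only move forward
    artifact: str       # the build actually being served

good = Release(version_code=100, artifact="app-N.tar.gz")    # known-good N
bad = Release(version_code=101, artifact="app-N+1.tar.gz")   # broken N+1

# "Roll back" by rolling forward: N's artifact, relabelled to outrank N+1.
rollback_as_rollforward = replace(good, version_code=bad.version_code + 1)

assert rollback_as_rollforward.version_code > bad.version_code
assert rollback_as_rollforward.artifact == good.artifact
print(rollback_as_rollforward)  # Release(version_code=102, artifact='app-N.tar.gz')
```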
You don’t really have to speculate, they disclosed yesterday that yesterday’s issue had to do with the automated quota system deciding the auth system had zero quota:
Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.
> It blows my mind that Google can even have problems like this.
When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.
Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.
But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.
When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.
I speculate that for many companies, work from home has been, at most, less disruptive than they thought.
However, I'd speculate that in this instance, when you get that 0.0001% problem, having fewer hands on deck makes the work-from-home aspects harder. Akin to fixing somebody's PC remotely versus standing behind them.
With that premise, I'd speculate that in this instance remote work, whilst not the root cause, may have been a small ripple that fed into that root cause and/or led to a slower resolution than they would normally manage.
Those speculations aside, it does highlight that some tooling needs to adjust for remote workers, as do designs and set-ups. Water-cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the medium, and so the kinks and areas that need polishing can be polished and made better for all.
Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match some people's anecdotes out there and resonate with others.
You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.
Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the googlers. Hope they get this right.
I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing on the sender MTA side, etc.), and there's an obvious hard boundary at the user-mailbox level you can use to partition your system.
It should not be a problem that Gmail is "down". Unless this went on for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code rather than a temporary one.
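As a rough sketch of what those sender-side semantics look like, using Python's smtplib (the host, the single-recipient handling, and the requeue/bounce labels are illustrative, not anyone's actual MTA):

```python
# Sketch of sender-MTA retry semantics: 4xx responses mean "try again later",
# 5xx responses mean "bounce now".
import smtplib

def try_deliver(message: str, sender: str, recipient: str, mx_host: str) -> str:
    try:
        with smtplib.SMTP(mx_host, timeout=30) as smtp:
            smtp.sendmail(sender, [recipient], message)
        return "delivered"
    except smtplib.SMTPRecipientsRefused as err:
        code, _reason = err.recipients[recipient]
        if 400 <= code < 500:
            return "requeue"   # temporary failure: keep it in the queue, retry later
        return "bounce"        # permanent failure (e.g. 550): notify the sender now
    except (smtplib.SMTPServerDisconnected, OSError):
        return "requeue"       # connection trouble is also treated as temporary
```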
It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily.
If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of which email addresses are valid, and the accepting server has to look it up and react in the time that the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.
They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.
I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.
Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.
I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.
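A toy version of that classification, using gRPC-style status names as stand-ins (the bucketing here is a guess for illustration, not Google's actual mapping):

```python
# Toy classification of backend errors into transient vs. permanent.
# The status names are gRPC-style stand-ins; the right bucket for some of
# them genuinely depends on the calling application, which is why this
# kind of mapping keeps needing iteration.
TRANSIENT = {"DEADLINE_EXCEEDED", "UNAVAILABLE", "RESOURCE_EXHAUSTED", "ABORTED"}
PERMANENT = {"NOT_FOUND", "PERMISSION_DENIED", "INVALID_ARGUMENT"}

def is_transient(status: str) -> bool:
    if status in TRANSIENT:
        return True
    if status in PERMANENT:
        return False
    # Unknown errors: treating them as transient (retry) is usually the
    # safer default for a mail pipeline, but it's still a judgment call.
    return True
```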
If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaved email server should send temporary failure codes when it can't look up whether addresses are valid, and let the sender retry later once the address lookup is working and it can give a definite acceptance or rejection. This is not even remotely a new problem; it comes up in email systems all the time, because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
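A sketch of that RCPT-time decision, with an in-memory set standing in for the real user directory (the status strings and lookup are illustrative, not Gmail's actual stack). The key point is that a failed lookup should map to a 4xx, not a 5xx:

```python
# Sketch of the accept/reject decision an SMTP server has to make while the
# sender is still connected. The in-memory set stands in for whatever
# globally consistent directory the real system consults.
class DirectoryUnavailable(Exception):
    """The user-lookup backend failed or timed out."""

KNOWN_USERS = {"alice@example.com", "bob@example.com"}  # illustrative only

def user_exists(address: str) -> bool:
    # Stand-in for a distributed lookup that can itself fail.
    return address in KNOWN_USERS

def rcpt_decision(address: str) -> str:
    try:
        if user_exists(address):
            return "250 OK"                  # accept: the mailbox exists
        return "550 5.1.1 No such user"      # permanent: we *know* it doesn't exist
    except DirectoryUnavailable:
        # We don't know either way, so ask the sender to retry later.
        return "451 4.3.0 Temporary lookup failure, try again later"

print(rcpt_decision("alice@example.com"))   # 250 OK
print(rcpt_decision("nobody@example.com"))  # 550 5.1.1 No such user
```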
> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.
You don't have milliseconds. You can take quite some time to handle the client, tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for the HELO response is 5 minutes.
If there is one thing I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how distributed your architecture is, you will always have single points of failure (SPOFs), and sometimes you discover SPOFs you didn't think of.
Sometimes it's a deployment script that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).
Well, Google did use a bunch of off-the-shelf technologies in the early days, but by now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.
There used to be times when people didn't care for technicalities like this because the focus was on the person's contribution to the discussion.
Now that everyone's replaceable, the popular culture desperately tries to shift focus into arguing about pronouns and terms.
Watch out, this is a road to nowhere. Forcing others to use the right pronoun won't build up your retirement fund, but will distract you from worrying about not having one. And the fact that you care about it more than about your opponent's T-shirt color could be an indication that you are being manipulated to not think about the long-term things.
This is where it crosses from insightful into conspiracy theory territory for me. People seem perfectly capable of groupthink-deluding themselves. Why cheapen your argument by postulating some master manipulator when it's not necessary for the deeper point you're making?
It will only lead to people focussing the discussion on challenging this particular aspect, or disregarding all you've said, instead of engaging with the actual meat of the argument.
'Singular "their" etc., was an accepted part of the English language before the 18th-century grammarians started making arbitrary judgements as to what is "good English" and "bad English", based on a kind of pseudo-"logic" deduced from the Latin language, that has nothing whatever to do with English... And even after the old-line grammarians put it under their ban, this anathematized singular "their" construction never stopped being used by English-speakers, both orally and by serious literary writers.'
The same reason it ever mattered how you refer to people, politeness and respect. If someone you consider "him" asks you to refer to them as "her" it's like someone asking you to call them by their full name "Rebecca" instead of "Becky" or "Jonathan" instead of "Jon". If you like and respect them, you do as they request because things which matter to them matter to you, and being polite to them is important to you. If you ignore what they ask, call them what you want, you communicate that you don't respect them and don't want to be polite, that you want to dominate and 'win' instead.
> "Pronouns can mean whatever you want them to mean"
Only one way. A specific person asking you to use a specific pronoun for themselves is wildly different from you unilaterally and universally saying that all women should feel included by the word "him" because "him" has no meaning anymore.
That varies based on location and regional dialect. Here in the northeast US, I remember using singular they/their since the 80s. It would be interesting to know when this became popular elsewhere.
80s in Australia too, been hearing/using it my whole life.
Though with respect to 'ages', apparently it's been around since at least the 14th century, but certain purists tried to stamp it out at various times (just like the singular 'you', which I hope no one currently has grammatical issues with).
I remember some people tried to bring BLM into German discussions, which made absolutely zero sense, as we have a completely different history and culture. Now I see this popping up. I really hope Europe can get some cultural distance between itself and the USA in the near future. The time is ripe.
I wonder if Gmail is just not a very well maintained codebase. Here's an issue where old emails just become inaccessible. Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.
Some of the internal formatting that Gmail uses has changed over the years, so more likely than not the API that parses the stored message for display in the Gmail UI is just throwing some kind of error.
Right but the version of an email message you download via IMAP is different than the version of an email message you see in the Gmail UI. That's my point, that the error is probably in the way Google is processing messages for Gmail, so you wouldn't see it in IMAP or via the API.
Yes, I’ve been hearing about this issue from non-technical friends too. An explanation of “X crashed” helps even if they don’t actually understand what X is. The fact that someone figured it out and knows is reassuring.
Uneducated speculation: some sort of security incident. Whenever there is a major security issue in the wild, one of the big providers tends to have a problem within a few days.