
In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world-class DevOps people who roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.



Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).


> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."


Rollback-proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting).


Is "exciting" a synonym for "harrowing" where you're from? :P


The Chrome Web Store has no rollback strategy; there is only roll forward :(


You can build rollbacks out of roll-forwards, although it certainly isn't particularly fun: you take version N's code, give it a version code higher than N+1's, and roll it out as a new N+2 release that is really N.
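
To make that concrete, here's a minimal sketch of the bookkeeping involved; the version numbers are made up and this isn't the Web Store's actual API, just the "re-label the known-good code as a newer release" idea:

    # Sketch of "rollback by rolling forward" for a store that only accepts
    # monotonically increasing versions. All version strings here are made up.

    def next_version(version: str) -> str:
        """Return a version string strictly greater than the given one,
        e.g. '2.4.1' -> '2.4.2'."""
        parts = version.split(".")
        parts[-1] = str(int(parts[-1]) + 1)
        return ".".join(parts)

    known_good = "2.3.0"   # version N: the last release that worked
    broken = "2.4.1"       # version N+1: the release you want gone

    # Re-publish the N codebase under a version code above N+1, so the
    # store treats the "rollback" as just another forward update.
    republish_as = next_version(broken)   # -> "2.4.2"
    print(f"shipping the {known_good} code again, labelled {republish_as}")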


> what if you're in a situation where rolling back could make the problem worse?

Here come the poison pills!


You don't really have to speculate; they disclosed yesterday that yesterday's issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003


Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management system reduced the capacity for Google's IMS in the first place.

Maybe we will know someday :)


Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.


> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.


Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.


I speculate that for many companies, work from home has been, at most, less impactful than they thought.

However, I'd speculate that in this instance, when you hit that .0001% problem, having fewer hands on deck makes the work-from-home aspects harder. Akin to remotely fixing somebody's PC rather than standing behind them.

With that premise, I'd speculate that in this instance, while not the root cause, it may have been a small ripple that contributed to that root cause and/or led to a slower resolution than they would normally get.

Those speculations aside, it does highlight that some tooling needs to adjust for remote workers, as do designs and set-ups. Water-cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the work medium, and so the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match some anecdotes out there and resonate with others.


You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.


All the access to the services is remote, but I'd say having the entire team in the same room does help coordinate incident response.


Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the googlers. Hope they get this right.


When I was there they had an IRC network for this reason. I hope they still do. Not quite the same as VoIP but fewer dependencies...


That's why the network folks at Google and AWS use IRC for just that purpose. Simple, no external dependencies, just works.


Software isn't as simple as splitting it across different locations to prevent global failures.


I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing on the sender's MTA side, etc.), and there's a natural hard boundary at the user-mailbox level you can use to partition your system.

It should not be a problem that Gmail is "down". Unless this were happening for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code instead of a temporary one.
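
For illustration, here's a rough sketch of why the class of reply code is the whole ballgame for the sending side; this is generic MTA behaviour defined by the SMTP spec, not anything Gmail-specific:

    # Rough sketch of how a sending MTA reacts to the receiver's reply code.
    # 4xx means "try again later" (mail stays queued); 5xx means "give up
    # and bounce".

    def sender_action(smtp_code: int) -> str:
        if 200 <= smtp_code <= 299:
            return "delivered"
        if 400 <= smtp_code <= 499:   # transient failure
            return "keep the message queued and retry later"
        if 500 <= smtp_code <= 599:   # permanent failure
            return "generate a bounce (DSN) back to the sender"
        return "treat as a protocol error"

    print(sender_action(451))  # temporary: nothing is lost, delivery is delayed
    print(sender_action(550))  # permanent: the message bounces immediately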


It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily.

If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of what email addresses are valid, and the accepting server has to look it up and react while the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.
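
As a sketch of what that accept-or-reject decision looks like at the SMTP layer (using aiosmtpd and a made-up lookup_user() directory call, not Gmail's actual stack), the key point is that an authoritative "no such user" and a failed lookup should produce different reply classes:

    # Sketch only: aiosmtpd-based receiver with a hypothetical user directory.
    # The directory here is an in-memory set; a real one would be a distributed
    # lookup that can also fail or time out (LookupUnavailable).

    from aiosmtpd.controller import Controller

    class LookupUnavailable(Exception):
        """Raised when the user directory can't answer in time."""

    KNOWN_USERS = {"alice@example.com", "bob@example.com"}  # stand-in directory

    async def lookup_user(address: str) -> bool:
        # Placeholder for the real directory query; a production version
        # would raise LookupUnavailable on timeouts or backend errors.
        return address.lower() in KNOWN_USERS

    class RcptValidatingHandler:
        async def handle_RCPT(self, server, session, envelope, address, rcpt_options):
            try:
                exists = await lookup_user(address)
            except LookupUnavailable:
                # Can't answer right now: tempfail so the sender queues and retries.
                return "451 4.3.0 user lookup temporarily unavailable"
            if not exists:
                # Authoritative answer: this is the only case that deserves a 550.
                return "550 5.1.1 no such user"
            envelope.rcpt_tos.append(address)
            return "250 OK"

    # Controller(...).start() would serve this handler on localhost:8025.
    controller = Controller(RcptValidatingHandler(), hostname="127.0.0.1", port=8025)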


I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.


Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.
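
A toy version of that categorization might look like the following; the status names are illustrative (gRPC-style), not whatever Gmail's delivery stack actually sees:

    # Toy classifier mapping backend error categories to SMTP reply classes.
    # Unknown errors default to transient, since bouncing mail that should
    # have been queued is the more expensive mistake.

    TRANSIENT = {"DEADLINE_EXCEEDED", "UNAVAILABLE", "RESOURCE_EXHAUSTED"}
    PERMANENT = {"NOT_FOUND", "PERMISSION_DENIED"}

    def smtp_reply_for(backend_status: str) -> str:
        if backend_status in PERMANENT:
            return "550 5.1.1 no such user"
        if backend_status in TRANSIENT:
            return "451 4.3.0 backend temporarily unavailable"
        return "451 4.3.0 unclassified backend error"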


If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaved email server should send temporary failure codes when it can't look up whether addresses are valid, and let the sender retry later once the address lookup is working and it can give a definite acceptance or rejection. This is not even remotely a new problem; it comes up in email systems all the time, because even at much-smaller-than-Google scale they tend to be distributed systems. Someone screwed up.


> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up a another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client: tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for HELO is 5 minutes.


If there is one thing I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how distributed your architecture is, you will always have single points of failure (SPOFs), and sometimes you discover a SPOF you didn't think of.

Sometimes it's a deployment script that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).


[flagged]


Your contribution has greatly enhanced this conversation, thank you.


Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...


Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.


I can’t imagine what part of Google’s history would lead someone to believe there was any third party system in their production stack anywhere.


Now their corporate/finance stack on the other hand... shudder.


Well, Google did use a bunch of off-the-shelf technologies in the early days, but by now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.


Didn't they use GNU/Linux from day one on?


Closed-source like Oracle, I meant. They've been big boosters of all kinds of open-source stuff like Linux, LLVM, MySQL, ...


Hush, you'll scare the shiny-eyed FAANG wannabes away; they aren't supposed to know this until they've been employed for at least two decades.



