As quite a few Googlers appear to read and write on HN, I'd really welcome some insider info on what's been going on over the last few days.
Sure, there will be some internal turmoil going on right now, but isn't there some non-confidential info to share? I can't imagine this would hurt Google's image in either the short or the long run, quite the opposite.
I don’t work at Google, I’m at a different big tech that’s in the news frequently. Sharing inside info on an ongoing incident is a great way to get fired. Big tech companies are way different than startups where everyone can do a bit of anything. There are people whose job it is to handle that communication. You make their job a lot harder if you disclose information. The company is so big that as an engineer you may not know all the factors involved in what would hurt the company long term - undisclosed relevant litigation, compliance commitments, partner obligations, etc.
How much do you hate it as an engineer when sales people make tech promises to customers without asking you? For comms people, engineers leaking info publicly feels the same way.
I am very pleased to see this response, genuinely. Our technical curiosity aside, there are literally people and teams at such big firms dedicated to this.
What you're saying makes sense but I don't think it really applies to anything the OP said. The "non-confidential" qualifier indicates to me that they only want people to share what they can responsibly.
And the parent post’s point is that there are people whose job it is to specifically share that information, and so we should let them do their job. They are the domain expert in this particular task.
In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.
Googler but nowhere near Gmail, so just educated speculation:
* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)
* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).
> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.
As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)
You can build rollbacks out of roll-forwards, although it certainly isn't particularly fun: you take version N's code, patch its version code so that it's higher than N+1's, and roll it out as an "N+2" that is really just N relabelled.
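A minimal sketch of that relabelling trick, assuming a release system keyed on monotonically increasing version codes (the Release type and artifact names here are hypothetical, not any real tooling):

```python
# Hypothetical "rollback via roll-forward" sketch: the release system only
# moves to strictly higher version codes, so to get back to N's behavior we
# re-release N's artifact under a new, higher version code.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Release:
    version_code: int   # monotonically increasing; rollouts only move forward
    artifact: str       # the build actually being served

good = Release(version_code=100, artifact="app-N.tar.gz")    # known-good N
bad = Release(version_code=101, artifact="app-N+1.tar.gz")   # broken N+1

# "Roll back" by rolling forward: N's artifact, relabelled to outrank N+1.
rollback_as_rollforward = replace(good, version_code=bad.version_code + 1)

assert rollback_as_rollforward.version_code > bad.version_code
assert rollback_as_rollforward.artifact == good.artifact
print(rollback_as_rollforward)  # Release(version_code=102, artifact='app-N.tar.gz')
```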
You don’t really have to speculate, they disclosed yesterday that yesterday’s issue had to do with the automated quota system deciding the auth system had zero quota:
Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.
> It blows my mind that Google can even have problems like this.
When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.
Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.
But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.
When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.
I speculate that for many companies, work from home has been, at most, less disruptive than they thought.
However, I'd speculate that in this instance, when you get that 0.0001% problem, having fewer hands on deck makes the work-from-home aspects harder. Akin to fixing somebody's PC remotely versus standing behind them.
With that premise, I'd speculate that in this instance remote work, whilst not the root cause, may have been a small ripple that fed into that root cause and/or led to a slower resolution than they would normally manage.
Those speculations aside, it does highlight that some tooling needs to adjust for remote workers, as do designs and set-ups. Water-cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the medium, and so the kinks and areas that need polishing can be polished and made better for all.
Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match some people's anecdotes out there and resonate with others.
You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.
Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the googlers. Hope they get this right.
I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing on the sender MTA side, etc.), and there's an obvious hard boundary at the user-mailbox level you can use to partition your system.
It should not be a problem that Gmail is "down". Unless this went on for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code rather than a temporary one.
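As a rough sketch of what those sender-side semantics look like, using Python's smtplib (the host, the single-recipient handling, and the requeue/bounce labels are illustrative, not anyone's actual MTA):

```python
# Sketch of sender-MTA retry semantics: 4xx responses mean "try again later",
# 5xx responses mean "bounce now".
import smtplib

def try_deliver(message: str, sender: str, recipient: str, mx_host: str) -> str:
    try:
        with smtplib.SMTP(mx_host, timeout=30) as smtp:
            smtp.sendmail(sender, [recipient], message)
        return "delivered"
    except smtplib.SMTPRecipientsRefused as err:
        code, _reason = err.recipients[recipient]
        if 400 <= code < 500:
            return "requeue"   # temporary failure: keep it in the queue, retry later
        return "bounce"        # permanent failure (e.g. 550): notify the sender now
    except (smtplib.SMTPServerDisconnected, OSError):
        return "requeue"       # connection trouble is also treated as temporary
```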
It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily.
If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of which email addresses are valid, and the accepting server has to look it up and react in the time that the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.
They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.
I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.
Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.
I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.
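A toy version of that classification, using gRPC-style status names as stand-ins (the bucketing here is a guess for illustration, not Google's actual mapping):

```python
# Toy classification of backend errors into transient vs. permanent.
# The status names are gRPC-style stand-ins; the right bucket for some of
# them genuinely depends on the calling application, which is why this
# kind of mapping keeps needing iteration.
TRANSIENT = {"DEADLINE_EXCEEDED", "UNAVAILABLE", "RESOURCE_EXHAUSTED", "ABORTED"}
PERMANENT = {"NOT_FOUND", "PERMISSION_DENIED", "INVALID_ARGUMENT"}

def is_transient(status: str) -> bool:
    if status in TRANSIENT:
        return True
    if status in PERMANENT:
        return False
    # Unknown errors: treating them as transient (retry) is usually the
    # safer default for a mail pipeline, but it's still a judgment call.
    return True
```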
If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaved email server should send temporary failure codes when it can't look up whether addresses are valid, and let the sender retry later once the address lookup is working and it can give a definite acceptance or rejection. This is not even remotely a new problem; it comes up in email systems all the time, because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
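A sketch of that RCPT-time decision, with an in-memory set standing in for the real user directory (the status strings and lookup are illustrative, not Gmail's actual stack). The key point is that a failed lookup should map to a 4xx, not a 5xx:

```python
# Sketch of the accept/reject decision an SMTP server has to make while the
# sender is still connected. The in-memory set stands in for whatever
# globally consistent directory the real system consults.
class DirectoryUnavailable(Exception):
    """The user-lookup backend failed or timed out."""

KNOWN_USERS = {"alice@example.com", "bob@example.com"}  # illustrative only

def user_exists(address: str) -> bool:
    # Stand-in for a distributed lookup that can itself fail.
    return address in KNOWN_USERS

def rcpt_decision(address: str) -> str:
    try:
        if user_exists(address):
            return "250 OK"                  # accept: the mailbox exists
        return "550 5.1.1 No such user"      # permanent: we *know* it doesn't exist
    except DirectoryUnavailable:
        # We don't know either way, so ask the sender to retry later.
        return "451 4.3.0 Temporary lookup failure, try again later"

print(rcpt_decision("alice@example.com"))   # 250 OK
print(rcpt_decision("nobody@example.com"))  # 550 5.1.1 No such user
```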
> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.
You don't have milliseconds. You can take quite some time to handle the client, tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for the HELO response is 5 minutes.
If there is one thing I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how distributed your architecture is, you will always have single points of failure (SPOFs), and sometimes you discover SPOFs you didn't think of.
Sometimes it's a deployment script that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example, when AWS routed all production traffic to the test cluster instead of the production cluster).
Well, Google did use a bunch of off-the-shelf technologies in the early days, but by now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.
There used to be times when people didn't care for technicalities like this because the focus was on the person's contribution to the discussion.
Now that everyone's replaceable, the popular culture desperately tries to shift focus into arguing about pronouns and terms.
Watch out, this is a road to nowhere. Forcing others to use the right pronoun won't build up your retirement fund, but will distract you from worrying about not having one. And the fact that you care about it more than about your opponent's T-shirt color could be an indication that you are being manipulated to not think about the long-term things.
This is where it crosses from insightful into conspiracy theory territory for me. People seem perfectly capable of groupthink-deluding themselves. Why cheapen your argument by postulating some master manipulator when it's not necessary for the deeper point you're making?
It will only lead to people focussing the discussion on challenging this particular aspect, or disregarding all you've said, instead of engaging with the actual meat of the argument.
'Singular "their" etc., was an accepted part of the English language before the 18th-century grammarians started making arbitrary judgements as to what is "good English" and "bad English", based on a kind of pseudo-"logic" deduced from the Latin language, that has nothing whatever to do with English... And even after the old-line grammarians put it under their ban, this anathematized singular "their" construction never stopped being used by English-speakers, both orally and by serious literary writers.'
The same reason it ever mattered how you refer to people, politeness and respect. If someone you consider "him" asks you to refer to them as "her" it's like someone asking you to call them by their full name "Rebecca" instead of "Becky" or "Jonathan" instead of "Jon". If you like and respect them, you do as they request because things which matter to them matter to you, and being polite to them is important to you. If you ignore what they ask, call them what you want, you communicate that you don't respect them and don't want to be polite, that you want to dominate and 'win' instead.
> "Pronouns can mean whatever you want them to mean"
Only one way. A specific person asking you to use a specific pronoun for themselves is wildly different from you unilaterally and universally saying that all women should feel included by the word "him" because "him" has no meaning anymore.
That varies based on location and regional dialect. Here in the northeast US, I remember using singular they/their since the 80s. It would be interesting to know when this became popular elsewhere.
80s in Australia too, been hearing/using it my whole life.
Though with respect to 'ages', apparently it's been around since at least the 14th century, but certain purists tried to stamp it out at various times (just like the singular 'you', which I hope no one currently has grammatical issues with).
I remember some people tried to bring BLM into German discussions, which made absolutely zero sense, as we have a completely different history and culture. Now I see this popping up. I really hope Europe can get some cultural distance between itself and the USA in the near future. The time is ripe.
I wonder if Gmail is just not a very well maintained codebase. Here's an issue where old emails just become inaccessible. Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.
Some of the internal formatting that Gmail uses has changed over the years, so more likely than not the API that parses the stored message for display in the Gmail UI is just throwing some kind of error.
Right but the version of an email message you download via IMAP is different than the version of an email message you see in the Gmail UI. That's my point, that the error is probably in the way Google is processing messages for Gmail, so you wouldn't see it in IMAP or via the API.
Yes, I’ve been hearing about this issue from non-technical friends too. An explanation of “X crashed” helps even if they don’t actually understand what X is. The fact that someone figured it out and knows is reassuring.
Uneducated speculation: some sort of security incident. Whenever there is a major security issue in the wild, one of the big providers tends to have a problem within a few days.