redwards510 · on Dec 15, 2020

In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.

joatmon-snoo · on Dec 15, 2020

Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).

userbinator · on Dec 16, 2020

> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."

missblit · on Dec 16, 2020

Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)

Andrex · on Dec 16, 2020

Is "exciting" a synonym for "harrowing" where you're from? :P

vitus · on Dec 16, 2020

Chrome web store has no rollback strategy, there is only roll forward :(

joshuamorton · on Dec 16, 2020

You can build rollbacks out of rollforwards, although it certainly isn't particularly fun. You patch version N so that its version code is higher than N+1's, then roll that out: an N+2 that is really N relabelled.
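
A minimal sketch of that trick, assuming a store that only accepts strictly increasing integer version codes. The function name and the `published` mapping are hypothetical, made up for illustration, and not any real store API:

```python
def rollback_by_rolling_forward(published: dict[int, str], good_code: int) -> tuple[int, str]:
    """Pick the (version_code, payload) to publish so users land on an old build.

    `published` maps version codes already shipped to their build payloads.
    The old code cannot be republished, so its payload is re-labelled with a
    code higher than anything shipped so far.
    """
    if good_code not in published:
        raise KeyError(f"version code {good_code} was never published")
    next_code = max(published) + 1  # must exceed the bad release's code
    return next_code, published[good_code]


published = {1: "build-N", 2: "build-N+1 (bad)"}
code, payload = rollback_by_rolling_forward(published, good_code=1)
print(code, payload)  # 3 build-N
```

This only restores the old code. Any bad data or state written by N+1 is still there after the "rollback".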

Aperocky · on Dec 16, 2020

> what if you're in a situation where rolling back could make the problem worse?

Here comes the poison pills!

brown9-2 · on Dec 15, 2020

You don’t really have to speculate, they disclosed yesterday that yesterday’s issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003

max_streese · on Dec 16, 2020

Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management reduced the capacity for Google's IMS in the first place.

Maybe we will know someday :)

ravenstine · on Dec 15, 2020

Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.

enneff · on Dec 16, 2020

> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.

marcan_42 · on Dec 16, 2020

Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.

Zenst · on Dec 15, 2020

I speculate that for many companies, working from home has, at most, had less impact than they expected.

However, I'd speculate that in this instance, when you hit that .0001% problem, having fewer hands on deck makes working from home harder. It's akin to fixing somebody's PC remotely versus standing behind them.

With that premise, I'd speculate that in this instance remote work, whilst not the root cause, may have been a small ripple that led to that root cause and/or led to a slower resolution than you would normally get.

Those speculations aside, it will only highlight that some tooling needs to adjust for remote workers, as do design and set-ups. Water cooler talk is not just for gossip, and a counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the work medium, and so that the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match the anecdotes of some out there and resonate with others.

throwaway201103 · on Dec 15, 2020

You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.

marcan_42 · on Dec 16, 2020

All the access to the services is remote, but I'd say having the entire team in the same room does help coordinate incident response.

herodoturtle · on Dec 16, 2020

Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the Googlers. Hope they get this right.

marcan_42 · on Dec 16, 2020

When I was there they had an IRC network for this reason. I hope they still do. Not quite the same as VoIP but fewer dependencies...

kuschku · on Dec 16, 2020

That's why the network folks at Google and AWS use IRC for just that purpose. Simple, no external dependencies, just works.

erhk · on Dec 15, 2020

Software isn't as simple as splitting across different locations to prevent global failures.

megous · on Dec 15, 2020

I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing at the sender MTA side, etc.) and there's an easy hard boundary at the user mailbox level you can use to partition your system.

It should not be a problem that gmail is "down". Unless this were happening for more than a few days, no one would lose e-mail. The problem is that it's not returning a temporary error code, but a permanent one.
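
To make the temporary-versus-permanent distinction concrete, here is a rough sketch of how a sending MTA might act on the reply code it gets back. `next_action` and `MAX_QUEUE_LIFETIME` are hypothetical names, and the 5-day lifetime is an assumed value chosen to match "a few days", not any particular server's configuration:

```python
from datetime import timedelta

# Assumed give-up time for queued mail; real MTAs make this configurable.
MAX_QUEUE_LIFETIME = timedelta(days=5)


def next_action(reply_code: int, time_in_queue: timedelta) -> str:
    """Decide what a sending MTA does with a message after an SMTP reply."""
    if 200 <= reply_code < 300:
        return "delivered"
    if 400 <= reply_code < 500:
        # Temporary failure: keep the message queued and retry later,
        # unless it has already been waiting too long.
        return "retry" if time_in_queue < MAX_QUEUE_LIFETIME else "bounce"
    if 500 <= reply_code < 600:
        # Permanent failure: give up immediately and notify the sender.
        return "bounce"
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")


print(next_action(451, timedelta(hours=2)))  # retry
print(next_action(550, timedelta(0)))        # bounce
```

With a 4xx reply the outage costs only delay. With a 5xx reply the message bounces on the first attempt, however short the outage was.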

jrockway · on Dec 16, 2020

It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily. If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of what email addresses are valid, and the accepting server has to look it up and react in the time that the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email. They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.

jeffbee · on Dec 16, 2020

I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.

jrockway · on Dec 16, 2020

Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.

makomk · on Dec 16, 2020

If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaving email server should send temporary failure codes when it can't look up if addresses are valid, and let the sender retry later when the address lookup is working and it can give a definite acceptance or rejection of the email. This is not even remotely a new problem, it comes up in email systems all the time because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
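
The receiving side of that rule could look roughly like this. It is a sketch only: the `Lookup` enum and `rcpt_reply` are hypothetical names, the reply texts are illustrative, and nothing here is meant to describe how Gmail actually works:

```python
from enum import Enum


class Lookup(Enum):
    EXISTS = "exists"
    NOT_FOUND = "not_found"      # directory answered authoritatively: no such user
    UNAVAILABLE = "unavailable"  # directory timed out, errored, or was unreachable


def rcpt_reply(result: Lookup) -> str:
    """Choose the SMTP reply to RCPT TO based on the recipient lookup outcome."""
    if result is Lookup.EXISTS:
        return "250 2.1.5 OK"
    if result is Lookup.NOT_FOUND:
        return "550 5.1.1 User unknown"
    # Anything that is not a definite answer becomes a temporary failure,
    # so the sender keeps the message queued and retries later.
    return "451 4.3.0 Temporary lookup failure, try again later"


print(rcpt_reply(Lookup.UNAVAILABLE))  # 451 4.3.0 Temporary lookup failure, try again later
```

This only helps when the lookup fails visibly. If the directory itself wrongly returns an authoritative "not found", the mail server still answers 550.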

megous · on Dec 16, 2020

> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client. Tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for HELO is 5 minutes.

eloisant · on Dec 16, 2020

If there is something I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how you design your architecture in a distributed way, you will always have a Single Point of Failure (SPOF), and sometimes you discover a SPOF you didn't think of.

Sometimes it's a script responsible for deployment that will propagate an issue to the whole system. Sometimes it's the routing that will go wrong (for example when AWS routed all production traffic to the test cluster instead of the production cluster).

yudlejoza · on Dec 16, 2020

[flagged]

ink404 · on Dec 16, 2020

Your contribution has greatly enhanced this conversation, thank you.

aprdm · on Dec 15, 2020

Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...

sellyme · on Dec 15, 2020

Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.

jeffbee · on Dec 16, 2020

I can’t imagine what part of Google’s history would lead someone to believe there was any third party system in their production stack anywhere.

marcan_42 · on Dec 16, 2020

Now their corporate/finance stack on the other hand... shudder.

enneff · on Dec 16, 2020

Well, Google did use a bunch of off-the-shelf technologies in the early days, but now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.

still_grokking · on Dec 16, 2020

Didn't they use GNU/Linux from day one on?

jeffbee · on Dec 16, 2020

Closed-source like Oracle I meant. They've been big boosters of all kinds of open-source stuff like linux, llvm, mysql, ...

pmlnr · on Dec 15, 2020

Hush, you'll scare the shiny-eyed FAANG wannabes away, they aren't supposed to know this until employed for at least two decades.

redwards510 · on Dec 15, 2020

Agreed. It drives me absolutely berserk that Reddit forces you to "Click to view more comments" just to see like 3 more comments.

hyperhopper · on Dec 15, 2020

That particular anti-user behavior is a strategy to get you to pay for Reddit Gold, which removes that limitation.

redwards510 · on July 29, 2019

Then why does it say this:

    we also settled on the number, and the implementation of, privacy switches on the phone – there will be 4 switches in total: for the i) BT/Wifi module, ii) the modem, iii) cameras (front/back) and iv) lastly for the microphone.

and right on the pinephone page it says it has this chip:

https://www.quectel.com/product/eg25g.htm

    Key Benefits

    ● LTE Cat 4 module optimized for M2M and IoT applications
    ● Worldwide LTE, UMTS/HSPA+ and GSM/GPRS/EDGE coverage
    ● Multi-constellation GNSS receiver available for applications requiring fast and accurate fixes in any environment
    ● Feature refinements: supports DFOTA*, eCall and DTMF*
    ● MIMO technology meets demands for data rate and link reliability in modem wireless communication systems

Sounds like a cell radio to me.

redwards510 · on July 29, 2019

You aren't buying a phone, you are buying an Experience.

AgloeDreams · on July 29, 2019

Otherwise known as 'we found a way to charge ~$350 for an Open Source OS.'

RandomBacon · on July 29, 2019

If it was that easy, then they would have released it months ago. No, it takes work and work costs money.

redwards510 · on July 29, 2019

Without a browser, it kind of ceases to be a smartphone. It's just a flip phone. So I don't see how that is really something you need to note.

For me it's Google Maps. There is just no comparison. I don't want to imagine trying to drive around finding places without it. It's not just the navigation and traffic, it's the business hours lookup feature and all that.

I hadn't thought about TOTP stores (Authy), but that's certainly way up there. Unless they can read an NFC YubiKey.

ocdtrekkie · on July 29, 2019

See, the problem is, as long as Google is allowed to own services like Maps and YouTube, as well as Android, you've ensured you can't ever switch to a new platform: Google is extremely active in not only refusing to support competing mobile platforms, but actively blocking them from accessing their services, as they've done with Microsoft and Amazon's competing platforms over the years.

I've found for most map queries, Google, Bing, MapQuest, and HERE are all pretty much comparable/a wash.

necovek · on July 30, 2019

It's weird that someone as tied to Google Maps would be looking into a privacy-oriented phone like Librem 5.

AFAIK, Google Maps will record your location/speed even while you are not using it (to get their traffic information), it will record any nearby wifi APs to help populate their aGPS database (so your private APs will get recorded along with their locations on Google's servers), and who knows what else.

Why would you want a Librem 5?

redwards510 · on July 29, 2019

I can't find the info for the price, but I remember seeing it before and it was very low, like $200, which would make this a very exciting entry into the notAndroidOrIOS market. The Librem 5 was cool at first, until they started delaying it indefinitely and the price is too high for mass adoption. By having a very low price on the Pinephone they are courting a lot more tinkerers to get one and help develop the OS versus the Librem which people would kind of (rightly) expect to be fully-formed at launch.

    As things stand today, we expect a fully functional prototype in August

https://www.pine64.org/2019/06/06/june-2019-news-pinephone-p...

https://wiki.pine64.org/index.php/Project_Don't_be_evil

allana · on July 29, 2019

The Allwinner A64 platform has numerous developers working on it compared to the iMX8, which has really fleshed out fully featured hardware support over the last few years. Running most Linux distros should be a breeze on the Pinephone!

megous · on July 30, 2019

That's a bit of an overstatement. It has some interested and active developers, but support is far from fleshed out, especially for portable device use.

There are lots of patches floating around, that you can apply to the mainline linux tree, to get CPU DVFS, thermal management, to improve I2S support, etc. Suspend to RAM is nowhere to be seen, leaving you with always ON SoC, which will drain the battery in half a day doing nothing.

It's all fixable, but let's not pretend A64 has fully fleshed out HW support in the mainline Linux kernel.

OTOH, there's great potential, because the Quectel EC25 broadband module also runs Linux and has some potential for being hackable:

https://projects.osmocom.org/projects/quectel-modems/wiki

So there's definitely a lot of fun to be had with the future PinePhone. You'll be able to log in to your broadband module and perhaps modify it:

https://projects.osmocom.org/projects/quectel-modems/wiki/EC...

RandomBacon · on July 29, 2019

Thank you! I will keep an eye on this.

redwards510 · on July 26, 2019

Really want to pull the trigger on this, the Privacy Switches to disable camera, mic, wifi at the firmware level appealed to me. If you search around you'll find a video the creator made demo'ing skipless 1080p/4k video playback. However, it's probably prudent to wait for some of the initial bugs to get ironed out first.

They also claim to be making no profit off of these.

giancarlostoro · on July 26, 2019

Wish they would let you donate an additional amount if it's not making a profit. I want to see them build more things. I kept wanting the former model but they never got back to me on it. I put in my email several times.

redwards510 · on July 25, 2019

Well said. In addition to simply being naive, I think there is also the self-delusion that everyone subscribes to when they convince themselves that their job (no matter what it is) is somehow helping contribute something good to the world. I had a relative that worked as a telemarketer selling predatory debt consolidation packages to people on the verge of bankruptcy and when he described what he was doing, he enthusiastically framed it like he was helping the people from going bankrupt, even though he was just helping them dig a bigger hole.

If your company is a public corporation, your only purpose is to maximize shareholder value.

redwards510 · on July 25, 2019

Why do you say "HN" is falling for it? Hacker News did not write the article. Buzzfeed did. We're simply commenting on it.

wutbrodo · on July 27, 2019

I'm not sure I understand this comment at all.

Buzzfeed is the one writing the clickbait. Obviously they would not be the ones falling for anything. HN, collectively, put this article on the front page, treating shallow nonsense as if it's notable because it feeds our biases. It's a data point that speaks ill of the collective discernment abilities of HN (and the quality of the community) that we put content-free clickbait on the front page and discuss it as if it's valuable (or more accurately, use it as a fig leaf for soapboxing preexisting views).

This isn't some generalized "HN is full of dumb low-effort takes and clickbait" complaint; I got used to that years and years ago. But part of being on an online forum is pushing back against stuff like this and calling out low quality content when we see it.

redwards510 · on July 22, 2019

Even if you avoid providing your DOB, you almost always have to put the year you graduated college, and that makes it pretty easy to ballpark age.

rongenre · on July 22, 2019

I leave my graduation date off -- it's never been a problem.