"Type: Permanent; SubType: General; Code: smtp; 550-5.1.1 The email account that you tried to reach does not exist. Please try 550-5.1.1 double-checking the recipient's email address for typos or 550-5.1.1 unnecessary spaces. Learn more at 550 5.1.1 https://support.google.com/mail/?p=NoSuchUser y128si147264pfg.177 - gsmtp"
This is pretty much the worst response possible. Hard bounces mean that email delivery services are going to start automatically removing, or at least stopping delivery to, entire slews of email addresses.
A lot of clean up is going to be needed as a result of this.
To add some more detail: when using a third-party email delivery service, those services will either blacklist or just outright remove email addresses when they get a hard-bounce "email address no longer exists" message back.
Some providers make re-adding an address after a hard bounce a non-trivial task, since after all, the authority on that email address just said it doesn't exist.
I really cannot believe they did not immediately hack in a new rule to their SMTP server: never return a 5xx (permanent failure); instead return a 421 (temporary failure, try again later).
That simple fix buys them 24-72 hours to solve this properly.
Yeah, it burdens servers sending mail to them because now they have to hold on to all mail (including mail that really is permanently undeliverable) for another day or so, but that's still better than what's happening right now.
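A minimal sketch of what that hack could look like on the receiving side, using the third-party aiosmtpd library; the BACKEND_OUTAGE flag and the user_exists() lookup are hypothetical stand-ins, not anything Google has said they run:

```python
from aiosmtpd.controller import Controller

BACKEND_OUTAGE = True  # hypothetical kill switch, flipped while account lookups are broken

def user_exists(address: str) -> bool:
    return True  # stub; the real lookup against the account store goes here

class OutageAwareHandler:
    async def handle_RCPT(self, server, session, envelope, address, rcpt_options):
        if BACKEND_OUTAGE:
            # Temporary failure: the sending server keeps the mail queued and retries.
            return "421 4.3.2 Service temporarily unavailable, try again later"
        if not user_exists(address):
            # Permanent failure: the sender bounces the mail and ESPs suppress the address.
            return "550 5.1.1 The email account that you tried to reach does not exist"
        envelope.rcpt_tos.append(address)
        return "250 OK"

controller = Controller(OutageAwareHandler(), hostname="0.0.0.0", port=8025)
controller.start()
```

With the flag on, well-behaved senders keep the message queued and retry, which is exactly the 24-72 hour window described above.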
A 5xx error results in the address being added to a suppression list, so future emails won't be delivered (by most ESPs), and not answering on the MX at all would probably be just as bad, or worse (or result in millions/billions of emails being re-queued due to timeouts?)
His solution would mean the exponential retry backoff baked into most sending services kicks in instead, which would buy them a few hours, result in no lost emails, and no suppression list additions.
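Roughly, the sender-side split being described looks like the sketch below; the names are illustrative, but the behavior (5xx suppresses, 4xx retries with backoff) matches what most ESPs describe:

```python
import time

suppression_list = set()   # addresses most ESPs will never mail again
MAX_ATTEMPTS = 10

def handle_delivery_result(address: str, smtp_code: int, attempt: int, retry_queue: list):
    if 500 <= smtp_code < 600:
        # Hard bounce: suppress the address and drop the message for good.
        suppression_list.add(address)
    elif 400 <= smtp_code < 500 and attempt < MAX_ATTEMPTS:
        # Soft failure: retry later with exponential backoff (1, 2, 4, ... minutes).
        retry_at = time.time() + 60 * (2 ** attempt)
        retry_queue.append((retry_at, address, attempt + 1))
    # otherwise give up quietly after too many temporary failures
```

Which is why a 421 during the outage would have cost senders nothing but a delay.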
This outage seems to have lasted for about 2.5 hours. Probably this was fixed by rolling back whatever caused it. (I don't think the rollout was finished before they resolved it; my mail server sends a lot of emails to Gmail addresses, and even at peak I was only seeing maybe about 1/3 mails be rejected.)
There is no way that putting in a hardcoded hack like that would have been faster. Making the change is, of course, fast.
But then you need to review it (and this is a super risky change, so the review can't be rubber stamped). Build a production build and run all your qualification tests. (Hope you found all the tests that depend on permanent errors being signalled properly). And then roll it out globally, which again is a risky operation, but with the additional problem that rolling restarts simply can't be done faster than a certain speed since you can only restart so many processes at once while still continuing to serve traffic.
The kind of thing you describe simply can't be done in 2.5 hours by changing the SMTP server. The best you could get is if there was some kind of abuse- or security-related articulation point in the system, with fast pushes as required by the problem domain but still with sufficient power to either prevent the requests from reaching the SMTP server at all, or intercept and change the response.
As a trivial example, something like blocking the SMTP port with a firewall rule could have been viable. Though it has the cost of degrading performance for everyone rather than just the affected requests.
My mail server logs show about 20 failures in all of the last week until yesterday 20:43 CET, then 350 failures between 20:43-00:21, then nothing after that. So fair enough, from the client side rather than the status page it looks like 3.5 hours rather than 2.5.
But still, given that resolution time, the suggested solution of changing the SMTP server is absolutely ludicrous.
Yes. I email hundreds of thousands of Gmail users each week (yes, double opt in, they all want the mails!) and we immediately delete any user for whom any Gmail error comes up at all in order to keep a solid delivery record with them. Sounds like we might have deleted 80% of our list if we'd sent today..!
So a new thing to do: quarantine addresses instead of deleting them, and if most addresses for one provider fail, give them another (maybe manually triggered) try later on.
(And if no such provider-wide failure is detected, delete the quarantined addresses.)
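A minimal sketch of that quarantine idea, with hypothetical names throughout (the threshold, the counters, and the unsubscribe() helper are all assumptions about how a list manager might be structured):

```python
from collections import defaultdict

quarantine = defaultdict(list)      # domain -> addresses that hard-bounced
sent_per_domain = defaultdict(int)  # how many mails went to each domain this run

OUTAGE_THRESHOLD = 0.5              # treat >50% hard bounces for a domain as an outage

def record_hard_bounce(address: str):
    domain = address.rsplit("@", 1)[-1]
    quarantine[domain].append(address)

def review_quarantine(domain: str):
    sent = sent_per_domain[domain]
    bounced = len(quarantine[domain])
    if sent and bounced / sent > OUTAGE_THRESHOLD:
        # Looks like a provider-wide outage: keep the addresses and retry later,
        # possibly only after a manual confirmation.
        return "retry_later"
    # Isolated hard bounces: safe to treat as genuinely dead addresses.
    for address in quarantine.pop(domain, []):
        unsubscribe(address)        # hypothetical helper in the list manager
    return "purged"
```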
My guess is that's how most email service providers handle this - they don't actually delete the address, they just set a flag on it: bounced, complained, unsubscribed. This way the list owner can run an export and see all the status codes.
Yes, we're unusual in not relying on third parties for list management. We can rollback. Or I might just comment out the 'unsub on hard bounce' code for the rest of the week..! :)
Yes, most likely! That is a common approach for 'soft bounces' in most list management systems (e.g. MailChimp).
The problem here is Gmail has been throwing out "NoSuchUser" errors which are an instant unsub in most systems because Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.
I'm extremely paranoid about email hygiene, tiny bounce rates and high delivery rates, so we aggressively unsubscribe troublesome addresses (often to the point of getting reader complaints about it) for many reasons beyond that, however.
That better describes what I was trying to say, yes. Reputation then affecting deliverability.
Over 80% of our subscribers use Gmail so to say I'm paranoid about maintaining a good record with them is an understatement ;-) Gmail is a huge weak link for us.
Logically you'd expect unsubscribe to only act after lots of bounces of this format when the address has been receiving mail fine before. It also seems reasonable not to trust such bounces for the entire domain for a while when this happens to lots of other addresses that have worked fine before. Not that I expect software currently works this way, but it does seem like a common sense thing to code in.
I mean, it's possible, but you'd need to queue up a day's worth of bounces, do the analysis, and then handle the bounces asynchronously later on to do that.
Most systems operate more immediately in isolation on individual addresses than that right now, because such analysis is generally not needed (until today, of course ;-)).
Mail agents already queue emails that bounce though; it's a matter of changing the conditions for when you retry and/or unsubscribe. I imagine you can do the analysis in real time too... just look at the bounce and see if it pertains to an email you sent to in the past, and if so, increment some rolling counter for that domain.
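A minimal sketch of that rolling counter, assuming you track whether an address has accepted mail recently; the window and threshold are made-up numbers:

```python
import time
from collections import defaultdict, deque

WINDOW = 24 * 3600                   # look at the last 24 hours
domain_bounces = defaultdict(deque)  # domain -> timestamps of recent hard bounces

def should_unsubscribe(address: str, delivered_recently: bool) -> bool:
    domain = address.rsplit("@", 1)[-1]
    now = time.time()

    bounces = domain_bounces[domain]
    bounces.append(now)
    while bounces and bounces[0] < now - WINDOW:
        bounces.popleft()

    if delivered_recently and len(bounces) > 100:
        # The address worked before and the whole domain is suddenly bouncing:
        # probably a provider outage, so keep the subscriber and retry later.
        return False
    return True   # isolated hard bounce on this address: unsubscribe as usual
```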
Mailgun sent a warning mail about increased bounces from our account. Sure, they know what's going on... but we send a 4-5 digit number of mails per hour - that's a lot of bounces.
That means I can't just resend the emails blindly, because I'm too scared to trigger some sort of automatic suspension...
(I don't do this regularly, so I'm not familiar with all the features... additional mail verification could probably help...)
They should be returning 421 for backend outages so that sending servers queue and retry the emails. 550 can be interpreted by some as deleted [1] or even banned accounts in some cases. Maybe someone here could convince them to change the logic that occurs during an outage.
Yah. Maybe there's an unexpected way that things can fail resulting in 550's. But maybe at Google's scale you should have some kind of kill switch to stop answering SMTP or to not send permanent errors at all, so that you could flip a switch and prevent the worst consequences of this rather than let it go on for a couple of hours.
A lot of people will lose transactional email messages because of this.
I'd absolutely hate to be hit by this at this time. Thankfully I made a time investment to run my own mail server years ago. A handful of times it broke down; it either went offline or started returning 4xx codes due to a misconfigured or broken milter after an update. Neither meant lost messages from normal senders that use queuing MTAs.
Same for me, mainly for privacy concerns. And I back it up daily to my local NAS. It's so easy to configure and run your own mail server, that I'm surprised we are the minority in the tech community.
> It's so easy to configure and run your own mail server
Is it? Is dealing with IP reputation, getting your emails accepted by major providers, and being on the hook for fixing everything yourself very easy? I haven't tried, so I don't have personal experience, but I've heard enough horror stories to think that it's not a good use of my time.
The sending side of the MTA can be set up manually in about an hour on a Debian server, with DMARC, DKIM, SPF, etc. Make that a day if you want to read up on and understand each of those things in more detail, if you haven't configured them before. There's really not much to play with in this direction for a typical personal mail server.
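If you want a quick sanity check of the DNS side of that setup, here's a sketch using the dnspython package; example.com and the DKIM selector "mail" are placeholders for your own domain and selector:

```python
import dns.resolver

def txt_records(name: str):
    try:
        return [r.to_text() for r in dns.resolver.resolve(name, "TXT")]
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []

domain = "example.com"
print("SPF:  ", [r for r in txt_records(domain) if "v=spf1" in r])
print("DKIM: ", txt_records(f"mail._domainkey.{domain}"))   # selector "mail" assumed
print("DMARC:", txt_records(f"_dmarc.{domain}"))
```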
The receiving side is where there's a great range of options, and many things to try and have fun with. You can have anything from a single catchall mailbox with no filtering, no GUI, and simple IMAP or POP3 access for a MUA, to a multi-account, multi-domain setup with server-side filtering, database-driven mailbox and alias management, proper TLS, web MUA access, etc. It can also be built up gradually, starting from a very simple setup and moving to something more complicated, so that you never lose track of how things work.
Mine are accepted by Gmail so I am good. Considering how dominant Gmail is, that's all that really matters.
Regarding getting a bad IP rating, normally that's due to having an insecure config, like acting as an open relay, or not having DKIM enabled. There are lots of tutorials online about this, if you know Linux it really is easy.
I had an IP reputation issue and managed to resolve it after some time.
TLDR: Before you spin up a mail server, check if your IP address is on any of the blacklists [0]-[1] as well as Proof Point's list [2]. If it is, then try and get a different IP address.
I spun up a hosted server on Digital Ocean and received an IP address. I checked several black lists from a few email testing/troubleshooting sites [0] and [1] and all was groovy; my IP address wasn't on any list.
I got a bunch of 521 bounces when I tried emailing a neighbor who had an att.net address.
So, I checked the troubleshooting websites, and my IP address was listed as clean.
My logs said I should forward the error to abuse_rbl@abuse-att.net, so I did.
Those emails were never delivered, because abuse-att.net had its own blacklist. I was getting 553 errors. In the logs, the message from their server told me to check https://ipcheck.proofpoint.com.
Proof point runs their own blacklist that some enterprises use (e.g. att and apple [3]). I checked their list, and lo and behold, my IP address from Digital Ocean was blocked [2]. Digital Ocean wasn't able to remove the IP address from their blocklist and suggested I spin up a new droplet with a different IP address.
I didn't want to do that, so I sent Proof Point an email that went unanswered; the email asked them to remove my IP address. I forgot about the issue for five or six months (this is a personal server), and ran into the issue again a few months ago. So I sent Proof Point an email again, this time with different wording emphasizing that "my clients" were having delivery issues. Within a day, they removed my IP address from their block list.
So, my main suggestion is to check if your IP address is on any of the blacklists as well as Proof Point's list before you start on your server. If it is, then try and get a different IP address.
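For the DNS-based lists, the check itself is easy to script; this sketch uses the dnspython package and the usual DNSBL convention (reverse the IP's octets, look it up under the list's zone; an answer means listed, NXDOMAIN means clean). Note that some lists rate-limit or refuse queries from public resolvers, and Proof Point's list isn't a public DNSBL at all, so that one still needs their web form:

```python
import dns.resolver

BLACKLISTS = ["zen.spamhaus.org", "bl.spamcop.net"]

def check_ip(ip: str):
    reversed_ip = ".".join(reversed(ip.split(".")))
    for zone in BLACKLISTS:
        try:
            dns.resolver.resolve(f"{reversed_ip}.{zone}", "A")
            print(f"{ip} is LISTED on {zone}")
        except dns.resolver.NXDOMAIN:
            print(f"{ip} is clean on {zone}")

check_ip("203.0.113.10")   # placeholder; use your server's address
```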
Does anyone have more "enterprise" lists, like Proof Point, to check?
I also had the same hard bounce (when emailing from a non-gmail address -- fastmail -- to a gmail address). Sent it again minutes later and then it worked.