
This is the Nightmare Scenario for mailing lists.

Many of them auto-unsubscribe after a bounce.




I said this in another comment, but this seems like a naive way to react to an "address does not exist" error for an address they've already delivered to before. The only legit scenario in which that happens is when the user deletes the address, which is a rare event (pretty much always <= 1 time in the lifetime of any address), and there shouldn't be anything wrong with treating that kind of situation the same as any soft error. If you're wrong, your mail will just get rejected a few more times anyway, and you'll know it's genuinely a dead end.

The underlying issue (wherever this occurs) seems to be a lack of nuance regarding error codes when people try to implement robust systems. Different codes imply different things and shouldn't all just fall into generic buckets.
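For illustration, a bounce handler with that kind of nuance might look roughly like this (a hedged Python sketch; the 4xx/5xx semantics come from RFC 5321, while the bucket names and the delivery-history check are my own assumptions):

    # Sketch: classify SMTP reply codes instead of lumping every
    # failure into one generic "bounce" bucket. The "suspicious"
    # bucket and the history check are illustrative assumptions.
    def classify_bounce(code, previously_delivered):
        if 200 <= code < 300:
            return "delivered"
        if 400 <= code < 500:
            return "transient"    # 4xx: retrying later is expected
        if code == 550 and previously_delivered:
            # "No such user" for an address that worked before may be
            # the receiver's outage; don't treat it as final just yet.
            return "suspicious"
        return "permanent"        # other 5xx: give up for real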


> I said this in another comment, but this seems like a naive way to react to an "address does not exist" error for an address they've already delivered to before.

Like HTTP, SMTP is designed to be stateless, so in the first place the remote server shouldn't return a permanent error in temporary-failure scenarios.

The default error should be 450, "Requested action not taken – the user's mailbox is unavailable", not "the user has deleted everything and left".

These standards worked well before big players came along and said, "My responses mean whatever I choose them to mean, and that meaning doesn't always overlap with the established standards." The only exception is spam, and we now have standards to help reduce it.


Your answer kind of misses the point GP was trying to make.

Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record. In this case the returned "user doesn't exist" error is intended behavior of the mail server, and the post you replied to still stands. If you sent to that address successfully earlier, it's much more likely that the server is responding erroneously than that the account actually got deleted.


> Your answer kind of misses the point GP was trying to make.

Actually, I don't think so.

> Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record.

As a system administrator and/or provider, you have to think about worst-case scenarios and provide sensible defaults. Your mail gateway should have heartbeat checks for the subsystems it depends on (AuthZ, AuthN, storage, etc.) and should switch to a fail-safe mode if something goes wrong. Auth is unreliable? Switch to soft-fail for everyone regardless of e-mail validity; you can hard-fail later, when auth is sane again.

Storage is unreliable? Queue until the buffer fills, then return error 421 ("The service is unavailable due to a connection problem: it may refer to an exceeded limit of simultaneous connections, or a more general temporary problem") or a similar error.

SMTP allows a lot of transient-error communication, and Postfix et al. have plenty of hooks to handle this stuff. Just do it. Being Google doesn't allow you to manage your services irresponsibly. If we can think it, they should be able to do it too.
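A minimal sketch of that fail-safe idea (the subsystem names, health flags, and dispatch below are hypothetical, not how any real gateway is structured):

    # Sketch: degrade to transient SMTP errors when the subsystem that
    # answers "does this user exist?" is unhealthy, so senders retry
    # instead of hard-bouncing. All inputs are illustrative.
    TRANSIENT_UNAVAILABLE = 421   # service unavailable, try again later
    TRANSIENT_MAILBOX = 450       # mailbox unavailable, try again later
    PERMANENT_NO_USER = 550       # no such user: a genuine hard failure
    ACCEPT = 250

    def rcpt_to_response(address, auth_healthy, storage_healthy, user_exists):
        if not auth_healthy:
            # Can't trust "user not found" right now: soft-fail everyone.
            return TRANSIENT_MAILBOX
        if not user_exists(address):
            # Auth is healthy, so this answer is trustworthy.
            return PERMANENT_NO_USER
        if not storage_healthy:
            # Valid user, but nowhere to queue: ask the sender to retry.
            return TRANSIENT_UNAVAILABLE
        return ACCEPT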


Technically speaking it's possible to soft-bounce on 5xx errors, but in practice, retrying even when the destination tells you not to is the quickest way to get your reputation ruined.

Google's SMTP servers should have returned a soft bounce here (not a hard bounce), so that retrying could work.


But then why would Google's mailserver not know that it once delivered email to that mailbox?

If the protocol were stateful, why should the state be kept by the "sender" and not by the "receiver"? Being stateless removes this ambiguity, in my opinion.

Also, we should remember how bad sending emails to a non-existent address is for your spam reputation, so I would not blame the mailing list for being "overly cautious".


The situation here is that the service was so borked that it didn't know what it didn't know.

Hard-failing good addresses is a much worse outcome than soft-failing bad addresses. In the latter case, the remote sender tries again later and eventually gets a hard bounce. In the former, good addresses are permanently dropped from numerous services, and sent mail is lost rather than retried.

Critical failures should soft bounce until positively determined otherwise.


Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place. This issue is Google sending the wrong error code because of a problem on their end.

Mailing lists believing what an email provider tells them and acting in an overly cautious way is a separate issue.


> Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place.

This can't work; you can say that gmail's system should have a component that recognizes the difference between various failures, but that new component can itself fail. You can't solve the problem of "what if something fails" by saying "just add a new component that won't fail".


Of course it can. Software is complex and that complexity can cause all kinds of problems, as can the fact that the networks linking computers are unreliable, but software is fundamentally deterministic. If you write a piece of code that returns a temporary failure when it can't look up whether a user exists, that code will not mysteriously change itself to start returning permanent "user does not exist" errors. (Now, if your overall stack is designed in such a way that you can't reliably tell the difference between lookup failures and users that don't exist, you have a problem - but the problem is with the design of the system, not some inherent problem with software.)

Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.


> but software is fundamentally deterministic.

That's true, but human behavior is also fundamentally deterministic, and those two observations are about equally useful.

> Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.

No it isn't. Those are deterministic too.


> that code will not mysteriously change itself to start returning permanent "user does not exist" errors

That is true in a perfect world. In the current world, there are all sorts of ways that code implemented one day does not run the same the next day. Say the code is in an interpreted language and an unrelated sysop updates the language runtime in a way that changes the behavior. Again, in a perfect world that doesn't happen, but that is not always the world we live in. I have great sympathy with people who treat software systems AS IF they were "physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways".


> doesn't fail completely but cannot access part of the data

If a mail server can't tell whether a user/email is valid, it should either return a temporary failure or accept and queue the message.

Unless of course you're too big to fail, then you just do whatever you want.


I think we’re just teasing at the notion that “permanent failure” isn’t a hard and fast distinction. I think some polite retry policy is not unreasonable even for the most explicit “permanent failure” response from a remote server. Imagine the most extreme example: hackers take over the remote server and make it respond with “permanent failure.” After a day, the legit owners regain control of the system. You can’t really argue that “the remote server never should have delivered that response unless the failure truly was permanent,” because clearly there was a mismatch between the apparent intent behind the response and the actual intent.


The problem is that hard bounces can cause big issues with your email-sending reputation, and too many can make you lose access to mailing services such as Amazon SES, so you're encouraged at all points during the implementation of anything that sends email to blacklist any bounced addresses. This of course works fine, right up until Gmail starts bouncing all emails.


I think it’s spot on. Gmail’s failure mode in this scenario isn’t correct. The rest of the internet is functioning as designed.


This is exactly it. The RFC has error codes for temporary failures (just like HTTP 503, for example). If you fail to implement the RFC, the joke's on you.


If Google and other major mail providers weren't opaque about this, then fine, but for me a single bounce is an immediate removal. I can't take the risk. I can't imagine the hell that would ensue trying to get through to Google to ask them to take me off their deliverability shitlist.


Has anybody ever received a reply from gmail's postmaster address?

I've had good experiences with them fixing issues related to their spam flagging of messages coming from our self-hosted email server, but I never got any specific reply.


I 100% assure you that everyone handling Gmail errors and getting burned isn't just tossing failures into a single bucket. There are a zillion reasons mail can bounce, and all of them are taken into account. This is a particular bounce code that signifies that an ESP shouldn't send email to this address again.

Email service providers are HIGHLY incentivized to act 100% in accordance with the wishes of the system where the mailbox exists because it’s highly likely that acting in any way that’s considered abusive could get your emails landing in a spam folder.

Mailboxes cease to exist thousands of times a day at places I've worked previously. Employees leave all the time and people shut down mailboxes; this is Google's fuckup, nobody else's.


There is actually a very good reason to drop these email addresses: a high rate of non-deliverable emails hurts your sender score. It's a total pain to get emails delivered to the major email providers in the first place, and you immediately land in spam (or with emails not delivered at all) if they don't trust the sending email server or your score is anything but stellar!


I have 2 responses to the sender reputation concern:

1. If the user's mail service penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed, that itself is absolutely inexcusable, nonsensical behavior that needs to be fixed. You shouldn't do that, just as you shouldn't shoot the mailman (or even arm yourself...) merely because he knocked a second time.

2. Notwithstanding the previous point, I don't buy this as valid justification anyway. The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two. The bounce-rate increase due to such an event is very negligible here—people don't suddenly delete their accounts en masse. When that happens, it's clearly due to an outage, not because half the users at that domain suddenly decided to delete their accounts. (A mass of simultaneous bounces from one domain is also easy to detect, btw, and is another useful signal for drastically lowering the bounce rate you attribute to that domain, if you're absolutely paranoid about your immaculate delivery rate dropping by an epsilon. But it shouldn't be necessary given how negligible the impact should be.)

So I don't buy this excuse one bit.
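Concretely, the history-aware retry being proposed could be as small as this (a hedged sketch; the field names and thresholds are invented for illustration, not any list manager's actual behavior):

    # Sketch: treat a 550 on a previously good address as soft for a
    # couple of attempts before unsubscribing for real.
    from datetime import datetime, timedelta

    MAX_SUSPICIOUS_RETRIES = 2
    RETRY_SPACING = timedelta(days=1)

    def handle_bounce(sub, smtp_code, now=None):
        now = now or datetime.utcnow()
        if 400 <= smtp_code < 500:
            return "keep"             # transient: normal retries apply
        if smtp_code == 550 and sub.last_successful_delivery:
            if sub.suspicious_bounces < MAX_SUSPICIOUS_RETRIES:
                sub.suspicious_bounces += 1
                sub.next_attempt = now + RETRY_SPACING
                return "retry_later"  # worked before; give it a day
        return "unsubscribe"          # no history, or retries exhausted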


> The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two.

What you're proposing is to explicitly ignore the specification (which says that you should _not_ retry when you receive a 550) and to implement custom smart retry logic that handles temporary error cases but also doesn't get you blocked.

> So I don't buy this excuse one bit.

I'm all for building resilient services, but "try to detect when a server incorrectly returns 550" is not something I would prioritize at all. I'd rather manually clean up after this occurrence than have this complicated retry logic. It's not an "excuse"; it's a very sensible trade-off.


No, I am quite explicitly not ignoring the spec. It quite deliberately says "should not", not "must not". If anyone is ignoring the spec here, it's you, not me. "Should not" is sound advice; it's telling you what you're supposed to do when you don't have a reason to behave differently. You know, like how you "should not" leave the lights on when you leave your room. Or—more pertinently here—how you "should not" assume everyone is a liar. But when you actively see evidence that deviates from the norm, you are given the power—and arguably the responsibility—to exercise your discretion and adapt to the situation. If the spec wanted blind obedience, it would say "must not" like it did in 60 other places, but it quite obviously and intentionally decided that would be unwise, and this scenario seems like a pretty clear illustration of why.


But the RFC isn't only for senders; it's also for receivers, isn't it?

That means there are two sides to the interpretation of what SHOULD NOT means. And in this case, senders have learned from experience how Google interprets it when someone SHOULD NOTs:

- The sender SHOULD NOT send us the same sequence again when we reply 550; if they do, they MUST go on our shitlist.

Obviously it's not so binary, and it takes retries to several different recipients, but people have very good reason to interpret this SHOULD NOT as MUST NOT.


No, that's not a sane way to interpret this RFC for the receiver either. I already answered this, so you'll have to go back to my earlier comment (this might be my last comment, as I won't keep repeating myself): any system (be it Google's or anyone else's) that penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed is just plain trash. A sender that attempts delivery to an address that accepted their email a day ago is obviously unlikely to be a spammer; there's no justification for treating them as one. It is absolutely unreasonable to interpret the sentence this way. Just as it's unreasonable to interpret "the mailman shouldn't knock a second time when he's told the recipient has moved" as "I should never open the door for the mailman ever again if he does so".


Good callout. The underlying issue behind the lack of nuance is probably /state/. Being more nuanced about these errors requires managing state, which tends to increase complexity and scaling challenges.


Nuance is not called for. The standard states that a 5xx SMTP error is a permanent error and that "The SMTP client SHOULD NOT repeat the exact request".

Gmail screwed up here by returning a 550 error; it's not anyone else's job to second-guess that or to retry in contradiction of the accepted standard.

https://tools.ietf.org/html/rfc5321


Gmail screwed up, but that's beside the point. We're talking about designing robust systems. You don't design a robust system by assuming nobody will screw up!

Re: the RFC, note it says "should not", not "must not". That seems to suggest they acknowledge repeating might actually make sense in some cases. And honestly, the practicalities of this situation and the risk-reward tradeoff seriously tilt toward repeating the request later regardless of what the RFC says. The world isn't going to end.


Try delivering to invalid email addresses too many times ("too many" of course being up to each mail provider), and you will be the one shitlisted (and rightfully so, as you are likely brute-force enumerating valid email addresses).

For any small provider, getting on the shitlist is catastrophic: unlike the big providers, getting off of it will be hard or impossible.


Rules for thee, not for me


> And honestly, the practicalities of this situation and the risk-reward tradeoff seriously tilt toward repeating the request later regardless of what the RFC says. The world isn't going to end.

That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

If you believe the standard is not robust enough to handle problems like this, first work towards a fix to the standard and then implement the solution. Not the other way round.


> That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

I didn't suggest people should apply this thought process in arbitrary cases. I said it should be applied in this case. You can take any thought process that gives a good outcome in one situation and obtain a bad outcome by applying it to the wrong situation. That's not an indictment of the thought process. It's just an indictment of the person failing to correctly judge its applicability.

That said, by all means, do try and go fix the standard; I wasn't trying to imply you shouldn't do that.


Ah, I think I did not describe the repercussions of making exceptions (even if they are in highly specialized cases like this). If you allow yourself to make such exceptions, you diminish the motivation for you (or someone else) to fix the problem in the right place. Most workarounds tend to live forever.


There's no clear-cut rule here. Some workarounds stay workarounds and never get standardized. Some become so well accepted and adopted that people then put them into standards. It's great to put things into standards, so by all means, do try to improve them. But that shouldn't block you from everything else. At the end of the day, standardization is just a means to an end, and the end is what matters here. Nobody cares whether their mailman's knocks follow an RFC or not. They just want their mailman to deliver packages with reasonably minimal disruption.


> There's no clear-cut rule here

Exactly, that is why it is important to follow standards. Most engineering decisions are not clear-cut and are born out of tradeoffs. That is why we agree on standards that define those tradeoffs instead of every one of us having our own take on situations.

> Nobody cares whether their mailman's knocks follow an RFC or not

If there is a Mailman RFC which says: "If someone opens the door and says `Mike does not live here' then DO NOT attempt delivering the same package"

THEN I expect the mailman to not bother me again, EVEN IF it was actually my mistake that I forgot my roommate Mike actually does live at this address.


I'm tired of arguing about this. Engineers agree on standards for a good reason, yes, but they also agree on "should not" rather than "must not" for a good reason too. I'll leave this as my last comment, but you might want to read the post-mortem. Turns out their implementation of the RFC wasn't even buggy. They just messed up the domain name in the configuration. Which you can only be resilient to by retrying the request sometime later.


But here’s the thing: the standard (like all standards) is obviously not robust enough to physically prevent responses which incorrectly indicate permanent failure.

These incorrect responses could be caused by mistakes which the remote server admins could reasonably avoid, like software bugs. I understand not having much sympathy for that case, especially from an organization with no shortage of resources. But they could also be caused by, for example, hackers or governments exerting control over the remote server temporarily.

A standard which explicitly refuses to acknowledge these possibilities is not what I would describe as "robust." An obviously better alternative would be to set some standards around what constitutes a polite retry policy.


My understanding is that "should not" means you should not retry. If I do retry, then the other party can rightfully claim that I am DDoSing their service or trying to send emails to deleted accounts, and put me on a spam list. I do not think that ignoring the RFC and trying to cover up for Google is the best course of action here. Maybe, just maybe, this is the right time for people to realise what it really means to have an entity like Google. Because as it stands, we are going to have the DNS infrastructure moved over to them with DoH, and a similar outage is going to be even more devastating. The internet was designed to be resilient to failure because of its distributed nature, and right now this just shows why concentrating resources in one place is bad.


You "should not" repeat delivery in basically the same way the mailman "should not" knock a second time if he's told the recipient doesn't reside at the designated address. What "should not" means in these cases is: "knock only once, and assume you're being told the truth in the absence of further evidence to the contrary". But when you clearly saw the recipient reside there yesterday, it makes sense to try to knock and catch him again tomorrow. Because, you know, maybe something went wrong, e.g. maybe the person who opened the door didn't recognize the name (or whatever). At the end of the day, the mailman's job is to deliver the mail with minimal disruption, not to play hot potato with envelopes.


The terminology is well defined [0][1], so in this case, retrying is not ignoring the RFC.

It's a difficult one though, because as you rightfully state, covering up for Google is not the best course of action for the system as a whole, yet it's likely a good course of action for those users who didn't get their emails.

[0]: 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

[1]: https://tools.ietf.org/html/rfc2119


In most Internet Engineering Task Force RFCs, the standard verbiage for "must not" is in fact usually "should not".


The phrase "must not" appears some 60 times in this RFC.


Thanks for pointing that out. I suppose an RFC writing style guide would help provide consistency in language and interpretation.


The standard says "don't resend"; it doesn't say "assume the worst and begin removing the user from all systems." That was the mailing list software's decision.


You generally avoid sending to known bad addresses or your reputation will be destroyed very quickly. The 550 response is (read: was) a clear "you fucked up, this user doesn't exist" prior to this.

I saw someone on Reddit say his SES account was suspended for sending tons of bounced emails in a short period of time - it's taken very seriously by ESPs.

Edit: see also user rtx a few comments below.


We're not talking about repeating the exact request; a subsequent request for the same recipient would be to deliver a completely different message: whatever subsequent message is sent to the mailing list.


Right. In this case it's already pretty typical for mailing lists to track bounces and retry on some errors, so I'd imagine that part is mostly done, and the missing piece would be taking more care in checking the error conditions.
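For instance, delivery status notifications usually carry an enhanced status code (RFC 3463) that is far more specific than a bare "bounce" (a rough sketch; the parsing helper and action names are assumptions):

    # Sketch: act on the enhanced status code from a DSN rather than
    # on the mere fact that something bounced.
    def bounce_action(enhanced_code):  # e.g. "5.1.1" or "4.2.2"
        klass, subject, detail = enhanced_code.split(".")
        if klass == "4":
            return "retry"        # persistent transient failure
        if (subject, detail) == ("1", "1"):
            return "unsubscribe"  # 5.1.1: bad destination mailbox address
        return "review"           # other permanent errors: look closer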


Aside: I'm not an expert, but systems like MailChimp will get very worked up if your list has lots of undeliverable addresses on it. This can trigger an audit of your list, which prevents sending, etc. These audits seem to take quite a while, in my very limited experience.


So what you're saying is, if you're annoyed by "subscribe to our mailing list" modal popups, "doesnotexist@garbage.blah" is better than "jeff@amazon.com"?


In practice, no, it's more nuanced than that. Any mailing list operated through any remotely legitimate ESP will require subscriptions to be confirmed/acknowledged up front before any delivery is attempted to a recipient. If the confirmation step fails, i.e. the "check your email and click a link to verify you really signed up" email bounces, or nobody ever clicks the link, the list owner isn't generally going to be penalized for that.

If you want revenge for modal popups, your best bet is to create a bunch of throwaway email accounts, subscribe to the mailing list from them, and start reporting the individual messages as spam when they arrive. Flag them as junk at the mailbox provider (Gmail, Outlook, etc.) and use the links in the List-Unsubscribe headers to flag them at the ESP's end, too.


If you're trying to get the web site's mail server blacklisted, definitely.


Aka throw the RFC out of the window and implement a broken system because Google did that?


> I said this in another comment, but this seems like a naive way

That's the standards-compliant way. Also I'd argue that spec'ing your code to handle cases where Google fails that badly is (was?) a poor allocation of LoCs.


You're entirely missing the point by blaming this on Google. This is meant to detect and handle failure modes that could happen to anyone (including Google), for reasons both inside and outside their control.


I had this issue with GitLab. My email provider returned a permanent error one day (due to an issue on their end), so GitLab silently stopped sending any emails to my address. I checked my email address in the preferences many times and had no idea it was blocked on GitLab's end. Eventually, after not getting any notifications, I contacted their customer service and was told about this hidden setting.

So if you are not getting any notifications from GitLab, even though your email is correct, I suggest contacting them and asking if you have been blocked due to an error.


I posted this as a problem on my problem-validation platform [1], and a user built a quick solution that displays a token if the email service received an email from the sender.

[1]: 'Check email service status before sending emails' - https://needgap.com/problems/178-check-email-service-status-...


Great point. And email delivery services that keep auto-suppression lists to protect sender reputation could potentially remove the affected entries on behalf of their customers.


Good. I was hoping this was the case. Unfortunately I already moved to Fastmail, so there will be little benefit for me.


Oh no.


My account with Amazon went into review because of this. I hope their team is aware of it.



