
> I said this in another comment but this seems like a naive way to react to an "address does not exist error" that they've already delivered to before.

Like HTTP, SMTP is designed to be stateless, so in the first place the remote server shouldn't return a permanent error in a temporary-failure scenario.

The default error should be 450 ("Requested action not taken: the user's mailbox is unavailable"), not "the user has deleted everything and left".

These standards worked well before big players came along and said, "my responses mean whatever I choose them to mean, and that meaning doesn't always overlap with the established standards." The only exception is spam, and we now have standards to help reduce it.
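For reference, RFC 5321's reply-code classes make the transient/permanent split mechanical. A minimal sketch in Python (the function name is illustrative, not any real library's API):

    def classify_reply(code: int) -> str:
        # RFC 5321: the first digit of the reply code determines severity.
        if 200 <= code < 300:
            return "delivered"      # 2yz: positive completion
        if 400 <= code < 500:
            return "retry-later"    # 4yz: transient failure, e.g. 450 mailbox unavailable
        if 500 <= code < 600:
            return "bounce"         # 5yz: permanent failure, e.g. 550 no such user
        return "unknown"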



Your answer kind of misses the point GP was trying to make.

Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record. In this case the returned "user doesn't exist" error is intended behavior of the mail server and the post you replied to still stands. If you sent to that email successfully earlier, it's much more likely that the server is responding erroneously than that the email actually got deleted.


> Your answer kind of misses the point GP was trying to make.

Actually, I don't think so.

> Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record.

As a system administrator and/or provider you have to think about worst-case scenarios and provide sensible defaults. Your mail gateway should have heartbeat checks on the subsystems it depends on (AuthZ, AuthN, Storage, etc.) and it should switch to a fail-safe mode if something happens. Auth is unreliable? Switch to soft-fail for everyone, regardless of e-mail validity. You can hard-fail invalid addresses later, when auth is sane again.

Storage is unreliable? Queue until the buffer fills, then switch to error 421 ("The service is unavailable due to a connection problem: it may refer to an exceeded limit of simultaneous connections, or a more general temporary problem") or a similar error.

SMTP allows a lot of transient-error communication, and Postfix and friends have plenty of hooks to handle this stuff. Just do it. Being Google doesn't allow you to manage your services irresponsibly. If we can think of it, they should be able to do it too.
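Something like this, sketched in Python (the health flag and user set are stand-ins for real subsystem checks, not an actual API):

    known_users = {"alice@example.com"}  # placeholder for the real user service

    def rcpt_to(address: str, auth_healthy: bool) -> tuple[int, str]:
        if not auth_healthy:
            # A dependency is degraded: soft-fail (4xx) everyone, because a
            # "user not found" answer can't be trusted right now.
            return 450, "4.2.1 Mailbox unavailable, try again later"
        if address in known_users:
            return 250, "2.1.5 Recipient OK"
        return 550, "5.1.1 No such user here"  # hard-fail only when the lookup is trustworthy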


Technically speaking it's possible to soft-bounce upon 5xx errors, but in practice, retrying even when the destination tells you not to is the quickest way to get your reputation ruined.

Google's SMTP servers should have returned a soft bounce here (not a hard bounce), so that retries could work.
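The sender-side policy is then straightforward; a rough sketch in Python (the backoff schedule and the send_once callback are illustrative assumptions, not any MTA's real config):

    import time

    RETRY_SCHEDULE = [15 * 60, 60 * 60, 4 * 3600, 24 * 3600]  # seconds between attempts

    def deliver_with_retries(send_once) -> str:
        for delay in RETRY_SCHEDULE:
            code = send_once()  # returns the SMTP reply code of one delivery attempt
            if 200 <= code < 300:
                return "delivered"
            if code >= 500:
                return "hard-bounce"   # destination says permanent: stop, or risk reputation
            time.sleep(delay)          # 4yz: transient, back off and retry
        return "soft-bounce-expired"   # transient failures outlasted the schedule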


But then why would Google's mailserver not know that it once delivered email to that mailbox?

If the protocol is stateful, why should the state be kept by the "sender" and not by the "receiver"? Being stateless removes this ambiguity, in my opinion.

Also, we should remember how bad sending emails to a non-existent address is for your spam reputation, so I would not blame the mailing list for being "overly cautious".


The situation here is that the service was so borked that it didn't know what it didn't know.

Hard-failing good addresses is much worse than soft-failing bad addresses. In the latter case, the remote sender tries again later and eventually gets a hard bounce. In the former, good addresses are permanently dropped from numerous services, and sent mail is lost rather than retried.

Critical failures should soft bounce until positively determined otherwise.


Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place. This issue is Google sending the wrong error code because of a problem on their end.

Mailing lists believing what an email provider tells them and acting in an overly cautious way is a separate issue.
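Concretely, the lookup needs a third state; a minimal sketch (user_db and BackendError are hypothetical stand-ins for the real user service):

    from enum import Enum

    class Lookup(Enum):
        EXISTS = 1
        NOT_FOUND = 2       # deleted, or never existed: safe to report 550
        BACKEND_ERROR = 3   # lookup failed: says nothing about the user, report 4xx

    class BackendError(Exception):
        pass

    def lookup_user(address, user_db):
        try:
            user_db[address]            # raises KeyError if absent; a real backend
                                        # would raise BackendError on infra failure
        except BackendError:
            return Lookup.BACKEND_ERROR
        except KeyError:
            return Lookup.NOT_FOUND
        return Lookup.EXISTS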


> Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place.

This can't work; you can say that gmail's system should have a component that recognizes the difference between various failures, but that new component can itself fail. You can't solve the problem of "what if something fails" by saying "just add a new component that won't fail".


Of course it can. Software is complex and that complexity can cause all kinds of problems, as can the fact that the networks linking computers are unreliable, but software is fundamentally deterministic. If you write a piece of code that returns a temporary failure when it can't look up whether a user exists, that code will not mysteriously change itself to start returning permanent "user does not exist" errors. (Now, if your overall stack is designed in such a way that you can't reliably tell the difference between lookup failures and users that don't exist, you have a problem - but the problem is with the design of the system, not some inherent problem with software.)

Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.


> but software is fundamentally deterministic.

That's true, but human behavior is also fundamentally deterministic, and those two observations are about equally useful.

> Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.

No it isn't. Those are deterministic too.


> that code will not mysteriously change itself to start returning permanent user does not exist errors

That is true in a perfect world. In the current world, there are all sorts of ways that code implemented one day does not run the same the next day. Say the code is in an interpreted language and an unrelated sysop updates the language runtime in a way that changes the behavior. Again, in a perfect world that doesn't happen, but that is not always the world we live in. I have great sympathy with people who treat software systems AS IF they were "physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways".


> doesn't fail completely but cannot access part of the data

If a mail server can't tell whether a user/email is valid, it should either return a temporary failure or accept and queue.

Unless of course you're too big to fail, then you just do whatever you want.


I think we’re just teasing at the notion that “permanent failure” isn’t a hard and fast distinction. I think some polite retry policy is not unreasonable even for the most explicit “permanent failure” response from a remote server. Imagine the most extreme example: hackers take over the remote server and make it respond with “permanent failure.” After a day, the legit owners regain control of the system. You can’t really argue that “the remote server never should have delivered that response unless the failure truly was permanent,” because clearly there was a mismatch between the apparent intent behind the response and the actual intent.


The issue is that hard bounces can cause big issues with your email sending reputation, and too many can make you lose access to mailing services such as Amazon SES, so you're encouraged at all points during the implementation of anything that sends email to blacklist any bounced emails. This of course works fine, right up until Gmail starts bouncing all emails.
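One defensive mitigation (a sketch under assumptions; the threshold and in-memory storage are made up for illustration) is to suppress an address only after hard bounces on several separate days, so a provider-wide glitch like this one can't poison the whole list at once:

    from collections import defaultdict
    from datetime import date

    HARD_BOUNCE_THRESHOLD = 3            # distinct days with a 5xx before suppressing
    hard_bounce_days = defaultdict(set)  # address -> set of dates with a hard bounce

    def record_hard_bounce(address: str, day: date) -> bool:
        """Record a 5xx bounce; return True once the address should be suppressed."""
        hard_bounce_days[address].add(day)
        return len(hard_bounce_days[address]) >= HARD_BOUNCE_THRESHOLD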


I think it’s spot on. Gmail’s failure mode in this scenario isn’t correct. The rest of the internet is functioning as designed.


This is exactly it. The RFC has error codes for temporary failures (just like HTTP 503, for example). If you fail to implement the RFC, the joke's on you.



