It is pretty clear that accepting a TCP connection and reading the bytes of the ...

jeffbee · on Dec 16, 2020

I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.

jrockway · on Dec 16, 2020

Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.

makomk · on Dec 16, 2020

If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaving email server should send temporary failure codes when it can't look up if addresses are valid, and let the sender retry later when the address lookup is working and it can give a definite acceptance or rejection of the email. This is not even remotely a new problem, it comes up in email systems all the time because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.

megous · on Dec 16, 2020

> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up a another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client. 10s of seconds for sure. For example default timeout for postfix smtp client when waiting for HELO is 5minutes.