This outage seems to have lasted about 2.5 hours, and was probably fixed by rolling back whatever caused it. (I don't think the rollout had even finished before they resolved it; my mail server sends a lot of email to Gmail addresses, and even at the peak only about 1/3 of mails were being rejected.)
There is no way that putting in a hardcoded hack like that would have been faster. Making the change itself is, of course, fast.
But then you need to review it (and this is a super risky change, so the review can't be rubber-stamped), build a production build, and run all your qualification tests (hope you found every test that depends on permanent errors being signalled properly). And then you roll it out globally, which is again a risky operation, with the additional problem that rolling restarts simply can't go faster than a certain speed: you can only restart so many processes at once while still continuing to serve traffic.
The kind of thing you describe simply can't be done in 2.5 hours by changing the SMTP server. The best you could hope for is some kind of abuse- or security-related articulation point in the system, with the fast pushes that problem domain requires, but still with enough power to either prevent the requests from reaching the SMTP server at all, or to intercept and change the response.
As a trivial example, something like blocking the SMTP port with a firewall rule could have been viable, though at the cost of degrading performance for everyone rather than just the affected requests.
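To make the "intercept and change the response" idea concrete, here is a minimal sketch of the kind of front you could push quickly: a listener that answers every inbound SMTP connection with a temporary 421 reply, so senders defer and retry instead of bouncing on a bogus permanent error. This is purely illustrative; the banner text, port, and the assumption that such a layer even exists in front of the real servers are mine, not anything Google has described.

```python
import asyncio

# Minimal sketch (not Google's actual setup): answer every inbound SMTP
# connection with a *temporary* 421 reply and close, instead of letting the
# session reach a backend that would wrongly return a permanent 5xx.
# Sending MTAs treat 4xx as "queue and retry later".
# The banner text and port are illustrative assumptions.
BANNER = b"421 4.3.2 mx.example.invalid Service temporarily unavailable, try again later\r\n"

async def refuse(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Reply immediately with the transient failure and drop the connection;
    # a well-behaved sender defers the message rather than bouncing it.
    writer.write(BANNER)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main() -> None:
    # Port 2525 so the sketch runs unprivileged; a real front would sit on 25.
    server = await asyncio.start_server(refuse, host="0.0.0.0", port=2525)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```

The point is that a shim like this touches nothing in the SMTP server's own code, so it avoids the whole review/build/qualify/rollout cycle described above.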
My mail server logs show about 20 failures in the whole of the last week up until yesterday at 20:43 CET, then 350 failures between 20:43 and 00:21, and nothing after that. So, fair enough: from the client side it looks more like 3.5 hours than the 2.5 suggested by the status page.
But still, given that resolution time, the suggested solution of changing the SMTP server is absolutely ludicrous.
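For what it's worth, here is roughly how I pulled those counts out of my logs. The log path and Postfix-style line format are assumptions about my own setup, not a general recipe.

```python
import re
from collections import Counter

# Rough sketch: count deferred/bounced Gmail deliveries per hour from a
# Postfix-style mail log line such as:
#   Dec 14 20:43:12 mx postfix/smtp[123]: ABC: to=<user@gmail.com>, ... status=bounced (...)
# Path and format are assumptions about my setup.
LOG_PATH = "/var/log/mail.log"
LINE = re.compile(
    r"^(\w{3} {1,2}\d{1,2} \d{2}):\d{2}:\d{2} .*to=<[^>]+@gmail\.com>.*status=(deferred|bounced)"
)

per_hour = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if m:
            # Bucket by "Mon DD HH" so the 20:43-00:21 spike stands out.
            per_hour[m.group(1)] += 1

for hour, count in sorted(per_hour.items()):
    print(f"{hour}:00  {count} failures")
```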