You are absolutely correct. That would be a *much* better experience. That said,...

seppel · on March 5, 2024

> That said, getting there strikes me as pretty challenging. Automatically detecting a down state is difficult and any detection is inevitably both error-prone and only works for things people have thought of to check for. The more complex the systems in question, the greater the odds of things going haywire. At Meta's scale, that is likely to be nearly a daily event.

Well, in principle, the frontend just has to distinguish between HTTP status 500 (something broken in the backend, not the fault of the user) and some HTTP status code 4xx (the user did something wrong).
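
A minimal sketch of that distinction, assuming a hypothetical login endpoint whose status code reaches the frontend unchanged (the function name and message strings are made up for illustration):

```python
from http import HTTPStatus


def message_for_status(status: int) -> str:
    """Map an HTTP status from a hypothetical login endpoint to user-facing text."""
    if status == HTTPStatus.UNAUTHORIZED:  # 401: the credentials were rejected
        return "Your username or password is wrong."
    if 400 <= status < 500:  # other client-side errors
        return "There was a problem with your request."
    if 500 <= status < 600:  # backend failure, not the user's fault
        return "Something went wrong on our side. Please try again later."
    return "Unexpected response. Please try again later."
```

This only helps if the backend actually returns a 5xx when it is broken, which is the assumption the rest of the thread questions.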

Kalium · on March 5, 2024

Yes, assuming the responses are usefully different, accurate, and you get responses in a timely manner.

seppel · on March 5, 2024

The "your username/password is wrong" message came in a timely manner. So someone transformed "some unforeseen error" into a clear but wrong error message.

And this caused a lot of extra trouble on top of the incident.

matsemann · on March 5, 2024

But there's something off here. I wouldn't expect to be shown as logged out when the services are down. I'd expect calls to fail with something like a 500 and an error showing "something happened on our side". Not all the apps going haywire.

Kalium · on March 5, 2024

At the scale of Meta, "down" is a nuanced concept. You are very unlikely to get every piece of functionality seizing up at once. What you are likely to get is some services ceasing to function and other services doing error-handling.

For example, if the service that authenticates a user stops working but the service that shows the login form works, then you get a complex interaction. The resulting messaging - and thus user experience - depends entirely on how the login page service was coded to handle whatever failure the authentication service offered up. If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.

At Meta's scale, there's likely quite a few underlying services. Which means we could be getting something a dozen or more complex interactions away from wherever the failures are happening.

jessriedel · on March 5, 2024

Isn't this just the standard problem of reporting useful error messages? Like, yes, there are academic situations where you can't distinguish between two possible error sources, but the vast majority of insufficiently informative error messages in the real world arise because low effort was applied to doing so.

Kalium · on March 5, 2024

Yes and no.

Yes, with the additions of sheer scale, a vast number of services, multiple layers, and the difficulty of defining "down" added in. I think the difficulty of reporting useful error messages is proportional to the number of places an error can reasonably happen and the number of connections it can happen over, and by any metric Meta's got a lot of those.

No, in that detecting when you should be reporting a useful error message is itself a complex problem. If a service you call gives you a nonsense response, what do you surface to the user? If a service times out, what do you report? How do you do all this without confusing, intimidating, and terrifying users to whom the phrase "service timeout" is technobabble?

jessriedel · on March 5, 2024

> If a service you call gives you a nonsense response, what do you surface to the user?

If this occurred during the authentication process, I think I would tell the user "Sorry, the authentication process isn't working. Try again later." rather than "Invalid credentials". And you could include a "[technical details]" button that the user could click if they were curious or were in the process of troubleshooting.
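
As a rough sketch of what I mean, assuming the login form calls some `check(username, password)` function that returns `True` or `False` when healthy (all names here are hypothetical, not anyone's real API):

```python
from dataclasses import dataclass
from enum import Enum


class AuthOutcome(Enum):
    OK = "ok"
    BAD_CREDENTIALS = "bad_credentials"
    UNAVAILABLE = "unavailable"


@dataclass
class LoginResponse:
    outcome: AuthOutcome
    user_message: str
    technical_details: str = ""  # shown only behind a "[technical details]" button


UNAVAILABLE_TEXT = "Sorry, the authentication process isn't working. Try again later."


def attempt_login(check, username: str, password: str) -> LoginResponse:
    """Call a hypothetical auth check and keep outages apart from bad credentials."""
    try:
        result = check(username, password)
    except (TimeoutError, ConnectionError) as exc:
        # The service did not answer: say so instead of blaming the user.
        details = f"{type(exc).__name__}: {exc}"
        return LoginResponse(AuthOutcome.UNAVAILABLE, UNAVAILABLE_TEXT, details)
    if result is True:
        return LoginResponse(AuthOutcome.OK, "Welcome back.")
    if result is False:
        return LoginResponse(AuthOutcome.BAD_CREDENTIALS, "Invalid credentials.")
    # Anything else is a nonsense response; treat it as an outage as well.
    details = f"unexpected response of type {type(result).__name__}"
    return LoginResponse(AuthOutcome.UNAVAILABLE, UNAVAILABLE_TEXT, details)
```

The user-facing text stays free of jargon, and the timeout or malformed-response detail is only there for people who go looking for it.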

sandspar · on March 5, 2024

Slightly unrelated question, but just how "Big" is Meta? I know it's vast, but as an outsider I have trouble grokking the scale of it.

pixl97 · on March 5, 2024

When most people talk about serving thousands and maybe millions of requests per second, Meta talks about billions of requests per second.

https://read.engineerscodex.com/p/how-facebook-scaled-memcac...

shkkmo · on March 5, 2024

> If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.

If you can't distinguish those, then that is bad software design.

lanstin · on March 5, 2024

Come on, use a little imagination. DNS lookup for the db holding the shard with the user credentials disappears. Code isn’t expecting this, throws a generic 4xx because security instead of a generic 5xx (plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username); caller interprets this as a login failure.

Same auth system used to validate logins to the bastions that have access to DNS. Voilà.
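
A toy version of that failure mode, with the DNS failure simulated by a stand-in lookup function (everything here is hypothetical and says nothing about how Meta's code actually looks):

```python
import socket


def lookup_credentials(username: str) -> str:
    """Stand-in for a credentials lookup whose DNS resolution has just failed."""
    raise socket.gaierror("simulated DNS failure for the credentials shard")


def authenticate_catch_all(username: str, password: str) -> int:
    """The 'all failures look like a bad password' stance."""
    try:
        stored = lookup_credentials(username)
        return 200 if stored == password else 401
    except Exception:
        return 401  # an infrastructure failure is reported as a login failure


def authenticate_distinguishing(username: str, password: str) -> int:
    """Same check, but an unreachable backend is reported as a server-side error."""
    try:
        stored = lookup_credentials(username)
    except OSError:  # socket.gaierror is a subclass of OSError
        return 503
    return 200 if stored == password else 401
```

With the lookup broken, the first function returns 401 for everyone, correct password or not, and any caller that trusts it concludes the credentials were wrong.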

shkkmo · on March 5, 2024

> plenty of people writing auth code will take the stance all failures are presented the same as a bad password or non-existing username

Those people would be wrong. You can take all unexpected errors and stick them behind a generic error message like "something went wrong" but you should not lie to your users with your error message.

jtuple · on March 5, 2024

It's about not leaking sensitive information.

If you have different messages for invalid username vs invalid password, you can exploit that to determine if a user has an account at a particular service.

"Invalid credentials" for either case solves this problem.

But sure, let's report infra failures differently, as "unexpected error".

Now, what happens if the unexpected error is only when checking passwords, but not usernames?

Do you report "invalid credentials" when given an invalid username, but "unexpected error" when given a valid name but invalid password?

If so, you're leaking information again and I can determine valid usernames.

So, safe approach is to report "invalid credentials" for either invalid data or partial unexpected errors.

Only time you could safely report "unexpected error" is if both username check and password check are failing, which is so rare that it's almost not worth handling. Esp. at the risk of doing it wrong and leaking info again.
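
To make the leak concrete, here is a small sketch with a made-up user store and a flag that simulates the password backend being down (names and messages are illustrative only):

```python
USERS = {"alice": "hunter2"}  # hypothetical store; plain text only to keep this short


class PasswordBackendDown(Exception):
    """Raised when the (hypothetical) password-checking backend is unavailable."""


def check_password(username: str, password: str, backend_up: bool) -> bool:
    if not backend_up:
        raise PasswordBackendDown("password service unreachable")
    return USERS[username] == password


def login_leaky(username: str, password: str, backend_up: bool = True) -> str:
    if username not in USERS:
        return "invalid credentials"
    try:
        ok = check_password(username, password, backend_up)
    except PasswordBackendDown:
        return "unexpected error"  # only reachable for usernames that exist
    return "ok" if ok else "invalid credentials"


def login_uniform(username: str, password: str, backend_up: bool = True) -> str:
    if username not in USERS:
        return "invalid credentials"
    try:
        ok = check_password(username, password, backend_up)
    except PasswordBackendDown:
        return "invalid credentials"  # same answer whether or not the user exists
    return "ok" if ok else "invalid credentials"
```

During the simulated outage, `login_leaky("alice", "x", backend_up=False)` says "unexpected error" while `login_leaky("mallory", "x", backend_up=False)` says "invalid credentials", so the two responses tell an attacker which username is real. `login_uniform` gives the same answer for both.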

shkkmo · on March 6, 2024

If you really want to hide whether a username is in use, then you also have to obscure the actual duration of the authentication process, among other things. The number of hoops you need to jump through to properly hide username usage is large enough that you need to actually consider whether this is a requirement or not. Otherwise, it is just a cargo cult security practice like password character requirements or mandated password reset periods.

In this case, Facebook does not treat hiding username usage as a requirement. Their password reset mechanism not only exposes username / phone number usage, but ties it to a name and picture. So yes, Facebook returning an error that says credentials are incorrect when it has infrastructure problems is absolutely a defect.
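
To illustrate the timing point, here is one of those hoops as a sketch: doing the same hashing work for unknown usernames as for known ones, using only the Python standard library. The user store, iteration count, and dummy record are assumptions for the example, and this covers only one timing channel among several.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value, not a recommendation


def _hash(password: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


_ALICE_SALT = os.urandom(16)
USERS = {"alice": (_ALICE_SALT, _hash("hunter2", _ALICE_SALT))}  # hypothetical store

# Dummy record so that unknown usernames cost the same hashing work.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = _hash("dummy", _DUMMY_SALT)


def verify(username: str, password: str) -> bool:
    salt, expected = USERS.get(username, (_DUMMY_SALT, _DUMMY_HASH))
    candidate = _hash(password, salt)  # always computed, known user or not
    matches = hmac.compare_digest(candidate, expected)  # constant-time comparison
    return matches and username in USERS
```

Even with this in place, signup and password reset flows have to be consistent too, which is why it is worth deciding up front whether hiding usernames is a real requirement.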

ambichook · on March 5, 2024

what if, if one service doesnt respond at all or responds with something that doesnt fit an expected format that it would if working correctly, the whole thing just says "sorry, we had an error, try again later"? if it has to check both at the same time, and cant check them independently, wouldn't that solve the vulnerability? or am i missing something? totally understandable if i am, i just want to learn /gen
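
One way to read this proposal, as a sketch: run both checks on every attempt, and if either one fails to respond, return the same generic error no matter which username was entered. The store, the flag simulating an outage, and the messages are all hypothetical.

```python
USERS = {"alice": "hunter2"}  # hypothetical store; plain text only to keep this short
GENERIC_ERROR = "sorry, we had an error, try again later"


class BackendDown(Exception):
    """Raised when a (hypothetical) backend check does not respond."""


def username_exists(username: str) -> bool:
    return username in USERS


def password_matches(username: str, password: str, backend_up: bool) -> bool:
    if not backend_up:
        raise BackendDown("password service unreachable")
    # For unknown users .get returns None, which never equals a string password.
    return USERS.get(username) == password


def login_all_or_error(username: str, password: str, backend_up: bool = True) -> str:
    try:
        known = username_exists(username)
        # Called for every attempt, even when the username is unknown.
        matches = password_matches(username, password, backend_up)
    except BackendDown:
        return GENERIC_ERROR
    return "ok" if known and matches else "invalid credentials"
```

Under these assumptions the outage message appears for real and made-up usernames alike, so the error text by itself does not reveal which accounts exist. It depends on the password check being attempted for unknown usernames too, and it does nothing about the timing differences mentioned elsewhere in the thread.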