> Isn't listening for spikes in complaints about outages a great way to detect them?
When there's a major ISP outage, people report problems with all the major sites. When Facebook's down, people report problems with any site that has "Login with Facebook" as an option.
It's almost never actually an outage impacting all of FAANG at once.
> It's almost never actually an outage impacting all of FAANG at once.
Exactly. If you click through down detector when things are _up_ you'll see people still complaining that $site is down. Could be a local power outage or even a flaky connection in their own home.
Down Detector is one of many signal sources and should have a "credibility" score associated with it that's proportional to the number of people complaining that something's down.
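To make that concrete, here's a rough sketch of what such a score could look like. Everything in it is my own assumption, not anything Down Detector actually does: the `credibility` name, the linear scaling, and the `saturation` cutoff of 1000 reports are made up for illustration.

```python
def credibility(report_count: int, saturation: int = 1000) -> float:
    """Return a score in [0, 1] that grows linearly with report volume.

    `saturation` is a hypothetical report count at which we treat the
    signal as fully credible; anything above it is capped at 1.0.
    """
    if report_count <= 0:
        return 0.0
    return min(1.0, report_count / saturation)


# A handful of reports barely registers; a flood of them saturates the score.
few = credibility(20)
many = credibility(50_000)
```

A linear scale is the simplest thing that matches "proportional"; you'd probably want to tune the cutoff per site, since 1000 reports means something very different for Facebook than for a small SaaS.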
I can guarantee you with 100% confidence from experience that the call centers for AT&T, T-Mobile, Comcast, etc. are all blowing up right now because of users who assume that if the Instagram app isn’t loading it means the “wifi” is broken. Also keep in mind “wifi” doesn’t mean 802.11, it means “anything related to the internet” up to and including 4G/5G and Ethernet.
Heh, as soon as I saw Instagram failing to load, I immediately assumed it was Rogers’ fault. They just suck when it comes to reliability and Instagram has a much better track record.
The important step is to filter downdetector from your consciousness. It only exists as rage/cable news bait and nothing more. It is not a useful tool, it’s just a clever way to serve AdWords iFrames.
> do you really think there are masses of people who can’t tell the difference between a single sign on service being down and individual sites being down and reporting it to downdetector?
Absolutely without a doubt.
99.9% of people don’t know what single sign on means or how it works.
> do you really think there are masses of people who can’t tell the difference between a single sign on service being down and individual sites being down and reporting it to downdetector?
Ahh, I see. In that case, most of DownDetector's data comes from Twitter and other sources, not first-party reporting. Even the first-party data is also sourced via "visits to DownDetector", which can come from a simple Google search for "is Instagram down?"
If DownDetector relied primarily on direct reporting, they'd be the last to know.
> do you really think there are masses of people who can’t tell the difference between a single sign on service being down and individual sites being down and reporting it to downdetector?
Yes, absolutely. 100%.
> Even if there were, doesn’t the outage graph give you exactly the information you’re asking to be curated?
> When Facebook's down, people report problems with any site that has "Login with Facebook" as an option.
If users log into your site with Facebook, then the login functionality of your site effectively is down when "Login with Facebook" is down.
From the user's perspective, your subcontractors, including authentication subcontractors, are your problem to deal with and should never be visible to them. From your perspective, you could have architected your site so that logging in doesn't "go down" when Facebook login is down.
If the user chooses "Login with Facebook" over other authentication options available, and they don't want to use other options, educating them with a good error message might help. Or you could remove the Facebook login option, if you (totally reasonably) don't want Facebook's failures to reflect poorly on you.
> If users log into your site with Facebook, then the login functionality of your site effectively is down when "Login with Facebook" is down.
There are plenty of sites where "Login with Facebook" is a convenience but hardly the only way to log in. Reddit, for example, has "Login with Google" and "Login with Apple"; it would be highly misleading to claim "Reddit is down" if Google's OAuth flow was having an outage.
> educating them with a good error message might help
Nothing in the API or OAuth flow would make that doable in an automatic fashion with this outage. It'd have to be something you put up manually as a banner after hearing of the outage.
> Or you could remove the Facebook login option, if you (totally reasonably) don't want Facebook's failures to reflect poorly on you.
I don't particularly care; we're talking about why DownDetector isn't necessarily ideal for assessing. It can be a useful signal, in some scenarios, but I've seen plenty of spurious signals come from it.
> Nothing in the API or OAuth flow would make that doable in an automatic fashion with this outage. It'd have to be something you put up manually as a banner after hearing of the outage.
That is fair: if I choose to architect my site such that a user-critical feature goes down when a 3rd party service goes down, it behooves me to monitor the 3rd party service and do whatever necessary to properly inform users what's going on.
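For what it's worth, a minimal sketch of what "monitor and inform" might look like. It assumes you already have some health probe per login provider feeding a dict of booleans; the `login_page_state` function and the dict shape are hypothetical, and how you actually detect that a provider is down is deliberately left out.

```python
from typing import Dict, List, Optional


def login_page_state(provider_health: Dict[str, bool]) -> Dict[str, object]:
    """Decide which login options to show and whether to show a banner.

    `provider_health` maps a provider name to True (healthy) or False (down).
    The health values are assumed to come from your own monitoring.
    """
    available: List[str] = [name for name, ok in provider_health.items() if ok]
    down: List[str] = [name for name, ok in provider_health.items() if not ok]

    banner: Optional[str] = None
    if down:
        # Tell users which sign-in methods are affected instead of failing silently.
        banner = (
            "Sign-in via " + ", ".join(down)
            + " is currently unavailable. Please use another sign-in method."
        )
    return {"options": available, "banner": banner}


# Example: Facebook login is down, the other options still work.
state = login_page_state({"Facebook": False, "Google": True, "Password": True})
```

The point is just that the unhealthy provider gets hidden and explained, so the user sees "Facebook login is broken" rather than "this site is broken".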
I edited my post unfortunately after you replied, but another option is removing the parts of your site that rely on 3rd parties, if you don't want the failures of those 3rd parties to reflect poorly on you (which they reasonably would).
> we're talking about why DownDetector isn't necessarily ideal for assessing. It can be a useful signal, in some scenarios, but I've seen plenty of spurious signals come from it.
Indeed, and if a bunch of users say that a feature of your site is down, even if it's the result of a 3rd party failure, chances are that part of your site is down, and it's partially your fault for relying on a 3rd party for that feature. The users correctly don't care what the root cause is; they expect you to either mitigate it or not make them rely on an unreliable feature.
Ignore the comments on DownDetector for a moment and check out that huge spike in reports recently. Clearly something wrong happened with AWS's user experience. That's something AWS needs to resolve, in the eyes of their users.
> The chart shows a big spike this morning, but there was no AWS outage
Are you sure? If hundreds of users simultaneously reported there was some sort of outage, particularly a huge spike like we saw, chances are there was an outage.
> Again, DownDetector can be a useful "is something unusual happening right now" signal
Exactly! Specifically, "is something unusual happening right now with my site, in the eyes of my users?" Every site owner should know when that condition is true. What you think about your site's "up-ness" isn't as important as what your users think about it. What you attribute your downtime to isn't as important as what your users attribute it to (you).
> Clearly something is going on with AWS's user experience.
But that's not the case. It's a false positive.
Pick a DownDetector service and open the page every day for a few days. You'll see it most of the time just reflects people waking up in the US timezones.
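If you wanted to check this rather than eyeball it, one option is to compare the current report count against the historical counts for the same hour of day, so the usual morning ramp-up doesn't register as a spike. A rough sketch, assuming you've collected hourly counts yourself; the `is_unusual` name, the data layout, and the 3-sigma threshold are my own choices, not anything DownDetector exposes.

```python
from statistics import mean, stdev
from typing import Dict, List


def is_unusual(
    history: Dict[int, List[int]],
    hour: int,
    current: int,
    threshold: float = 3.0,
) -> bool:
    """Flag `current` only if it is far above the norm for this hour of day.

    `history` maps an hour (0-23) to report counts seen at that hour on
    previous days. At least two samples per hour are needed for stdev().
    """
    samples = history[hour]
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        # No variation in the baseline: anything above it counts as unusual.
        return current > mu
    return (current - mu) / sigma > threshold


# Hypothetical data: ~100 reports is normal at 9am, ~5 is normal at 3am.
history = {9: [100, 110, 90, 105, 95], 3: [5, 6, 4, 5, 5]}
morning = is_unusual(history, hour=9, current=108)  # within the usual range
night = is_unusual(history, hour=3, current=108)    # far above the usual range
```

Same raw count, opposite conclusions, which is the whole point about people waking up.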
Is it a false positive, though? The data shows there was an outage. We would need more evidence to conclude hundreds of users, at that 1 spike, weren't actually having issues.
In other words, we have hundreds of people saying there was an outage, and 1 person saying there wasn't.
That's a problem AWS needs to resolve, regardless of what they think might be the root cause. If the users weren't experiencing any issues with AWS, I doubt they'd be reporting it.
Your comment about timing is a good point: if people are working with AWS early in the day, and AWS is giving them problems, then they will probably report problems with AWS early in the day. I wouldn't expect them to report problems while they're sleeping.
Hundreds of users, representing more users who didn't bother reporting, say they experienced issues when interacting with AWS this morning, so we'll need better evidence to the contrary to conclude otherwise.
The fact that some people accessed AWS without reporting issues does not mean that all people did. For those who had issues, AWS is responsible for dealing with those perceptions.
Indeed, it could have been a fault that affected a subset of users, for example 1 service in 1 availability zone. That's still an outage in the eyes of users, which AWS is responsible for managing. It could have been an issue with a route from 1 ISP. That's still an outage in the eyes of users, which AWS is responsible for managing.
An even better example is the DownDetector page for Facebook, with hundreds of thousands of reports. Do we really think there's no correlation between what DownDetector reports and what users experience?
tl;dr: what users think about your site is more important than both what you think about your site and the reality of your site, and you should be tracking it.
> When there's a major ISP outage, people report problems with all the major sites. When Facebook's down, people report problems with any site that has "Login with Facebook" as an option.
Yes? That's how all top-level reporting is going to work. It's not going to tell you which part of your service is inaccessible. It's just telling you that people can't access it. You obviously have to do additional investigation to figure out why people are having trouble.
Even here on HN, where people should know better, people take its incorrect attribution as useful info. TikTok isn't down. X isn't down. Google isn't down.
I would completely agree that people are bad at interpreting Down Detector-type results, but that doesn't mean it isn't providing a very useful signal.
Indeed, I haven't noticed any blip in functionality, but then again I don't ever do FB (or other external service) login. There's absolutely no reason to do so; the long-term drawbacks are too serious to be lazy about this.
That's kinda the point though, isn't it? DownDetector is showing an early indication of a major outage in both of your examples. The issue may not be caused by the indicated service, but it's still a useful information source, especially when we can correlate reports on there with what we are seeing in our internal monitoring.
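Something like this is what I mean by correlating, as a very rough sketch. The `triage` function, its labels, and the 5% error-rate threshold are all hypothetical; in practice the inputs would come from whatever external-report feed and internal metrics you already have.

```python
def triage(
    external_spike: bool,
    internal_error_rate: float,
    error_threshold: float = 0.05,
) -> str:
    """Combine an external report spike with our own error rate.

    Neither signal is trusted alone: the external one says "users are
    unhappy", the internal one says whether we can see why.
    """
    internal_bad = internal_error_rate > error_threshold
    if external_spike and internal_bad:
        return "likely our outage"
    if external_spike:
        # Users are complaining but our metrics look fine: could be an ISP,
        # a third-party dependency, or plain misattribution.
        return "investigate upstream or misattribution"
    if internal_bad:
        return "internal issue, users not reporting yet"
    return "normal"


result = triage(external_spike=True, internal_error_rate=0.01)
```

The interesting case is the second branch: a spike with clean internal metrics doesn't mean "ignore it", it means look at what sits between you and your users.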
A big spike on DownDetector is an indication of something going on.
Its attribution of what/who is often incorrect. You'll see "maybe it's more than Big Site X!" comments come up on every HN thread like this citing DownDetector; it's almost never the case, and folks on HN should know better.