Hacker News

Status pages are basically useless if they’re public facing.

Either they update automatically based on health checks (like some of the Internet backbone health tests), or they're updated manually.

If they’re automatic, they’re almost always internal and not public. If they’re manual, they’re almost always delayed and not updated until after the outage is posted to HN anyway.
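The automated flavor is simple enough to sketch. Here's a minimal, hypothetical probe (the `/healthz` endpoint, thresholds, and output file are all assumptions, not any particular provider's setup) that records status and latency so the result can be published as-is to a public status page:

```python
# Minimal sketch of an automated, public-facing status check.
# The endpoint URL, timeout, and output location are hypothetical.
import json
import time
import urllib.request

ENDPOINT = "https://example.com/healthz"  # hypothetical health endpoint
TIMEOUT_S = 5

def probe(url: str = ENDPOINT) -> dict:
    """Hit the health endpoint and record status plus latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            ok = resp.status == 200
    except Exception:
        # Any failure (timeout, DNS, HTTP error) counts as "down".
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return {"ok": ok, "latency_ms": round(latency_ms, 1), "checked_at": int(time.time())}

# Publish the raw result where the status page is served from, e.g.:
# json.dump(probe(), open("status.json", "w"))
```

The point is that the raw probe result goes straight to the public page with no human in the loop, which is exactly what the manual-update pages lack.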



The other problem with status pages is that, depending on what happened, it may not be possible to update the status page at all. You really need a third party to have a useful status page.


Which is pretty much what Downdetector has evolved into. And it looks like they have an enterprise offering to alert companies to their own issues.


Which is better? How do you know whether an issue is individual to a customer or a quick blip that will resolve in a few seconds?


I prefer fully automated tests publicly revealed because the main thing I want to know (as a customer) is should I keep trying to fix my end or give up because GitHub exploded again.

It’s most annoying when you have something like I did recently: known maintenance work on my upstream home fiber connection that was causing service degradation (not complete loss, but my fiber line was down to DSL or even dialup speeds). The chat lady could see that my area was affected, but the issue lookup system couldn’t.

If the issue lookup had told me there was an issue I’d’ve gone on my merry way.

I even checked a few more times until it was resolved; the issue never appeared in the issue lookup system.


> should I keep trying to fix my end or give up because GitHub exploded again

Making this decision easy is a fight I fight for my customers every day. :)


This was much much much easier when websites used to explode with tracebacks and other detailed error messages, now you just get a "whoopsie doopsie we did a fuckywucky" and you can't really tell what's going on.


You can't operate at any scale at all without mechanisms in place to know perfectly well whether an issue is impacting a single customer or your whole world is on fire.


You'd like to think so, but a surprisingly large number of "large scale" things operate on "everything is fine" until too many people complain about the fire.


Caches make problems fun too.

Quite often you see automated tests that only check how well your cache / in-memory data are working, so everything looks healthy. But when some other customer whose data isn't in the hot path makes a request, it times out. I've seen a lot of automated checking systems fail at exactly this.
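A toy illustration of that blind spot (all names here are hypothetical): a naive check that only reads a warm cache entry passes even while the backend behind every cache miss is dead, whereas a check that deliberately uses a key that can't be cached catches it.

```python
# Toy model of a cache-only health check missing a cold-path failure.
import time

CACHE = {"hot_customer": "cached-profile"}  # warm entry, always fast

def slow_backend(key: str) -> str:
    """Stand-in for a degraded database: every cache miss 'times out'."""
    raise TimeoutError(f"backend lookup for {key!r} timed out")

def get_profile(key: str) -> str:
    if key in CACHE:
        return CACHE[key]      # hot path: served from memory
    return slow_backend(key)   # cold path: hits the broken backend

def naive_check() -> bool:
    """Only exercises the warm cache entry, so it always passes."""
    return get_profile("hot_customer") == "cached-profile"

def cold_path_check() -> bool:
    """Uses a fresh synthetic key that can't be warm, so it bypasses the cache."""
    try:
        get_profile(f"synthetic-{time.time_ns()}")
        return True
    except TimeoutError:
        return False
```

Here naive_check() keeps returning True while cold_path_check() returns False, which is the gap the comment above describes.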


The phrase “the hardest parts of computer science are caching and naming things” comes to mind.


I see 2 things here but you're off by one.


Yes, but those mechanisms take time to determine this.



