>You can take it on bad faith It's smart politics -- I don't blame them but I do...

toss1 · on Dec 11, 2021

>>It's smart politics -- I don't blame them

Um, so you think straight-up lying is good politics?

Any 7-year-old knows that telling a lie when you broke something makes you look better superficially, especially if you get away with it.

That does not mean that we should think it is a good idea to tell lies when you break things.

It sure as hell isn't smart politics in my book. It is straight-up disqualifying for doing business with them. If they are not honest about the status or amount of service they are providing, how is that different from lying about your prices?

Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?

We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.

efitz · on Dec 11, 2021

You don’t know what you’re talking about.

AWS spends a lot of time thinking about this problem in service to their customers.

How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers: a failed memory module or top-of-rack switch or a random bit flip in one host for one service.

Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it as it had no bearing on their experience.

What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.

As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.

Karunamon · on Dec 11, 2021

>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

A good low-hanging fruit would be, when the outage is significant enough to have reached the media, you turn the dot red.

Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.

efitz · on Dec 13, 2021

Human-in-the-loop != lying.

Broken dashboard != lying.

The specific charge of “lying” is what I dispute.

toss1 · on Dec 11, 2021

>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

There's a limitless variety of options, and multiple books written about it. I can recommend the series "The Visual Display of Quantitative Information" by Edward Tufte, for starters.

>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...

Of course there are, so a single R/Y/G indicator is obviously a bad choice.

Again, they could at any time easily choose a better way to display this information, graphs, heatmaps, whatever.

More importantly, the one thing that should NOT be chosen is to have a human in the loop of displaying status, as this inserts both delay and errors.

Worse yet, to make it so that it is a VP-level decision, as if it were a $1 million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.

>>You don’t know what you’re talking about.

Look in the mirror, dude. While I haven't worked inside AWS, I have worked in complex network software systems and well understand the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those lies on an everyday basis are what goes into their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.

sorry_outta_gas · on Dec 11, 2021

That's only useful when it's an entire region; there are minor issues in smaller services that cause problems for a lot of people that they don't reflect in their status board, and not everyone checks twitter or HN all the time while at work

it's a bullshit board used to fudge numbers when negotiating SLAs

like I don't care that much, hell my company does the same thing; but let's not get defensive over it