*"Amazon Secure Token Service (STS) experienced elevated latencies"* I was getti...

WaxProlix · on Dec 11, 2021

STS is the worst with this. Even for other internal teams, they seem to treat dropped requests (i.e., timeouts, which show up as 5xxs on the client side) as 'non-faults', and so don't include those data points in their graphs and alarms. It's really obnoxious.

AWS in general is trying hard to do the right thing for customers, and obviously has a long way to go. But man, a few specific orgs have some frustrating holdover policies.

hericium · on Dec 11, 2021

> AWS in general is trying hard to do the right thing for customers

You are responding to a comment that suggests they're misrepresenting the truth in their communication to customers (which wouldn't be the first time, even in the last few days).

As always, they are doing the right thing for themselves only.

EDIT: I think you should mention that you're an engineer at Amazon AWS in your comment.

tybit · on Dec 11, 2021

It was very clear from their post that they were criticising STS from the perspective of an engineer in AWS within a different team.

hericium · on Dec 11, 2021

I assumed in good faith that this was someone who knew the internals as a larger customer, not an AWS person shit-talking other AWS teams.

I only got curious after a downvote, hence the late edit. My bad.

ignoramous · on Dec 11, 2021

> ...an AWS person shit-talking other AWS teams [in public].

I remember a time when this would have meant an instant reprimand... Either amzn engs are bolder these days, or amzn hr is trying really hard for amzn to be "world's best employer", or both.

filoleg · on Dec 11, 2021

Gotta deanonymize the user to reprimand them. Maybe I am wrong here, but I don’t see it as something an Amazon HR employee would actually waste their time on (exceptions apply for confidential info leaks and other blatantly illegal stuff, of course). Especially given that it might as well be impossible, unless the user incriminated themselves with identifiable info.

WaxProlix · on Dec 11, 2021

It's true that I shouldn't have posted it; I was mostly just in a grumpy mood. It's still considered very bad form. I'm not actually there anymore, but the idea stands.

jrockway · on Dec 11, 2021

I suppose all outages are just elevated latency. Has anyone ever had an outage and said "fuck it, we're going out of business" and never came back up? That's the only true outage ;)

hericium · on Dec 11, 2021

5xx errors are servers or proxies giving up on requests. Raising timeouts so that requests eventually succeed might fairly be called "elevated latency" (though that would rarely be the proper way to solve a similar issue).

They treat 5xx errors as non-errors, but the rest of the world doesn't. "Increased timeouts" is Amazon's untruthful term for "not working at all".

comboy · on Dec 11, 2021

So many lessons in this article. When your service goes down but eventually gets back up, it's not an outage. It's "elevated latency". Of a few hours, maybe days.