
"Amazon Secure Token Service (STS) experienced elevated latencies"

I was getting 503 "service unavailable" from STS during the outage most of the time I tried calling it.

I guess by "elevated latency" they mean the latency seen by anyone whose retry logic kept trying through many consecutive failures?
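To make that concrete, here is a minimal sketch of how client-side retries can turn a run of 503s into what looks like one slow success. Everything in it is hypothetical: `make_flaky_service` stands in for a service that fails a fixed number of times, and the backoff parameters are arbitrary. It says nothing about how STS or any AWS SDK actually behaves.

```python
from typing import Callable, Tuple


def call_with_retries(
    call: Callable[[], int],
    max_attempts: int = 8,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = lambda seconds: None,  # no-op so the demo runs instantly
) -> Tuple[int, int, float]:
    """Retry `call` on 5xx statuses with exponential backoff.

    Returns (final_status, attempts_made, total_backoff_seconds).
    """
    waited = 0.0
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = call()
        if status < 500:
            return status, attempt, waited
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay)
            waited += delay
    return status, max_attempts, waited


def make_flaky_service(failures: int) -> Callable[[], int]:
    """Hypothetical stand-in: returns 503 `failures` times, then 200."""
    remaining = [failures]

    def call() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            return 503
        return 200

    return call


# A client with retries sees one slow success after five hidden failures.
status, attempts, waited = call_with_retries(make_flaky_service(5))
print(status, attempts, round(waited, 1))

# A client without retries just sees the 503.
print(call_with_retries(make_flaky_service(5), max_attempts=1))
```

Only the retrying caller experiences this as latency. A caller that makes a single attempt experiences it as an error.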




STS is the worst with this. Even for other internal teams, they seem to treat dropped requests (i.e., timeouts, which look like 5xxs on the client side) as 'non-faults', and so don't count those data points in their graphs and alarms. It's really obnoxious.

AWS in general is trying hard to do the right thing for customers, and obviously has a long way to go. But man, a few specific orgs have some frustrating holdover policies.
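As a rough illustration of why excluding dropped requests matters, here is a toy availability calculation. The `availability` function and the sample numbers are invented for this sketch and are not how any AWS team actually computes its metrics.

```python
from typing import Iterable, Optional


def availability(outcomes: Iterable[Optional[int]], count_timeouts: bool) -> float:
    """Fraction of requests that succeeded.

    Each outcome is an HTTP status code, or None for a request that was
    dropped or timed out before the server recorded a response.
    """
    ok = 0
    total = 0
    for status in outcomes:
        if status is None and not count_timeouts:
            continue  # dropped requests vanish from the metric entirely
        total += 1
        if status is not None and status < 500:
            ok += 1
    return ok / total if total else 1.0


# Hypothetical sample: 10 successes, 90 requests that timed out.
outcomes = [200] * 10 + [None] * 90

print(availability(outcomes, count_timeouts=False))  # dashboard view
print(availability(outcomes, count_timeouts=True))   # client view
```

With these made-up numbers, the server-side view reports every recorded request as a success, while clients see nine out of ten calls fail.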


> AWS in general is trying hard to do the right thing for customers

You are responding to a comment that suggests they're misrepresenting the truth (which wouldn't be the first time, even in the last few days) in communications to their customers.

As always, they are doing the right thing for themselves only.

EDIT: I think that you should mention being an Engineer at Amazon AWS in your comment.


It was very clear from their post that they were criticising STS from the perspective of an engineer in AWS within a different team.


I assumed in good faith that this is someone knowing internals as a larger customer, not an AWS person shit-talking other AWS teams.

Got curious only after a downvote, hence the late edit. My bad.


> ...an AWS person shit-talking other AWS teams [in public].

I remember a time when this would be an instant reprimand... Either amzn engs are bolder these days, or amzn hr is trying really hard for amzn to be "world's best employer", or both.


Gotta deanonymize the user to reprimand them. Maybe I am wrong here, but I don't see it as something an Amazon HR employee would actually waste their time on (exceptions apply for confidential info leaks and other blatantly illegal stuff, of course). Especially given that it might well be impossible, unless the user incriminated themselves with identifiable info.


It's true that I shouldn't have posted it, was mostly just in a grumpy mood. It's still considered very bad form. I'm not actually there anymore, but the idea stands.


I suppose all outages are just elevated latency. Has anyone ever had an outage and said "fuck it, we're going out of business" and never came back up? That's the only true outage ;)


5xx errors are servers or proxies giving up on requests. Requests that eventually succeeded thanks to longer timeouts might fairly be called "elevated latency" (though raising timeouts is rarely a proper way to solve this kind of issue).

They treat 5xx errors as non-errors, but this is not the case with the rest of the world. "Increased timeouts" is Amazon's untruthful term for "not working at all".


So many lessons in this article. When your service goes down but eventually gets back up, it's not an outage. It's "elevated latency". Of a few hours, maybe days.



