Running out of file handles and other IO limits is embarrassing and happens at e...

solatic · on Nov 28, 2020

> The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there.

I see where you're coming from with this, but you really have to wonder. It sounds more like the original architects made implicit assumptions regarding scale that, likely due to original architects and engineers moving on, were not re-evaluated by the current engineers on Kinesis as Kinesis grew. While it may take an hour now for the front-end cache to sync, I find it highly unlikely that it needed that much time when Kinesis first launched.

The process failure here is organizational, one where Amazon failed to audit its current systems in a complete and current manner such that sufficient attention and resources could be paid to a re-architecture of a critical service before it caused the service to fail. Even now, vertically scaling the front-end cache fleet is just a band-aid - eventually, that won't be possible anymore. Sadly, the postmortem doesn't seem to identify the organizational failure that was the true root cause of the outage.

edoceo · on Nov 28, 2020

Oof. My little company is refactoring some five year old architecture design choices. Ugly. Process isn't visible outside the refactor and the work is tedious. Can't imagine what a service refactor is like at A. I bet it sucks

ayberk · on Nov 28, 2020

Disclaimer: I work for AWS, but I have no ties to Kinesis. Opinions are my own.

AWS has more than enough learnings to avoid these "events". The problem is the whole culture is focused on delivering new stuff instead of preventing problems and improving existing systems.

Some folks made these decisions with best intentions. I'm pretty sure they all got promoted and then moved on. Now, people who inherited these systems have no incentive to fix these, because at most you'll receive a pat in the back. I'm also pretty sure that those teams talked about the shortcomings of the current architecture. TODOs must be present somewhere in the backlog.

I don't think this is an AWS specific problem, but we have to start treating people who prevent problems like the heroes they are. Everyone congratulates when you put out a fire. No one gives a damn if you prevent the fire in the first place.

toomuchtodo · on Nov 29, 2020

> I don't think this is an AWS specific problem, but we have to start treating people who prevent problems like the heroes they are. Everyone congratulates when you put out a fire.

This is sort of comforting to hear that Google’s same problems have reached Amazon, in that no tech behemoth is immune to prioritizing promotion and glitz over the maintenance grind.

> No one gives a damn if you prevent the fire in the first place.

Amen.

mooreds · on Nov 29, 2020

> This is sort of comforting

I would personally put emphasis on the "sort of" clause.

If bigcos with tons of resources can't align incentives to build solid software instead of deliver new features, who among us has a chance?

nvader · on Dec 1, 2020

The GP derives comfort from exactly that point--at some point, the behemoths calcify, and the incentives that come with size lead to a slower growth, if not outright decline.

This gives the rest of us a chance to one day have our day in the sun. We can deliver solid software, precisely because we are not big. And then, perhaps one day we may have the choice of growing into a bigco ourselves, or staying small.

ThrowawayR2 · on Nov 29, 2020

> "The problem is the whole culture is focused on delivering new stuff instead of preventing problems and improving existing systems."

The same disease exists at other FAANGs and large tech companies. Nobody ever gets a promo for maintenance work and being on a sustained engineering team is seen as a career dead-end.

schoolornot · on Nov 29, 2020

Another reason to have technical PMs & managers... because even if they don't have meticulous understandings of underlying systems, they can make cases for additional headcount/funding and recognize efforts that will affect tomorrow's bottom lines/issues.

Twirrim · on Nov 28, 2020

> Can't imagine what a service refactor is like at A. I bet it sucks

It's not all that hard. AWS heavily focuses on Service Oriented Architecture approaches, with specific knowledge/responsibility domains for each. It's a proven scalable pattern. The APIs will often be fairly straight-forward behind the front end. With clearly lines of responsibility between components, you'll almost never have to worry about what other services are doing. Just fix what you've got right in front of you. This is an area where TLA+ can come in handy too. Build a formal proof and rebuild your service based on it.

I joined Glacier 9 months after launch, and it was in the band-aid stage. In cloud services your first years will roughly look like:

1) Band-aids, and emergency refactoring. Customers never do what you expect or can predict them to do, no matter how you price your service to encourage particular behaviour. First year is very much keep the lights on focused. Fixing bugs and applying band-aids where needed. In AWS, it's likely they'll target a price decrease for re:invent instead of new features.

2) Scalability, first new feature work. Traffic will hopefully be picking up by now for your service, you'll start to see where things may need a bit of help to scale. You'll start working on the first bits of innovation for the platform. This is a key stage because it'll start to show you where you've potentially painted yourself in to a corner. (AWS will be looking for some bold feature to tout at Re:Invent)

3) Refactoring, feature work starts in earnest. You'll have learned where your issues are. Product managers, market research, leadership etc. will have ideas about what new features to be working on, and have much more of a roadmap for your service. New features will be tied in to the first refactoring efforts needed to scale as per customer workload, and save you from that corner you're painted in to.

Year 3 is where some of the fun kicks in. The more senior engineers will be driving the refactoring work, they know what and why things were done how they were done, and can likely see how things need to be. A design proposal gets created and refined over a few weeks of presentations to the team and direct leadership. It's a broad spectrum review, based around constructive criticism. Engineers will challenge every assumption and design decision. There's no expectation of perfection. Good enough is good enough. You just need to be able to justify why you made decisions.

New components will be built from the design, and plans for roll out will be worked on. In Glacier's case one mechanism we'd use was to signal to a service that $x % of requests should use a new code path. You'd test behind a whitelist, and then very very slowly ramp up public traffic in one smaller region towards the new code path while tracking metrics until you hit 100%, repeat the process on a large region slightly faster, before turning it on everywhere. For other bits we'd figure out ways to run things in shadow mode. Same requests hitting old and new code, with the new code neutered. Then compare the results.

side note: One of the key values engineers get evaluated on before reaching "Principal Engineer" level is "respect what has gone before". No one sets out to build something crap. You likely weren't involved in the original decisions, you don't necessarily know what the thinking was behind various things. Respect that those before you built something as best as suited the known constraints at the time. The same applies forwards. Respect what is before you now, and be aware that in 3-5 years someone will be refactoring what you're about to create. The document you present to your team now will help the next engineers when they come to refactor later on down the line. Things like TLA+ models will be invaluable here too.

xiwenc · on Nov 28, 2020

Indeed your side note should be the first point. I see that very often in practice. As a result, products get rewritten all the time because developers dont want to spend time understanding the current system believing they can do better job than the previous one. They will create different problems. And the cycle repeats.

The not-invented-here (or by-me) syndrome is probably also at play here.

pentlander · on Nov 28, 2020

As another ex-Glacier dev, I disagree with it not being "all that hard" but I agree with everything else. Now I'm curious who this is :)

Twirrim · on Nov 29, 2020

The English guy, if that's enough of a clue. It has been about 4 1/2 years now since I left Glacier, so there's every chance our paths never overlapped.

pentlander · on Nov 29, 2020

Is it PG? If so, I was an intern at the time and you were gone once I joined full time.

Twirrim · on Nov 29, 2020

That's me :)

mooreds · on Nov 29, 2020

> "respect what has gone before"

Ah, Chesterton’s Fence strikes again!

https://en.wikipedia.org/wiki/G._K._Chesterton#Chesterton's_...

edoceo · on Nov 28, 2020

Your side note should be the first point.

roman_sf · on Nov 29, 2020

>> A design proposal gets created and refined over a few weeks of presentations to the team and direct leadership.

Dat soundz like a bank and not a cloud provider.

Twirrim · on Nov 29, 2020

> Dat soundz like a bank and not a cloud provider.

The first stage in making something reliable, sustainable, and as easy to run as possible is to understand the problem, and understand what you're trying to achieve. You shouldn't be writing any code until you've got that figured out, other than possibly to make sure you understand something you're going to propose.

It's good software engineering, following practices learned, overhauled, and refined over decades, that have a solid track record of success. It's especially vital where you're working on something like AWS, Azure, etc. cloud services.

If you leap feet first in to solving a problem, you'll just end up with something that is unnecessarily painful down the road, even in the near term. It's often quicker to do the proposal, get it reviewed, and then produce the product than it is to dive in and discover all the gotchas as you go along. The process doesn't take too long, either.

Every service in AWS will follow similar practices, and engineers do it often enough that whipping up a proposal becomes second nature and takes very little time. Just writing the proposal in and of itself is valuable because it forces you to think through your plan carefully, and it's rare for engineers not to discover something that needs clarified when they write their plan down. (side-note: All of this paperwork is also invaluable evidence for any promotion that they may be wanting, arguably as much as actually releasing the thing to production). It shouldn't take a day to write a proposal, and you'd only need a couple of meetings a few days apart to do the initial review and final review. Depending on the scope of what came up in the initial review, the final review may be a quick rubber stamp exercise or not even necessary at all.

Where I am now, we've got an additional cross-company group of experienced engineers that can also be called on to review these proposals. They're almost always interesting sessions because it brings in engineers who will have a fresh perspective, rather than ones with preconceived notion based on how things currently are.

An anecdote I've shared in greater detail here before: Years ago we had a service component that needed created from scratch and had to be done right. There was no margin for error. If we'd made a mistake, it would have been disastrous to the service. Given what it was, two engineers learned TLA+, wrote a formal proof, found bugs, and iterated until they got it fixed. Producing the java code from that TLA+ model proved to be fairly trivial because it almost became a fill-in-the-blanks. Once it got to production, it just worked. It cut down what was expected to be a 6 months creation and careful rollout process down to just 4 months, even including time to run things in shadow mode worldwide for a while with very careful monitoring. That component never went wrong, and the operational work for it was just occasional tuning of parameters that had already been identified as needing to be tune-able during design review.

In an ideal world, we'd be able to do something like how Lockheed Martin Corps did for the space shuttles: https://www.fastcompany.com/28121/they-write-right-stuff, but good enough is good enough, and there's ultimately diminishing returns on effort vs value gained.

pentlander · on Nov 28, 2020

The hand rolled gossip protocol (DFDD) is not used for consensus, it's just used for cluster membership and health information. It's used by pretty much every foundational AWS service. There's a separate internal service for consensus that uses paxos.

The thread per frontend member definitely sounds like a problematic early design choice. It wouldn't be the first time I heard of an AWS issue due to "too many threads". Unlike gRPC, the internal RPC framework defaults to a thread per request rather than an async model. The async way was pretty painful and error prone.

joneholland · on Nov 28, 2020

Are they still using Coral and Codigo as the RPC stack?

pentlander · on Nov 28, 2020

Yeah Amazon still runs on Coral, there were some recent (release a few years ago) advances on it under the hood and ergonomically. I think the "replacement" for it is Smithy[0] though it will likely just replace the XML schema and codegen and not the protocol. Honestly at this point I think it would be in Amazon's best interest to heavily invest in Java Project Loom rather than trying to convert to async.

[0] https://awslabs.github.io/smithy/

morsma · on Nov 28, 2020

Not Codigo, but Coral, yes.

8note · on Nov 28, 2020

Well, there's still codigo around, but coral is quite pleasent

amf12 · on Nov 29, 2020

Yep, that was my understanding as well. It doesn't seem to be for consensus.

Although, for Frontend servers which just do auth, routing, etc - why is P2P gossip necessary for building shard map? Possibly because retrieval of configuration information directly from the vending service may be a bottleneck - but then why not gossip with a subset of peers than every peer and the vending service which is a source of truth.

ignoramous · on Nov 28, 2020

Are you sure Kinesis uses DFDD [0]?

[0] Seems like a relic of years gone by https://patents.justia.com/patent/9838240

pentlander · on Nov 28, 2020

Though I no longer work for Amazon, I'm reasonably certain they use it from the description. Especially given I know for a fact that other more foundational services use it.

Why is it a "relic of years gone by"? Consul uses a similar, though more advanced technique[0]. Consul may not be as widely used as etcd, but I don't think most would consider it a relic.

[0] https://www.consul.io/docs/architecture/gossip

cowsandmilk · on Nov 28, 2020

That patent is from when Kinesis Data Streams were originally announced to the public. Any reason not to think it uses it. Seems like it would have been a logical choice in the initial architecture and change is slow.

star-trek-fleet · on Nov 28, 2020

I led the storage engine prototyping for Kinesis in 2012 (the best time in my career so far).

Kinesis uses Chain Replication, a dead simple fault tolerante storage algorithm: machines formed a chain, data flow from head to tail in one direction, writes always start at head, and read at tail, new nodes always join at tail, but nodes can be kicked out at any position.

The membership management of chain node is done through a paxos-based consensus service like chubby or zookeeper. Allan [2] (the best engineer I personally worked with so far, way better than anyone I encountered) wrote that system. The Java code quality shows itself after the first glance. Not mentioning the humbleness and openness in sharing his knowledge during early design meetings.

I am not sure what protocol is actually used now. But I would be surprised it's different, given the protocol's simplicity and performance.

[1] https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf [2] https://www.linkedin.com/in/allan-vermeulen-58835b/

ryanworl · on Nov 28, 2020

Can you explain why the sequence numbers are so giant? I've never understood that.

star-trek-fleet · on Nov 28, 2020

I dont remember the size, is it 128bits?

It was chosen for future expansion. Kinesis was envisioned to be a much larger-scale Kafka + Storm (storm was the streaming programming framework popular in 2012, it was since falls out of favor).

ryanworl · on Nov 28, 2020

128bit might be accurate, I meant more along the lines of they are non-contiguous and don't seem to be correlated with the amount of records actually being written to a stream.

crgwbr · on Nov 28, 2020

I always assumed they were sequential per-region rather than per-stream, but that’s just a guess.

star-trek-fleet · on Nov 28, 2020

I recall it was sequential per-shard, but no ordering among the shards of the same stream. But I literally haven't touch Kinesis after 2013.

marcinzm · on Nov 28, 2020

I don't think it's about growing fast so much as, from those I talked to, Amazon now has a fairly bad reputation in the tech community. You only go to work there if you don't have a better option (Google, Facebook, etc) or have some specialty skill they're willing to pay for. Pay is below other FAANG companies and the work culture isn't great (toxic even some would say).

edit: They also had the most disorganized and de-centralized interview approach from all the FAANG companies I talked with. Which isn't growing pains this far in, it's just bad management and process.

ActorNightly · on Nov 28, 2020

Just as a general reminder to anyone reading this: forum comments are incredibly biased and hardly ever represent reality accurately.

marcinzm · on Nov 28, 2020

Yup, as I said, just my personal observations based on those I talked to. Your mileage will vary.

imajoredinecon · on Nov 28, 2020

Interesting re interview experience

I interviewed as a new grad SWE and the process was totally straightforward, and way lower friction (albeit much less human interaction, which made it feel even more impersonal) than almost everywhere else I applied: initial online screen, online programming task, and then a video call with an engineer where you explained your answer to the programming task.

marcinzm · on Nov 28, 2020

I was doing machine learning so more specialized than regular SDE. Other companies it was talk to recruiter, phone screen with manager, and then virtual onsite interviews. Hiring was either not team specific or the recruiter helped manage the process (ie: what does this role actually need). Very clear directions on what type of questions will be asked, format of the interviews, what to prepare for, etc. Amazon the recruiter just told me to look on the job site and then, despite me being clear, applied me to the wrong role. Then got one of those automated coding exercises despite 15 year experience and an internal referral. Wasn't hard but it also pointless since the coding exercise was for the wrong role. Finally got a phone screen and they asked me nothing but pedantic college textbook questions for 40 minutes. Recruiter provided no warning for that.

edit: You could blame the recruiter but every other company had a well oiled machine for their recruiters. So even if they provided only generic information there was still a standard process for what they provided.

akhilcacharya · on Nov 28, 2020

The process for new grads and interns is different from industry hires and is decided by team.

mav3rick · on Dec 1, 2020

Why doesn't Amazon have one bar across the company ? Doesn't it lead to varying levels of talent across orgs ?

akhilcacharya · on Dec 1, 2020

That's what bar raisers are for

amf12 · on Nov 29, 2020

> I don't think it's about growing fast so much as, from those I talked to, Amazon now has a fairly bad reputation in the tech community.

My personal observation having known quite a few Amazon SWEs and interviewed them.

The bad rep is only for the junior roles. SWEs who work at AWS and are high L5+ are pretty solid.

akhilcacharya · on Nov 30, 2020

What exactly does high L5+ mean?

amf12 · on Nov 30, 2020

SWEs who weren't just recently promoted to L5 at Amazon. They have some experience at that level. Granted, there could be some bias because it's not easy to pinpoint when they were promoted.

akhilcacharya · on Nov 30, 2020

As a recently promoted L5, what bad rep should I be shielding myself from as a junior?

nixass · on Nov 28, 2020

Very anecdotal

reducesuffering · on Nov 29, 2020

Here's some anecdata that matches up with GP[0]. Note how Amazon is substantially worse-rated than most other companies.

[0]https://www.teamblind.com/company/Amazon/

Disclaimer: Am Amazon employee

jknoepfler · on Nov 28, 2020

It irks me to this day that AWS all-hands meetings (circa 2015) celebrated an exponential hiring curve (as in the graph was greeted with applause and touted as evidence of success by the speaker). The next plot would be an exponential revenue curve with approximately the same shape. Meanwhile the median lifespan of an engineer was ~10 months. I don't know, I just couldn't square that one in my head.

jen20 · on Nov 28, 2020

> What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos)

Raft and Paxos are not gossip protocols - they are consensus protocols.

joneholland · on Nov 28, 2020

Fair. What I meant to say is “hand rolling a way to have consensus on a shared piece of data” by implementing it with a naive gossip system.

bonfire · on Nov 28, 2020

If you want to eat in a restaurant it's better not to look in the kitchen :-|

incognito_limey · on Nov 29, 2020

As someone who is in another high growth start up, one of the fastest in the world (not hyperbole) I wish I could upvote this specific comment more: "The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there."

I came to the realization about a year ago, that there are definitely talent tiers and unless you are working super hard at recruiting and paying top dollar, the cliff edge approaches fast and is very very steep.

hintymad · on Nov 28, 2020

AWS frontend services are usually implemented in Java. If Kinesis' frontend does too, then it's surprising that the threads created by a frontend service would exceed the OS limit. This tells three possibilities: 1. Kinesis did not impose a max thread count in their app, which is a gross omission; 2. Or there was a resource leak in their code. 3. Each of their frontend instances stored all the placement information of backend servers, which means their frontend was not scalable by backend size.

joneholland · on Nov 28, 2020

My understanding is that every front end server has at least one connection (on a dedicated thread) to every other front end server.

Assuming they have say, 5000 front end instances, thats 5000 file descriptors being used just for this, before you are even talking about whatever threads the application needs.

It’s not surprising that they bumped into ulimits, though as part of OS provisioning, you typically have those tuned for workload.

More concerning is the 5000 x 5000 amount of open tcp sessions across their network to support this architecture. This has to be a lot of fun on any stateful firewall it might cross.

foota · on Nov 29, 2020

The tcp connections are probably not an issue, working in cloud it's never something I've seen worried about, so maybe the architecture doesn't have that limitation?

joneholland · on Nov 29, 2020

Any time you traverse a firewall or a NAT you’ll run into issues if you have a massive amount of open sessions and you are dealing with a stateful FW.

Just because you haven’t encountered it doesn’t mean it’s not there, it’s probably just properly tuned and balanced for the load.

foota · on Nov 29, 2020

Yeah, I believe that. I'm not sure this applies to something like AWS though, where firewall like capability is provided via a layer spread over thousands or greater of instances.

joneholland · on Nov 29, 2020

The virtual firewalls running on the virtual networks in AWS for their customers are not the same as the layer 2 firewalls that exist in their data centers internally.

fafner · on Nov 28, 2020

Yep. Having each front end needing to scale with the overall size of the front end sounds is obviously going to hit some scaling limit. It's not clear to me from the summary why they are doing that. If it's for the shard-map or cache? Maybe if the front end is stateful that's a way to do stickiness? Seems we can only guess.

karmakaze · on Nov 28, 2020

Kinesis was the worst AWS tech I've ever used. Splitting a stream into shards doesn't increase throughput if you still need to run the same types/number of consumers on every shard. The suggested workaround at the time was to use larger batches and add latency to the processing pipeline.

hedora · on Nov 28, 2020

I’ve noticed a strong tendency for older systems to accumulate “spaghetti architecture”, where newer employees add new subsystems and tenured employees are blind to the early design mistakes they made. For instance, in this system, it sound like they added a complicated health check mechanism at some point to bounce faulty nodes.

Now, they don’t know how it behaves, so they’re afraid to take corrective actions in production.

They built that before ensuring that they logged the result of each failed system call. The prioritization seems odd, but most places look at logging as a cost center, and the work of improving it as drudgery, even though it’s far more important than shiny things like automatic response to failures, and also takes a person with more experience to do properly.

throwaway189262 · on Nov 28, 2020

I can't remember which db, but somebody a while back claimed that one of Amazon's "infinitely scalable" dbs was tons of abstraction on top a massive heap of MySQL instances.

I don't trust anything outside core services on AWS. Regardless of whether the rumor I heard is true, it's clear they appreciate quantity over quality.

rubiquity · on Nov 28, 2020

Disclosure: I work at AWS, possibly near the system you're describing. Opinions are my own.

If we're talking about the same thing then I think casting stones just because it is based on MySQL is severely misguided. MySQL has decades of optimizations and this particular system at Amazon has solved scaling problems and brought reliability to countless services without ever being the direct cause of an outage (to the best of my knowledge).

Indeed, MySQL is not without its flaws but many of these are related to its quirks in transactions and replication which this system completely solves. The cherry on top is that you have a rock solid database with a familiar querying language and a massive knowledge base to get help from when needed. Oh, and did I mention this system supports multiple storage engines besides just MySQL/InnoDB?

I for one wish we would open source this system though there are a ton of hurdles both technical and not. I think it would do wonders for the greater tech community by providing a much better option as your needs grow beyond a single node system. It has certainly served Amazon well in that role and I've heard Facebook and YouTube have similar systems based on MySQL.

To further address your comment about Amazon/AWS lacking quality: this system is the epitome of our values of pragmatism and focusing our efforts on innovating where we can make the biggest impact. Hand rolling your own storage engines is fun and all but countless others have already spent decades doing so for marginal gains.

panarky · on Nov 28, 2020

> the system you're describing

Why all the mysterious cloak-and-dagger?

rubiquity · on Nov 28, 2020

It is possible that the person I replied to could be talking about an entirely different piece of software. Another reason is that the specifics of the system I'm referring to are not public knowledge.

The more important takeaway is that building on top of MySQL/InnoDB is perfectly fine and that is what I tried to emphasize.

offbyone · on Nov 29, 2020

Especially if it's the one I'm thinking of, too. It's a damn solid architecture.

WookieRushing · on Nov 28, 2020

This is really common and works pretty awesomely. MySQL is extremely battle tested and there’s lots of experts out there for it.

FB built a similar system to maintain their graph: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...

It’s a ton of tiny DBs that look like one massive eventually consistent DB

newscom59 · on Nov 28, 2020

What’s the problem with this? Anything that’s scalable is “just an abstraction” on top of a heap of shards/processes/nodes/data centers/whatever.

ignoramous · on Nov 28, 2020

I believe someone allegedly from AWS said DynamoDB was written on top of MySQL (on top of InnoDB, really) [0] which would be similar to what Uber and Pinterest did as well. [1]

[0] https://news.ycombinator.com/item?id=18426096

[1] https://news.ycombinator.com/item?id=12605026

blandflakes · on Nov 28, 2020

At more than one of my jobs, this has been exactly the right way to horizontally scale relational workloads and has gone very well.

derwiki · on Nov 28, 2020

Sounds like Dynamo

rantwasp · on Nov 29, 2020

if this is dynamo, this is just a bunch of BS. you cannot get the latencies DDB has with a ton of abstractions on top of MySql.