> This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
I remember my first experience realizing the client retry logic we had implemented was making our lives way worse. Not sure if it's heartening or disheartening that this was part of the issue here.
Our mistake was resetting the exponential backoff delay whenever a client successfully connected and received a response. At the time, some but not all responses were degraded and extremely slow, while the request that checked the connection was unaffected. So a client would time out, retry for a while backing off exponentially, eventually reconnect successfully, and then after the next failure start aggressively retrying again. System dynamics are hard.
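For what it's worth, here is a minimal sketch (not our actual code; the names and numbers are made up) of that failure mode and one mitigation: keep the backoff state across requests, and only reset the delay after a short streak of healthy responses rather than after a single lucky success.

```python
import random
import time


class Backoff:
    """Retry state that persists across requests. The bug described above was
    resetting `delay` after any single success; this sketch waits for a short
    streak of healthy responses first."""

    def __init__(self, base=1.0, cap=300.0, healthy_streak=3):
        self.base = base
        self.cap = cap
        self.healthy_streak = healthy_streak
        self.delay = base
        self.ok_streak = 0

    def on_success(self):
        self.ok_streak += 1
        # Buggy version: self.delay = self.base   (reset on *any* success)
        if self.ok_streak >= self.healthy_streak:
            self.delay = self.base

    def on_failure(self):
        self.ok_streak = 0
        # Full jitter: sleep somewhere in [0, delay] so clients don't retry in lockstep.
        time.sleep(random.uniform(0, self.delay))
        self.delay = min(self.cap, self.delay * 2)
```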
And they have to be actually tested. Most of them are designs based on nothing but uninformed intuition. There is an art to back pressure and to keeping pipelines optimally utilized. Queueing doesn't work the way you think it does until you've really studied it.
Why is this hard, and why can't it just be written down somewhere as part of the engineering discipline? This aspect of systems in 2021 really shouldn't be an "art."
It is, in itself, a separate engineering discipline, and one that cannot really be practiced analytically unless you understand the behavior of the individual interacting pieces really well. Most don't, and don't care to.
It is something that needs to be designed and tuned in place; it evades "getting it right" at design time without real-world feedback.
And you also simply have to reach a fairly large scale before it matters at all. At smaller scales, the excess capacity you carry (because capacity only comes in coarse increments) absorbs most of the need for it, and you can get away with wasting a bit of money on extra headroom instead.
It is also sensitive to small changes, so a textbook example might be implemented with one small detail wrong that won't show itself until a critical failure is already happening.
It is usually the site of the highest-complexity interactions in a business's infrastructure, which are not easily distilled into a formula. (And most people just aren't educationally prepared for nonlinear dynamics.)
It absolutely is written down. The issue is that the results you get from modeling systems using queuing theory are often unintuitive and surprising. On top of that it's hard to account for all the seemingly minor implementation details in a real system.
During my studies we had a course where we built a distributed system and had to model its performance mathematically. It was really hard to get the model to match the reality and vice versa. So many details are hidden in a library, framework, or network adapter somewhere (e.g. buffers or packet fragmentation).
We used the book "The Art of Computer Systems Performance Analysis" (R. Jain), but I don't recommend it. At least not the 1st edition which had a frustrating amount of serious, experiment-ruining errata.
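To illustrate how unintuitive the queueing results mentioned above can be: even in the simplest textbook model (M/M/1, my choice of example, not something from the course or the book), average latency doesn't grow linearly with load, it blows up as utilization approaches 1.

```python
# Average time in system for an M/M/1 queue: W = 1 / (mu - lambda).
service_rate = 100.0  # requests/second one server can handle (assumed number)

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    arrival_rate = utilization * service_rate
    avg_latency = 1.0 / (service_rate - arrival_rate)
    print(f"utilization {utilization:.2f}: average latency {avg_latency * 1000:.0f} ms")
```

Going from 50% to 99% utilization multiplies average latency by 50x in this toy model, and real systems with retries layered on top behave even worse.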
Think of other extremely complex systems and how we’ve managed to make them stable:
1) airplanes: they crashed, _a lot_. We used data recorders and stringent processes to make safe air travel commonplace.
2) cars: so many accidents, and so much accident research. The solution comes after the disaster.
3) large buildings and structures: again, the product of time, attempts, failures, research, and solutions.
If we really want to get serious about this (and I think we do) we need to stop reinventing infrastructure every 10 years and start doubling down on stability. Cloud computing, in earnest, has only been around a short while. I’m not even convinced it’s the right path forward, just happens to align best with business interests, but it seems to be the devil we’re stuck with so now we need to really dig in and make it solid. I think we’re actually in that process right now.
But what's a good alternative then? What if the internet connection has recovered, but you're already at, say, a 4-minute retry delay? Would you just make your users stare at a spinning loader for another 8 minutes?
Or tell them directly: "We have screwed up. The service is currently overloaded. Thank you for your patience. If you still haven't given up on us, try again at a less busy time of day. We are very sorry."
There are several options, and finding the best one depends a bit on estimating the behaviour of your specific target audience.
I first learned about exponential backoff from TCP, and TCP has a lot of other smart ways to manage congestion. You don't need to implement all of those ideas in your client logic, but you can still do a lot better than basic exponential backoff.
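For example, two ideas that are cheap to add on the client side are jitter (so retries don't arrive in synchronized waves) and a retry budget (stop retrying when retries start to dominate traffic, loosely analogous to how TCP reacts to congestion rather than blindly resending). A rough sketch; the parameter names and values are mine, not from any particular library:

```python
import random


class RetryPolicy:
    def __init__(self, base=0.5, cap=60.0, budget_ratio=0.1):
        self.base = base                   # first retry delay, in seconds
        self.cap = cap                     # upper bound on any single delay
        self.budget_ratio = budget_ratio   # retries allowed as a fraction of traffic
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def delay(self, attempt):
        # "Full jitter": pick uniformly in [0, min(cap, base * 2^attempt)].
        return random.uniform(0, min(self.cap, self.base * 2 ** attempt))

    def may_retry(self):
        # Stop retrying once retries exceed a fixed share of total traffic,
        # so a widespread outage doesn't turn into a retry storm.
        if self.retries >= self.budget_ratio * max(self.requests, 1):
            return False
        self.retries += 1
        return True
```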
The problem shows up at the central system while the peripheral devices are the ones causing it. And those systems belong to very different organizations with very different priorities. I still remember how difficult the discussion was with the 3G base station team, persuading them to implement exponential backoff with some random factor when connecting to the management system.