> This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
I remember my first experience realizing the client retry logic we had implemented was making our lives way worse. Not sure if it's heartening or disheartening that this was part of the issue here.
Our mistake was resetting the exponential backoff delay whenever a client successfully connected and received a response. At the time, some but not all responses were degraded and extremely slow, while the request that checked the connection was unaffected. So a client would time out, retry for a while backing off exponentially, eventually reconnect successfully, and then after the next failure start aggressively retrying again. System dynamics are hard.
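For what it's worth, here is a minimal sketch (not our actual code; the names and numbers are made up) of that failure mode and one mitigation: keep the backoff state across requests, and only reset the delay after a short streak of healthy responses rather than after a single lucky success.

```python
import random
import time


class Backoff:
    """Retry state that persists across requests. The bug described above was
    resetting `delay` after any single success; this sketch waits for a short
    streak of healthy responses first."""

    def __init__(self, base=1.0, cap=300.0, healthy_streak=3):
        self.base = base
        self.cap = cap
        self.healthy_streak = healthy_streak
        self.delay = base
        self.ok_streak = 0

    def on_success(self):
        self.ok_streak += 1
        # Buggy version: self.delay = self.base   (reset on *any* success)
        if self.ok_streak >= self.healthy_streak:
            self.delay = self.base

    def on_failure(self):
        self.ok_streak = 0
        # Full jitter: sleep somewhere in [0, delay] so clients don't retry in lockstep.
        time.sleep(random.uniform(0, self.delay))
        self.delay = min(self.cap, self.delay * 2)
```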
And they have to be actually tested. Most of them are designs based on nothing but uninformed intuition. There is an art to back pressure and to keeping pipelines optimally utilized. Queueing doesn't work the way you think it does until you've really studied it.
Why is this hard, and why can't it just be written down somewhere as part of the engineering discipline? This aspect of systems in 2021 really shouldn't be an "art."
It is, in itself, a separate engineering discipline, and one that cannot really be practiced analytically unless you understand the behavior of the individual interacting pieces really well. Most don't, and don't care to.
It is something that needs to be designed and tuned in place; it evades "getting it right" at design time without real-world feedback.
And you also simply have to reach a fairly large scale before it matters at all. At smaller scales, the excess capacity you carry (because capacity only comes in coarse increments) absorbs most of the need for it, and you can get away with wasting a bit of money on extra headroom instead.
It is also sensitive to small changes, so a textbook example might be implemented with one small detail wrong that won't show itself until a critical failure is already happening.
It is usually the site of the highest-complexity interactions in a business's infrastructure, which are not easily distilled into a formula. (And most people just aren't educationally prepared for nonlinear dynamics.)
It absolutely is written down. The issue is that the results you get from modeling systems using queuing theory are often unintuitive and surprising. On top of that it's hard to account for all the seemingly minor implementation details in a real system.
During my studies we had a course where we built a distributed system and had to model its performance mathematically. It was really hard to get the model to match the reality and vice versa. So many details are hidden in a library, framework, or network adapter somewhere (e.g. buffers or packet fragmentation).
We used the book "The Art of Computer Systems Performance Analysis" (R. Jain), but I don't recommend it. At least not the 1st edition which had a frustrating amount of serious, experiment-ruining errata.
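To illustrate how unintuitive the queueing results mentioned above can be: even in the simplest textbook model (M/M/1, my choice of example, not something from the course or the book), average latency doesn't grow linearly with load, it blows up as utilization approaches 1.

```python
# Average time in system for an M/M/1 queue: W = 1 / (mu - lambda).
service_rate = 100.0  # requests/second one server can handle (assumed number)

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    arrival_rate = utilization * service_rate
    avg_latency = 1.0 / (service_rate - arrival_rate)
    print(f"utilization {utilization:.2f}: average latency {avg_latency * 1000:.0f} ms")
```

Going from 50% to 99% utilization multiplies average latency by 50x in this toy model, and real systems with retries layered on top behave even worse.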
Think of other extremely complex systems and how we’ve managed to make them stable:
1) airplanes: they crashed, _a lot_. We used data recorders and stringent processes to make safe air travel commonplace.
2) cars: so many accidents, and so much accident research. The solution comes after the disaster.
3) large buildings and structures: again, the product of time, attempts, failures, research, and solutions.
If we really want to get serious about this (and I think we do) we need to stop reinventing infrastructure every 10 years and start doubling down on stability. Cloud computing, in earnest, has only been around a short while. I’m not even convinced it’s the right path forward, just happens to align best with business interests, but it seems to be the devil we’re stuck with so now we need to really dig in and make it solid. I think we’re actually in that process right now.
But what's a good alternative then? What if the internet connection has recovered, but you're already at, say, a 4-minute retry delay? Would you just make your users stare at a spinning loader for another 8 minutes?
Or tell them directly: "We have screwed up. The service is currently overloaded. Thank you for your patience. If you still haven't given up on us, try again at a less busy time of day. We are very sorry."
There are several options, and finding the best one depends a bit on estimating the behaviour of your specific target audience.
I first learned about exponential backoff from TCP, and TCP has a lot of other smart ways to manage congestion. You don't need to implement all of those ideas in your client logic, but you can still do a lot better than basic exponential backoff.
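For example, two ideas that are cheap to add on the client side are jitter (so retries don't arrive in synchronized waves) and a retry budget (stop retrying when retries start to dominate traffic, loosely analogous to how TCP reacts to congestion rather than blindly resending). A rough sketch; the parameter names and values are mine, not from any particular library:

```python
import random


class RetryPolicy:
    def __init__(self, base=0.5, cap=60.0, budget_ratio=0.1):
        self.base = base                   # first retry delay, in seconds
        self.cap = cap                     # upper bound on any single delay
        self.budget_ratio = budget_ratio   # retries allowed as a fraction of traffic
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def delay(self, attempt):
        # "Full jitter": pick uniformly in [0, min(cap, base * 2^attempt)].
        return random.uniform(0, min(self.cap, self.base * 2 ** attempt))

    def may_retry(self):
        # Stop retrying once retries exceed a fixed share of total traffic,
        # so a widespread outage doesn't turn into a retry storm.
        if self.retries >= self.budget_ratio * max(self.requests, 1):
            return False
        self.retries += 1
        return True
```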
The problem shows up at the central system while the peripheral devices are the ones causing it. And those systems belong to very different organizations with very different priorities. I still remember how difficult the discussion was with the 3G base station team, persuading them to implement exponential backoff with some random factor when connecting to the management system.