response = fetch(url, payload)
if (response.error) ...
but when I ask folks what is going to happen when the fetch does NOT error out but instead takes 10 seconds, 99% of them look at me like I'm speaking gibberish.
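For the record, the usual fix at the call site is to make "slow" turn into an error you can actually handle. A minimal sketch using Java's built-in HttpClient (URL and limits are made up):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.net.http.HttpTimeoutException;
    import java.time.Duration;

    public class FetchWithTimeout {
        // One shared client; connectTimeout bounds how long we wait for the TCP connection.
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        static String fetch(String url) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .timeout(Duration.ofSeconds(2))   // total time we're willing to wait for the response
                    .GET()
                    .build();
            try {
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (HttpTimeoutException e) {
                // The "takes 10 seconds" case now shows up here as an error you must handle,
                // instead of silently stalling the caller.
                throw new RuntimeException("upstream too slow: " + url, e);
            }
        }
    }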
This is the single biggest reason for cascading failures I see.
Netflix has dealt with it via their Hystrix library (open source). These days it seems like a service mesh (e.g. Consul with its sidecar proxies, or Istio/Envoy) is the way to go. It encapsulates all of the fancy logic (like circuit breakers and flow control) so your service doesn't have to.
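If you've never looked inside one of those, the core of a circuit breaker is small. A toy sketch (thresholds invented; real implementations add half-open probing, sliding windows, per-endpoint state, etc.):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    // Toy circuit breaker: after N consecutive failures, fail fast for a cool-down period
    // instead of piling more requests onto a struggling dependency.
    class CircuitBreaker {
        private final int failureThreshold;
        private final Duration openDuration;
        private int consecutiveFailures = 0;
        private Instant openedAt = null;

        CircuitBreaker(int failureThreshold, Duration openDuration) {
            this.failureThreshold = failureThreshold;
            this.openDuration = openDuration;
        }

        synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
            if (openedAt != null && Instant.now().isBefore(openedAt.plus(openDuration))) {
                return fallback.get();               // circuit is open: don't even try
            }
            try {
                T result = action.get();             // the action should carry its own timeout
                consecutiveFailures = 0;
                openedAt = null;
                return result;
            } catch (RuntimeException e) {
                if (++consecutiveFailures >= failureThreshold) {
                    openedAt = Instant.now();        // trip the breaker
                }
                return fallback.get();
            }
        }
    }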
> Or is it time someone started work on a distributed operating system?
For stuff like this, we've had it since the '80s: Erlang's (and its sleek offspring, Elixir's) BEAM VM. A virtual machine with concurrent, parallel, and distributed systems in mind? Check. A standard library containing batteries-included solutions to most design and technical challenges you'll run into while building such systems? Check. Tooling for stuff like deployments, diagnostics of running systems, and the ability to pull open a REPL for hands-in-the-meat debugging? Check, Check, Check.
That's kind of what HP NonStop is, a distributed operating system operating as a huge cluster.
If you followed their coding practices and used their native libraries, you could almost always do things like freeze a process, move it to an entirely different CPU (which could be a totally different physical server), then restart it without losing the work in progress. Processes could auto-restart and resume from the last checkpoint, you could add more processes to handle the messages in the queue, and there were all kinds of niceties built into the OS and layered services that everyone keeps reinventing.
> So do we all have to keep reinventing these wheels, but only after a production outage?
We live in a world where the programmers' "consensus" is that checked exceptions are bad and we need to remove them from Java. People generally just don't care about anything except the happy path.
That's not the reason to remove checked exceptions from Java. The reason is that exceptions don't compose - they don't play well with functional style code which otherwise works pretty well in Java.
Fun fact: you can actually parameterize functions and types over exception types in Java, <T extends Throwable> ... throws T will type check as expected.
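A minimal sketch of what that looks like (names invented):

    import java.io.IOException;

    // A functional interface whose checked exception type is a parameter.
    @FunctionalInterface
    interface ThrowingFunction<A, B, E extends Exception> {
        B apply(A a) throws E;
    }

    class Demo {
        // The lambda's exception type flows through: this method throws whatever the lambda throws.
        static <A, B, E extends Exception> B applyOnce(ThrowingFunction<A, B, E> f, A arg) throws E {
            return f.apply(arg);
        }

        public static void main(String[] args) throws IOException {
            // Type checks as expected: E is inferred as IOException, so the caller must declare/handle it.
            String line = applyOnce((String path) -> {
                if (path.isEmpty()) throw new IOException("no path");
                return "contents of " + path;
            }, "/tmp/example");
            System.out.println(line);
        }
    }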
Of course, if you solve a problem using generics, you will now have 2 problems instead of 1...
Does it work for a variable list of things, or would supporting two different checked exceptions require two generic params? If it's one per param I think that's neat but probably limited in use to "your exception can contain a generic value", like `throw new InsertFailedException(value)`.
I think you might be able to trick the type checker to accept union types, but I'm not sure. I know intersection types are possible, but they are not really useful for exceptions.
Regarding practicality - I've used this feature when implementing a visitor class API, so it definitely has some use cases.
The biggest problem is that all it takes is a single method in the chain which does not support this pattern (think java.util.stream). For internal code it's pretty easy to decorate all functions that take callback lambdas, etc.
And stuff like the Stream api does not use these generics, so you end up wrapping exceptions in RuntimeException anyway, which... again defeats the point of checked exceptions.
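i.e. you end up writing something like this sketch, and the checked-ness is gone by the time it crosses the Stream boundary:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    class WrapExample {
        static List<String> firstLines(List<Path> paths) {
            return paths.stream()
                    .map(p -> {
                        try {
                            return Files.readAllLines(p).get(0);
                        } catch (IOException e) {
                            // Stream.map takes a Function that declares no checked exceptions,
                            // so we have to smuggle the failure out as an unchecked one.
                            throw new UncheckedIOException(e);
                        }
                    })
                    .toList();
        }
    }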
"So do we all have to keep reinventing these wheels, but only after a production outage?"
Lotta cynical replies, and mine is going to sound like one of them at first, but I actually mean it in a relatively deep and profound way: Time is hard. You can even see it in pure math, where Logic is all fun and everyone's having a great time being clever and making all sorts of exciting systems and inferences in those systems... and then you try to build Temporal Logic and all the pretty just goes flying out the door.
Even "what if the reply takes ten seconds" is the beginning. By the very nature of the question itself I can infer the response is expected to be small. What if it is large? What if it might legitimately take more than ten seconds to transfer even under ideal circumstances, but you need to know that it's not working as quickly as possible? Is your entry point open to the public? How does it do with slowloris attacks [1]? What if your system simply falls behind due to lack of resources? The difference between 97% capacity and 103% capacity in your real, time-bound systems can knock your socks off in ways you'd never model in an atemporal system that ignored how long things take to happen.
Programming would be grungy enough even if we didn't have these considerations, but I'm not even scratching the surface on the number of ways that adding time as a real-world consideration complexifies a ton of things. Our most common response is often just to ignore it. This is... actually often quite rational, a lot of the failure cases can be feasibly addressed by various human interventions, e.g., while writing your service to be robust to "a slow internal network" might be a good idea, there's also a sense in which the only real solution is to speed up the internal network. But still, time is always sitting there crufting things up.
One of my favorites is the implicit dependency graph you accidentally start creating once your business systems guys start doing "daily processes" of this and that. We're going to do a daily process to run the bills, but that depends on the four daily dumps that feed the billing process all having been done first. By the way, did you check that the dumps are actually done and not still in progress as you're trying to use them? And those four daily dumps each have some other daily processes behind them, and if you're not very careful you'll create loops in those processes which introduce all sorts of other problems... in the end, a set of processes that in perfect atemporal logic land wouldn't be too difficult to deal with becomes something very easy to sleepwalk into a nightmare world, where your dump is scheduled to run between 2:12 and 2:16 and it damned well better not fail for any reason, in your control or out of it, or we're not doing billing today. (Or even the nightmare world where your dump is scheduled to run after 3pm but before 1pm every day... that is, these dependency graphs don't have to get very complicated before literally impossible constraints start to appear if you're not careful!) Trying to explain this to a large number of teams at every level of engineering capability (frequently going all the way down to "a guy who distrusts and doesn't like computers who, against his will, maintains a spreadsheet, which is also one of the vital pillars of our business") is the sort of thing that may make you want to consider becoming a monk.
I believe that, in terms of firm theory and how technology plays into the organizational side, we're reaching the limits of current paradigms. Over the last three to four decades, transaction costs grew (more regulations on personal data, more complicated cross-border contracts as services became dominant in most economies - free trade agreements typically cover goods but not services) while coordination costs fell (most business-facing software can now be used as a metered service in the browser). This favored growing corporations.
I've seen in my lifetime conglomerates fall out of favor ('synergies' failed to materialize) and then rise up again but this time in the computer technology sector - are you in the Apple, Microsoft or Google corporate tech garden?
But now interest rates are back, and investors can no longer park wealth in businesses that grow revenue but not profit. So ballooning complexity can't just be dealt with by throwing bodies (and pay raises) at the problem anymore.
I hope this leads to more niche player offerings and less SaaS where small local outfits are just independent sales fronts for the cloud borgs.
Once. It's more work once instead of over and over. That's the point of operating systems, standard libraries, and modules!
I see this weird back-lash in modern development against having common, standard platforms. I suspect it comes from the Python and JavaScript world, where having "no batteries included" is seen as a good thing, instead of a guaranteed mess of dozens of half-complete incompatible frameworks.
I'm coming from the perspective of Windows and comparing it to, say, Azure or AWS. All three have some concepts of access control, log collection, component systems, processes, etc...
But all three are proprietary. Kubernetes goes a long way, but it isn't a user-mode system that can be directly accessed from code. Compare with Service Fabric, which has a substantial SDK component that integrates into the applications.
As an example, here's a really basic thing that is actually absurdly difficult to solve well: web application session state.
If you have sticky load balancing using cookies, then the session state is accessed on one VM something like 99.99% of the time... except for that 0.01% of the time when it isn't. This could be due to a restart, load rebalancing, or whatever.
If you put the session state into something external like Redis, then a zone-redundant deployment will eat a ~1ms delay on every page render, every time.
Service Fabric uses a model where it keeps three replicas of the state: one in the original web server, and two replicas elsewhere. This way, reads are in process on the same VM most of the time, resulting in nanosecond latencies. Writing the state can occur asynchronously after the page response is already being sent.
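I'm not claiming this is Service Fabric's actual API, but the read-local / write-behind idea is roughly this (the ReplicatedStore interface is a stand-in for whatever holds the other replicas):

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: reads hit an in-process map; writes update the map immediately
    // and replicate to an external/replicated store off the request path.
    class SessionStore {
        interface ReplicatedStore {                       // stand-in for Redis, replicas, etc.
            void put(String sessionId, Map<String, String> state);
            Map<String, String> get(String sessionId);
        }

        private final Map<String, Map<String, String>> local = new ConcurrentHashMap<>();
        private final ReplicatedStore remote;

        SessionStore(ReplicatedStore remote) { this.remote = remote; }

        Map<String, String> read(String sessionId) {
            // Fast path: in-process. Slow path (restart, rebalance): fall back to the replicas.
            return local.computeIfAbsent(sessionId, remote::get);
        }

        void write(String sessionId, Map<String, String> state) {
            local.put(sessionId, state);                  // the page can render from this immediately
            CompletableFuture.runAsync(() -> remote.put(sessionId, state)); // replicate asynchronously
        }
    }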
I'd like to see concepts like this, along with all sorts of service-to-service communication patterns, consolidated into an "operating system like platform" designed for the mid-2020s clouds instead of 1990s server farms.
We’re getting there, but it takes time to agree on what the best implementation of a reinvented wheel looks like. A good example is OpenTelemetry, which is an obvious idea in hindsight but looks like it will take about a decade to ship.
Or how we move the goalposts when we reach a goal, for example Kubernetes standardized certain aspects of cloud but now that we have that, instead of celebrating we bemoan its complexity and lack of utility at solving actual application or organization challenges such that we still need to use cloud APIs plus container images plus all this other complexity. But hey, we did solve the problem of distributing code to run on machines, it’s just in hindsight it doesn’t seem like it was that hard? We adjust pretty quick to the “new normal” when it’s not even a decade yet since Docker and Kubernetes appeared on the scene.
> I see this weird back-lash in modern development against having common, standard platforms. I suspect it comes from the Python and JavaScript world, where having "no batteries included" is seen as a good thing, instead of a guaranteed mess of dozens of half-complete incompatible frameworks.
Kind of odd to have Python included there as Python's motto for years was (is?) literally "Batteries Included".
" I see this weird back-lash in modern development against having common, standard platforms. "
I think it has always been that way.
It comes down to personality types. Many devs I've met think that the implementation they wrote themselves is simpler and easier to understand than learning a platform API or an existing library.
They tend to shrug it off when I point out security or other potential problems.
At least in web development, rolling your own is usually the pragmatic choice. It won't break opaquely upon update, you can fix it yourself, and it only does what you need. Library and platform updates have a much higher chance of breaking something because of the large impact surface, feature updates being conflated with security updates, and insufficient testing; and such breakages are much harder to resolve because they are a black box to you. Really nothing to do with personalities.
As a capacity planner I tried to argue in favor of tools like Hystrix being built into our middleware services, because when we had large IPPV events, cascading failures were the biggest risk to our availability. It happened because 99% of the time our services could process any queues before downstream timeouts occurred, but during high-volume events the queues would grow due to nearly instantaneous demand (normally within 5 minutes of the PPV event start time), causing the queues to go deeper than our timeouts. Combine that with the queues not being durable if a process restart was needed, and things got real ugly real fast under extreme load. Automatic retries + deep queues + short timeouts = service issues that take hours to unwind, often requiring a coordinated cold restart of the entire middleware pipeline and millions in lost revenue.
To compensate we had to scale our systems for the absolute instantaneous peak demand because being legacy systems (pre containers) with a lot of rigid plumbing in them we couldn't just scale on demand. Things did not degrade gracefully once you hit the timeout limits in the composite API calls.
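A sketch of the kind of guard we were arguing for in front of each queue (numbers invented): bound the queue, and drop work whose caller has already given up, so the backlog can never outlive the timeouts.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch: bounded queue that sheds work whose caller has already timed out.
    class SheddingQueue {
        record Job(Runnable work, Instant enqueuedAt) {}

        private final BlockingQueue<Job> queue = new ArrayBlockingQueue<>(1_000); // bounded, not "infinite"
        private final Duration callerTimeout = Duration.ofSeconds(2);             // match the upstream timeout

        boolean offer(Runnable work) {
            // Reject immediately when full: fail fast instead of building a deep queue.
            return queue.offer(new Job(work, Instant.now()));
        }

        void workerLoop() throws InterruptedException {
            while (true) {
                Job job = queue.take();
                if (Duration.between(job.enqueuedAt(), Instant.now()).compareTo(callerTimeout) > 0) {
                    continue; // the caller timed out already; doing this work only feeds the retry storm
                }
                job.work().run();
            }
        }
    }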
True, but such foundational code is rarely kept synchronous in prod; you'd usually have a coroutine (or equivalent) and await the reply, or time out after X seconds. Installing a random, kinda crappy, third-party service like Consul seems like overkill imo.
Depending on your request rate and how much higher the latency is than usual, I could see wanting a tool with full-system visibility rather than just independent timeouts at the call site.
Some examples of holistic problems:
How many partially-processed requests can you hold in memory at once as they pile up under that delay?
If the downstream service suddenly fulfills all the pending requests at once, does the thundering herd cause your service to overload other systems?
FWIW, the single paragraph about "fair allocation" could be its own thesis. This gets into quality of service, active queue management, leaky buckets, deficit round robin, and so on ad infinitum. I did quite a bit of work on this on multiple projects at multiple companies, and it's still one of the very few algorithmic areas that I still think about in retirement. I highly recommend following up on some of the terms above for some interesting explorations.
If your clients don't implement an increasing backoff retry with jitter (a sketch of which follows the lists below), you can fake it by making your server begin to randomly and increasingly time out connections or wait to accept them. You can do it a few different ways.
For already-open connections:
1) Keep the connection open and respond with a tiny amount of data every once in a while, so the connection doesn't time out, but make it long enough that a complete request will take forever. Applications may still re-connect if a successful request-response doesn't happen within a timeout.
2) Keep the connection open but don't respond. The application may time out the connection and re-connect if it receives no data.
3) Drop the connection but don't let the tcp/ip stack send a RST, FIN, or anything else. The application's connection will be timed out pretty soon, either by the OS stack or an application timeout.
4) Respond to requests with HTTP response codes that will make the client retry. As long as this only happens at the load-balancer level, it will still remove pressure from your application layer.
For new connections:
1) Play hard to get. During the 3-way handshake, respond with weird states that won't cause the connection to drop, but will make the client keep trying to connect, like the handshake is still trying to succeed but experiencing loss.
2) Do a normal 3-way handshake, but wait forever on each step to drag it out.
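For reference, the client-side behavior all of those tricks are trying to simulate is just retry with exponential backoff and jitter. A minimal sketch (limits made up):

    import java.time.Duration;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.Supplier;

    // Sketch: retry with exponential backoff and "full jitter" (sleep a random amount
    // between 0 and the current cap) so retries from many clients don't synchronize.
    class Backoff {
        static <T> T retry(Supplier<T> call, int maxAttempts) throws InterruptedException {
            Duration cap = Duration.ofMillis(100);
            for (int attempt = 1; ; attempt++) {
                try {
                    return call.get();
                } catch (RuntimeException e) {
                    if (attempt >= maxAttempts) throw e;
                    long sleepMs = ThreadLocalRandom.current().nextLong(cap.toMillis() + 1);
                    Thread.sleep(sleepMs);
                    cap = cap.multipliedBy(2);        // 100ms, 200ms, 400ms, ... (cap it in real code)
                }
            }
        }
    }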
All of these options are a terrible idea for user-facing applications. They can cause more issues on your server side due to connections in wait state, and your clients will just see requests hanging, stalling, etc. You could just pretend the problem is with their ISP rather than your end. Or you could throw a 429 or 503 to try to stop the retries, but then they really know it's your fault.
The best option is to just add capacity. Back in the day we couldn't do that; you had the servers you had. So when all 400 machines in our colo were thrashing, we just had to shed load randomly, sometimes using the tricks above. Now with the cloud you can magically add capacity anytime, automatically. Much better option.
What's often much better is to add another caching layer before the application layer and just return stale responses during pressure events. But sometimes you can't, so you resort to either a 429 or 503, or the dirty tricks above.
And in addition, we are investing in the graceful-js library to handle 429 and 523 codes returned by the Aperture system - https://github.com/fluxninja/graceful-js
I learned most of this the hard way as an SRE. How systems behave at and over their limits is far more important than how they behave under them. A system that is 'forgiving' (aka resilient) is worth its weight in gold. Otherwise you get into downward spirals with systems that can't recover unless they are rebooted. Great read!
I agree with all this. "Metastable Failures in Distributed Systems" (2021) is another good read if you're facing problems in this vein. https://news.ycombinator.com/item?id=28750103
From my armchair, I'm not sure that "random drop" actually does decrease latency. Most clients will just repeat the request, resulting in an "effective latency" of however many times it gets randomly dropped. The queue is now implicit, and I'd guess that it's less efficient to carry out several request/drop cycles than to just leave the client in a straightforward queue.
Ever since I heard of Little[1] it's been surprising to me how few working programmers know that queuing theory is basically a solved problem and has been for longer than most working programmers have been alive.
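For anyone who hasn't run into it: Little's Law just says the average number of requests in the system equals the arrival rate times the average time each one spends there. A worked example (numbers made up):

    L = \lambda W
    \lambda = 500\ \text{req/s},\quad W = 0.2\ \text{s}\ \Rightarrow\ L = 500 \times 0.2 = 100\ \text{requests in flight}
    \text{same arrival rate, downstream slows to } W = 10\ \text{s}\ \Rightarrow\ L = 5000\ \text{requests queued or in flight}

Which is exactly the "how many partially-processed requests pile up in memory" question from upthread.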
Ah. That's classic queuing theory. It has a problem.
The early work on network congestion came from Kleinrock, who wrote the classic "Queuing theory". Kleinrock did his PhD thesis at MIT on Western Union Plan 55-A, a telegram switching system which can be thought of as Sendmail built out of relays and paper tape. Message switches look like a classic arrival rate / service rate problem. They have little customer-level back-pressure; you can send an email regardless of whether the transmission system is backed up.
So an open-loop analysis works fine.
The ARPANET had flow control on each link. Nothing could send a message until there was a buffer ready to receive it. So no packets were lost due to congestion. All overload is stopped at the sender. That approach is immune to congestion collapse, but not to lockup.
Then came the pure datagram networks, and TCP/IP. Anybody can send an IP datagram any time they want to, regardless of the network status. So overloads and packet loss are possible. TCP uses retransmission to hide that, imperfectly. This introduces a new set of problems, some of which were non-obvious at the time.
Classical queuing theory is open-loop. Arrival rate is considered to be independent of wait time. In the real world, it's not. Not even for store cashiers. If arrival rate exceeds service rate, the line length does not really grow without bound except in desperate situations. Customers leave without buying and take their business elsewhere. If there is no cashier idle time, the line length will increase only until the customer loss rate increases to match. Many retail managers do not get this.
I coined the term "congestion collapse" in 1984.[1] In 1985, I wrote, in my "On Packet Switches With Infinite Storage" RFC, "We have thus shown that a datagram network with infinite storage, first-in-first-out queuing, and a finite packet lifetime will, under overload, drop all packets."[2]
Until then, people had been doing mostly classic queuing theory analysis. That's not enough.
Back then, memory was very expensive, and people were obsessing over how much memory was needed in a router. It was felt that adding more memory would solve the congestion problem. I pointed out that wouldn't work. Now that memory is cheap, that problem appears as "bufferbloat".
Those two RFCs started people thinking about this as a closed-loop problem. Van Jacobson later did much work in this area. I was out of it by 1986. Decades later, people are still fussing with the feedback control problems implicit in that result.
As the original poster points out here, this comes up in other situations, especially chains of services. If you get congestion in the middle of the chain, things will not go well, and there's a good chance of something that looks like congestion collapse, where throughput goes to nearly zero. It's better to push congestion out towards the endpoints.
We still don't have good solutions to congestion in the middle of a pure datagram network. What saved the Internet was fiber optic backbones and cheap long-haul bandwidth. There was a period in the 1990s when traffic had built up but backbone bandwidth was still expensive. The long-haul links choked and the Internet had "storms". There used to be an "Internet Weather Center", where you could check on how congested the major routers were.
I also coined the term "fair queuing". That can be a useful technique for services well above the IP datagram level. Don't use a FIFO queue; queue based on who's sending. If some source is sending too much, let them compete against themselves for the service. Others can still get through. This provides resilience against denial of service attacks.
I put that on a web site of mine some years ago, and for two months someone was pounding on it with useless requests without affecting response time for anybody else. (It wasn't an attack, just ineptitude using a public API.)
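A toy sketch of that per-sender round-robin idea (real fair queuing such as deficit round robin is more careful about weights and message sizes):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Toy fair queuing: one bounded queue per sender, senders served round-robin.
    // A sender flooding requests mostly competes with itself; others still get through.
    class FairQueue<T> {
        private static final int PER_SENDER_LIMIT = 100;
        private final Map<String, Queue<T>> perSender = new HashMap<>();
        private final Deque<String> rotation = new ArrayDeque<>();   // round-robin order of senders

        synchronized boolean enqueue(String sender, T item) {
            Queue<T> q = perSender.computeIfAbsent(sender, s -> {
                rotation.addLast(s);                                  // new sender joins the rotation
                return new ArrayDeque<>();
            });
            if (q.size() >= PER_SENDER_LIMIT) return false;           // shed only the heavy sender's excess
            return q.offer(item);
        }

        synchronized T dequeue() {
            while (!rotation.isEmpty()) {
                String sender = rotation.pollFirst();
                Queue<T> q = perSender.get(sender);
                T item = (q == null) ? null : q.poll();
                if (item == null) {                                   // sender has nothing queued: drop it
                    perSender.remove(sender);
                    continue;
                }
                rotation.addLast(sender);                             // back of the line for next time
                return item;
            }
            return null;                                              // nothing queued at all
        }
    }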
We have a team of 20 engineers currently working on solving this problem in the context of API requests and service chains. Do you know JMS @ Penn? Asking because he did some work in ATM networks, QoS etc. He is advising us on the project.
Very interesting blog post! Our team has been working intensively in this area for the last couple of years - flow control, load shedding, controllability (PID control), and so on.
We would love feedback from folks reading this blog post!
Disclaimer: I am one of the co-authors of the Aperture project. There are several interesting ideas we have built into this project, and I will be happy to dive into the technical details as well.
Not mine. I use circuit breakers. Although in practice the circuit breaker will rarely be hit under high load, because Kubernetes will already be firing up new pods by that time.