Erlang Scheduler Details (2016) (hamidreza-s.github.io)
151 points by StreamBright on Dec 24, 2018 | 19 comments



This is a great post to know about if you do Erlang/Elixir professionally. There was a conversation back in 2016 that is still relevant today: https://news.ycombinator.com/item?id=11064763


It is so underrated how easy scheduling is with Erlang/Elixir. Plus, the Supervisor/children concept makes it so easy to have a system that runs for years.


I'm a full-time Elixir dev (and fanboy) but don't agree with your second point.

The only thing "easy" about Erlang supervisors is having a short-lived DB network hiccup cascade up your supervisor tree and shut down your app. Clearly, supervisors were designed for a set of problems that don't align with what I imagine most Elixir devs are likely to run into.

A lot of people end up building either a special supervisor with infinite retry + backoff, or baking the logic directly into their process, specifically to avoid triggering the built-in supervisor.
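A minimal sketch of the second approach (the module name, MyExternalService.connect/1 and the delay values are all illustrative):

    defmodule FlakyClient do
      use GenServer

      def start_link(opts) do
        GenServer.start_link(__MODULE__, opts, name: __MODULE__)
      end

      def init(opts) do
        send(self(), :connect)
        {:ok, %{opts: opts, conn: nil, delay: 100}}
      end

      def handle_info(:connect, %{opts: opts, delay: delay} = state) do
        # MyExternalService.connect/1 stands in for whatever flaky
        # dependency you are talking to.
        case MyExternalService.connect(opts) do
          {:ok, conn} ->
            {:noreply, %{state | conn: conn, delay: 100}}

          {:error, _reason} ->
            # Retry forever with exponential backoff (capped at 30s)
            # instead of crashing into the supervisor's restart budget.
            Process.send_after(self(), :connect, delay)
            {:noreply, %{state | delay: min(delay * 2, 30_000)}}
        end
      end
    end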

If you want a system that runs for years, you're almost certainly going to need an external language-agnostic supervisor (god, upstart, supervisord, docker, ...). At that point, the built-in Supervisor advantage is... overstated.

What isn't overstated is the advantage isolated processes have when it comes to managing complexity. It's a fundamental shift in how technical debt can be made more or less a non-factor.


> is having a short-lived DB network hiccup cascade up your supervisor tree and shut down your app

Can you link me to a description of this? I am currently working on software that expects DB transactions on the order of 10/second (so I have literally never had this problem), but am transitioning to a project where I expect 10k-100k tx/second. What causes the hiccup, and why is it not taken care of with sensible defaults in the typical libraries (e.g. Ecto)? I'd love to have an idea of what could be coming down the pike for me (to be defensive).

> an external language-agnostic supervisor (god, upstart, supervisord, docker, ...)

So some of my coworkers are old C++ programmers who were FAANG hotshots and imagine it's possible to do everything in C++. They don't write unit tests, have just learned Go "because it's better", and don't know Kubernetes. There is literally no "high uptime/reliability" story in-house ATM.


More generally:

A process (P) fails. Its supervisor (S1) restarts it. If P fails 3 times in 5 seconds (configurable, but no backoff), then S1 will fail. S1's parent supervisor (S2) will now restart S1, which will restart P, which might still fail. If P fails quickly enough, you'll cascade your "restart max of X times in Y seconds" all the way up to the application, the ultimate supervisor, which itself will shut down after 3 failures.
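For reference, these knobs look like this in Elixir (3 restarts within 5 seconds are the defaults being described):

    # children is the usual list of child specs; each supervisor in the
    # tree gets its own restart budget.
    Supervisor.start_link(children,
      strategy: :one_for_one,
      max_restarts: 3,  # at most 3 restarts...
      max_seconds: 5    # ...within any 5-second window, or this supervisor dies too
    )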

For DBs specifically, DBConnection relies on trapping exits and reconnecting with a backoff (it does not rely on supervisors), but how YOUR code deals with (or doesn't deal with) a failure can result in a cascade:

    defmodule MyProcess do
      use GenServer

      ...

      def something(data) do
        GenServer.cast(__MODULE__, {:something, data})
      end

      def handle_cast({:something, data}, state) do
        ...
        # query!/3 raises if the connection is down, crashing this process
        Postgrex.query!(state.conn, "lower case sql because we aren't monsters", [])
        ...
        {:noreply, state}
      end
    end
If your DB goes down, this process will crash when something/1 is called. If something/1 is called at a rate greater than the supervisors are configured to accept, it'll take down the app.


One useful piece of advice I recall, perhaps from Mahesh Paolini-Subramanya, was to write defensive code for predictable errors, despite the happy path coding that Erlang allows.

So losing connectivity to a database should not result in a process failure, and thus is something that a supervisor should never have to deal with.
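In Elixir terms, that mostly means preferring the non-raising variants and matching on the error tuple. For instance, the handle_cast from upthread rewritten defensively (state.conn and the SQL are illustrative):

    def handle_cast({:something, data}, state) do
      # query/4 (no bang) returns an error tuple instead of raising, so a
      # dead DB connection becomes a value we handle, not a process crash.
      case Postgrex.query(state.conn, "insert into things values ($1)", [data]) do
        {:ok, _result} ->
          {:noreply, state}

        {:error, %DBConnection.ConnectionError{}} ->
          # Predictable failure: log it, queue the work, or drop it -
          # but don't kill the process.
          {:noreply, state}

        {:error, _other} ->
          {:noreply, state}
      end
    end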

You’re right that there is no silver bullet internal to the language. Even the BEAM itself can fail.


Better yet: add another layer of supervision. Erlang processes are cheap. Might as well take advantage of that.
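For instance, giving the flaky subsystem its own supervisor with a more generous restart budget insulates the rest of the tree from its churn (a sketch; the names and numbers are made up):

    children = [
      MyApp.CoreWorker,
      # A sub-supervisor just for the DB-facing workers:
      %{
        id: MyApp.FlakySup,
        type: :supervisor,
        start: {Supervisor, :start_link,
                [[MyApp.DBWorker],
                 [strategy: :one_for_one, max_restarts: 100, max_seconds: 5]]}
      }
    ]

    Supervisor.start_link(children, strategy: :one_for_one)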


Cascading failures based on repeated failures in a time window are an excellent way of getting hard-to-understand system breakdowns under load, and only under load.

Usually you want infinite restarts (restarts, not retries) with some kind of monitoring at the service level. You want your service provider to always be trying to come up, and some way for it to get attention if that is continuously failing.

(My experience is with Akka, not Erlang, but it did sour me on the supervisor concept without very careful thought. I don't think explicit error handling and supervisors compose well - they almost wholly overlap in behaviour, just at different levels of granularity - but restart policies that bring everything down are an anti-pattern if that logic is in place for a whole service.)


I think the idea is ok, but the defaults are poor for many people.

The idea is: I just restarted this thing so many times, it's clearly not going to start properly, so we need a bigger restart, up to restarting the whole VM. Erlang ships with heart to automatically restart the VM, but not a lot of people use that either.

This works ok if all the software running on a node is deeply related, and the normal startup time is more than the escalation threshold. Then it can catch things like bad code push -> instafail. But that doesn't match my environment very well, and it's easy to accidentally trigger.

Another tricky thing is that the "let it crash" mantra really needs to be moderated: crashing in a request handler often really should be caught to give an appropriate response to the requester, and may need to be caught so that other, independent requests that have already been queued can be processed.
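Something like this, say, in a GenServer-backed handler (do_fetch/2 is a stand-in):

    def handle_call({:fetch, id}, _from, state) do
      # Catch the predictable failure at the edge: one bad request gets an
      # error reply instead of killing the handler (and the independent
      # requests already queued behind it).
      reply =
        try do
          {:ok, do_fetch(id, state)}
        rescue
          e in DBConnection.ConnectionError -> {:error, e.message}
        end

      {:reply, reply, state}
    end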


Thanks! That's really helpful!


How is this different from an app dying because of an unhandled exception in Java or Python?

It sounds like you're not using supervisors the way they are supposed to be used: the supervision tree is not for handling your application logic (I mean application in the OTP sense). The supervision tree is there to ensure that there's always a way to bring your system back to its initial, known-good state. In a concurrent, and even more so in a distributed context, it's practically a given that you'll run into race conditions and other timing issues that result in loads of heisenbugs. Once that happens, you risk your app falling into an invalid state, in which it either ceases to function or continues working but gives wrong results. Erlang addresses this with pervasive immutability, generic servers (in the OTP sense) and the supervision tree.

Immutability eliminates some common race conditions and makes sending messages in a distributed system have the same semantics as local message sends. Generic servers provide a well-defined interface for interacting with running processes no matter where they are, which allows for automated handling of initialization, finalization, error handling and upgrades in a concurrent-safe manner. And finally, supervision trees - as mentioned - define an orderly way of initializing, restarting, and terminating all the processes your application consists of and depends on. The supervision tree is not meant for error handling: its primary function is to bring the system to a known-good initial state. Notice that the supervision tree operations are synchronous, and the order of actions is deterministic and constant across runs, exactly because of this.
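The :rest_for_one strategy is a good illustration of that determinism: children start in a fixed order, and if one dies, everything started after it is taken down and restarted in the same order (sketch; the child modules are illustrative):

    children = [
      MyApp.Repo,      # started first
      MyApp.Cache,     # assumed to depend on Repo
      MyApp.Endpoint   # assumed to depend on both
    ]

    # If Cache dies, Endpoint is stopped too, then Cache and Endpoint are
    # restarted in order - the subtree returns to its known initial state.
    Supervisor.start_link(children, strategy: :rest_for_one)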

If you find your app dying because of "network hiccups", it just means that you don't handle them properly in your code. The supervision hierarchy is not meant - by default, although you could write your own supervisor - to handle application-level problems: that's your job as an application author. If you have a Python client for some REST API and you lose your WiFi connection for a moment, requests will (by default, IIRC) throw an exception, and if you don't handle it, your program will terminate. The difference with Erlang/OTP is that in Python, if at the moment of failure you had 10 threads which had managed to fetch data before the connection broke and were currently munging it, they will also die immediately in a non-deterministic state. In Erlang, they would be shut down in a well-defined order.

TL;DR: supervisors are a low-level concept and should not be used for handling application logic. They are there to organize your processes and to provide a way to return the system to a known, deterministic state. Handling most of the error conditions still belongs to the app author, despite supervisors being there.


> How is this different from an app dying because of an unhandled exception in Java or Python?

It isn't different, which is my point: Supervisors don't magically make running apps with year-long uptime "easy". You're agreeing with me but in a tone (hard to tell, obviously) that says you aren't.

Restart forever w/ backoff isn't the most unreasonable or complicated behaviour to provide. In some cases, writing rigid code can be simpler to test, read and maintain. Go-like error checking/matching can be excessive when you expect something to work 99.9% of the time, and just dying and restarting [until the network re-establishes itself, or a human can look into it] can often be preferred.

Otherwise, people who point to my exaggerated example and say "you're doing it wrong" seem to be implying that supervisors are great as long as you write bug-free code and consider every possibility upfront.


> It isn't different, which is my point: Supervisors don't magically make running apps with year-long uptime "easy".

Well, ok, supervisors are not magic, but they are a convenient tool for handling the specific problem of initializing and shutting down a concurrent system.

> Restart forever w/backoff isn't the most unreasonable or complicated behaviour to provide.

Yes, and you can easily provide it yourself by writing your own supervisor and plugging it into the tree, or even just running it linked to some other process. That's for the backoff - the "forever" part you can get by simply using infinity as the timeout and max restarts, I think.

> seem to be implying that supervisors are great as long as you write bug-free code and consider every possibility upfront.

I don't think I understand what you mean. There's nothing stopping you from cutting a particular supervision tree branch off by not linking it, but monitoring it instead. The restarts will then be contained to that subtree.
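Concretely, something like (a sketch; MyApp.FlakyWorker is hypothetical):

    # Start the risky branch unlinked and monitor it: its exit arrives as
    # a :DOWN message we can react to, rather than a cascading crash.
    {:ok, sup} = Supervisor.start_link([MyApp.FlakyWorker], strategy: :one_for_one)
    Process.unlink(sup)
    ref = Process.monitor(sup)

    receive do
      {:DOWN, ^ref, :process, _pid, reason} ->
        # We decide ourselves whether and when to bring the branch back.
        IO.puts("subtree exited: #{inspect(reason)}")
    end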


Just don't let it crash at a rate greater than the supervisors are configured to accept.

It's easy enough

If you let it crash when it is expected to crash (by not handling the exception), it's your fault.

Supervisors are for unexpected crashes; they are not a way to propagate exception handling to the parent process.


Note HN has hellbanned your account, and your comments are showing up as dead.


Thanks

I know. I've been having this problem for over a year now; they refuse to unban me because I've said a few times something bad about the US and refused to take it back.

Last "offending" post is from nov 2017, on June they were still "angry" at me, eventually I gave up, I don't believe there is free speech here, more than in any other social network

I'm keeping it anyway; I've got tons of posts saved on this account.

Anyway, I'm using this opportunity to post their answer

    From: Scott Bell <scott@ycombinator.com>

    Wed, Jun 13, 2018, 8:38 PM

    Posts that are dead can be vouched for by the community 
    to make them visible and replyable. Perhaps you can let 
    them know?


I am not an expert on the language, but I've written a couple of Elixir/Erlang applications. I would love to learn more about the problem you are describing. With an Elixir umbrella, isn't each child application on its own, so you don't end up with a monolithic structure?

Edit: I think I understand what you mean now. Would you mind reading this answer if you have time? https://stackoverflow.com/a/48823641/1238090


I've heard horror stories about deploying Erlang/Elixir.


The deployment story has improved significantly this year with the release of Distillery 2.0:

https://dockyard.com/blog/2018/08/23/announcing-distillery-2...

Also, it was never as bad as everyone made it seem, but most of the gripes have been remedied :)



