TBH, at the load we were running, we got substantial savings by eventually replacing gen_server as well, though that probably isn't a good idea most of the time.
The OOMs were largely being caused by calls to and from other services (e.g. Kafka), so the answer proved to be controlling the rate at which things come in and go out at the very edge.
From what I saw, I got the impression the BEAM devs assumed memory and CPU usage rise together, so a system under memory pressure would also be under CPU pressure. That isn't the case if your fan-out-and-gather involves holding large* values on which the response is based, even if only for tiny amounts of time.
EDIT: *large meaning "surprisingly small" if you're coming from other universes.
The underlying problem we had* was that the rate at which work was being completed was lower than the rate at which requests were coming in, which causes the mailboxes on the actors to grow indefinitely.
In golang the approximate equivalent is a buffered channel, which starts blocking once it fills up, but the BEAM will just keep piling messages onto the queue as fast as possible until it OOMs. This is obviously a philosophical tradeoff.
* I should qualify that each request here had a lifetime in the low double-digit milliseconds, but there were millions of them per second, and these were very big machines.
Why weren't your processes dropping messages? Also, I think you can tell the VM not to let a process exceed a certain message queue size, and trigger some sort of rate limiting or scaling out.
Edit: huh. I could swear the VM had memory limit options. Guess not. Time to rewrite it in zig!
Yeah, I think that's the assumption people had been operating under.
That team would have thoroughly endorsed a zig rewrite! It was a very odd situation where most of us liked erlang the language but found the beam to be an annoying beast, whereas most of the world seems to be the opposite.