Haskell will use less memory on this task. Since it has a shared heap, it doesn't have to allocate a small heap per process, so it is expected to have lower overhead. Furthermore, static typing means Haskell needs fewer type tags, and that tends to work in its favor. As the system grows in complexity these things even out a bit more, but I would still expect Haskell to use about half the memory of Erlang.
The Elixir numbers for Phoenix sound off. At 83765 megabytes and 1999984 connections, that is about 42 kilobytes per connection. That count is about an order of magnitude over what I would expect it to be. How much of that memory is kernel-allocated network buffer space, and how much is buffer space in the Erlang runtime? A "raw" process in Erlang is around 1.5 kilobytes nowadays, including stack and heap, so where are the additional 40 kilobytes allocated? I don't think we have 20 extra processes per connection for some reason :) Definitely something to look into.
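A quick back-of-the-envelope check of those numbers (using the 1.5 KB raw-process figure quoted above):

```python
# Sanity check of the reported Phoenix benchmark numbers.
total_mb = 83_765          # reported resident memory, megabytes
connections = 1_999_984    # reported concurrent connections

per_conn_kb = total_mb * 1024 / connections
print(f"{per_conn_kb:.1f} KB per connection")   # ~42.9 KB

raw_process_kb = 1.5       # rough size of a bare Erlang process (stack + heap)
print(f"~{per_conn_kb / raw_process_kb:.0f}x a raw process")  # ~29x
```

That factor of roughly 29 over a bare process is what makes the number smell like buffer space rather than process overhead.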
Tsung is an old application. It isn't really written in ways that make it efficient at the network level, and that shows in this benchmark. Furthermore, Tsung does more work than the broadcast in Haskell, so it is expected that the load generator will give up long before the server. Again, measure the amount of memory allocated by the kernel and by the userland process in order to determine which of the two you hit first. Still, I would expect Tsung to be the culprit.
I don't have hard numbers on where our 40kb per client is being allocated, but to be fair, we are doing a lot more work than the Haskell example. Our "Channels" layer is a full-featured part of the framework. For each WebSocket connection, we start a WebSocket transport process that multiplexes channel servers that subscribe to different topics. Each of these lives underneath a supervisor process as well that monitors the servers. So out of the gate we are starting three processes per connection (one for the supervisor, one for the transport, and one for the single "rooms:lobby" channel). We set up monitors to detect the channel crashing/closing so we can notify the client that the channel went away. These things have overhead. I think we have room to optimize our conn size, but it's worth mentioning it's not a raw WS vs WS comparison.
In addition, the classic PubSub pattern here screams for the Disruptor pattern (which I first heard about from Trisha Gee and Martin Thompson). The Erlang runtime has no good direct support for this kind of pattern, so you have to opt for things such as ETS to simulate it. It'll work, albeit with some overhead.
We're using ets for our PubSub layer (which sits under Channels). I'm not familiar with the Disruptor pattern, but we've been extremely happy with ets. The latest optimizations to come out of our benchmarks have us sharding PubSub subscribers by Pid into ets tables managed by pooled PubSub servers. Our PubSub layer is also distributed out of the box. So the flow is ets tables for "local" subscribers on each node, then we use a pg2 group to bridge the broadcasts across the cluster. The pg2 bridge is our next area for stress testing.
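Not Phoenix's actual implementation, but the sharding idea can be sketched in a few lines of Python, with plain dicts standing in for the per-shard ets tables (names and the shard count are made up):

```python
NUM_SHARDS = 8  # hypothetical pool size

# One subscriber table per shard; in the real system each would be an ets
# table owned by a pooled PubSub server. Here: topic -> set of subscriber pids.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(pid):
    # Shard subscribers by pid so no single table becomes a hot spot.
    return shards[hash(pid) % NUM_SHARDS]

def subscribe(pid, topic):
    shard_for(pid).setdefault(topic, set()).add(pid)

def broadcast(topic, message, deliver):
    # A local broadcast walks the topic's entry in every shard.
    for shard in shards:
        for pid in shard.get(topic, ()):
            deliver(pid, message)

inboxes = {}
subscribe("pid_1", "rooms:lobby")
subscribe("pid_2", "rooms:lobby")
broadcast("rooms:lobby", "hello",
          lambda pid, msg: inboxes.setdefault(pid, []).append(msg))
```

The point of the sharding is that subscribes from different processes contend on different tables, while a broadcast pays a fixed cost of one lookup per shard.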
Relatively high compared to what? With GHC we have a single-word header on objects, which compares favorably to C#, which usually has two-word headers, or Java, which similarly uses one word. Of course, GC-less languages like Rust or C++ usually have no tags at all, but I think it makes more sense to compare among GC'd languages.
Why is this there? Is it to facilitate things like Typeable? I believe that there's no language-level way to do things like runtime type reflection. And even if there were, how would one express a complex type like (Vector (forall a. MyTypeClass a => a -> Int, String))?
I'm also curious how dependently typed languages like Idris, which presumably must have runtime access to type information, handle this stuff.
For values, laziness means there is a tag bit for whether a value is a thunk or evaluated. Sum types use tags to determine which variant is active.
For functions, because a function that takes two arguments and returns a value (a -> a -> a) has the same type as a function that takes one argument and returns a function from the remaining argument to the result (a -> (a -> a)), the arity of a function is stored in its tag.
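The arity point can be made concrete outside Haskell: in Python the two shapes are visibly different objects because application is explicit, whereas in Haskell both are written `a -> a -> a`, so the runtime has to record the arity somewhere (an illustrative sketch, not GHC's actual closure representation):

```python
# In Haskell, `add2` and `add_curried` would have the identical type
# a -> a -> a; the runtime tells them apart by an arity tag on the closure.

def add2(a, b):          # "arity 2": needs both arguments at once
    return a + b

def add_curried(a):      # "arity 1": returns another function
    def inner(b):
        return a + b
    return inner

print(add2(1, 2))          # 3
print(add_curried(1)(2))   # 3
# At a call site like `f 1 2`, GHC must check which shape f has,
# which is exactly the arity tag check described above.
```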
Some of these tags are eliminated by inlining, but if you sit down and read some typical Haskell output you'll see a _whole lot_ of tag checks.
Source: spent a lot of time reading GHC output and writing high-performance Haskell code.
What is the state-of-the-art POSIX bench tool, in your opinion? I'm aware of things like Gatling but not sure where on the spectrum it sits in terms of features or popularity.
I don't think anything has the flexibility of Tsung, to be honest. It can already test many different protocols. A better approach would probably be to optimize it a bit for lower memory usage.
For web server benchmarking, only wrk2 by Gil Tene does things correctly. Everything else usually suffers from coordinated omission:
Imagine you have 10,000 connections, each doing 3 req/s. Say one connection blocks for 1 second, which means that 2 requests should have fired on that connection "in between". wrk2 will count those two as being "late", whereas most other load generators won't count them at all. This means a framework can opt to "stall" some connections in order to get better performance and fewer bad results in the upper latencies.
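A toy model of the accounting difference (synthetic numbers, not wrk2's actual algorithm):

```python
INTERVAL = 1 / 3        # one connection scheduled at 3 req/s
STALL = 1.0             # the first request blocks for one second
FAST = 0.001            # every other request is serviced quickly
N = 9                   # three seconds of intended schedule

# Closed-loop ("naive") generator: it only measures requests it actually
# sent. The two requests due during the stall are never sent, so they
# never appear in the latency distribution at all.
naive = [STALL] + [FAST] * (N - 3)

# wrk2-style accounting: every request keeps its *intended* send time,
# and latency is measured from that intended time, so queueing behind
# the stall counts against the server.
corrected = []
free_at = 0.0
for i in range(N):
    intended = i * INTERVAL
    start = max(intended, free_at)            # wait for the connection
    svc = STALL if i == 0 else FAST
    corrected.append(start - intended + svc)  # queueing delay + service
    free_at = start + svc

slow = lambda xs: sum(1 for x in xs if x > 0.1)
print(slow(naive), slow(corrected))  # the two "in between" requests reappear
```

The naive generator reports one slow request; the corrected accounting reports three, because the two requests that should have fired during the stall are charged the time they spent queued behind it.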
As an example, here are the Erlang/Cowboy numbers for such a test in wrk2:
Note how the median latency and the 75th percentile are better for Haskell, but that it occasionally stalls requests for quite some time, probably due to a GC pause or some other deferred cleanup that then, at an unfortunate moment, all has to happen at the same point in time.
If you go look at typical benchmarks their latency reporting is way off compared to this, which is a surefire way of knowing they did not account for coordinated omission.
Mind, when benchmarks disagree, the trick is to explain why. It often leads to an insight into a design difference.