Haskell will use less memory on this task. Since it has a shared heap, it doesn't have to allocate a small heap per process, so it is expected to have lower overhead. Furthermore, static typing means Haskell needs fewer type tags, and that tends to work in its favor. As the system grows in complexity these things even out a bit more, but I would still expect Haskell to use about half the memory of Erlang.
The Elixir numbers for Phoenix sound off. At 83765 megabytes and 1999984 connections, that is about 42 kilobytes per connection. That count is about an order of magnitude over what I would expect it to be. How much of that memory is kernel-allocated network buffer space, and how much is buffer space in the Erlang runtime? A "raw" process in Erlang is around 1.5 kilobytes nowadays, including stack and heap, so where are the additional 40 kilobytes allocated? I don't think we have 20 extra processes per connection for some reason :) Definitely something to look into.
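A quick back-of-the-envelope check of those numbers (using the 1.5 KB raw-process figure quoted above):

```python
# Sanity check of the reported Phoenix benchmark numbers.
total_mb = 83_765          # reported resident memory, megabytes
connections = 1_999_984    # reported concurrent connections

per_conn_kb = total_mb * 1024 / connections
print(f"{per_conn_kb:.1f} KB per connection")   # ~42.9 KB

raw_process_kb = 1.5       # rough size of a bare Erlang process (stack + heap)
print(f"~{per_conn_kb / raw_process_kb:.0f}x a raw process")  # ~29x
```

That factor of roughly 29 over a bare process is what makes the number smell like buffer space rather than process overhead.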
Tsung is an old application. It isn't really written in ways that make it efficient at the network level, and that shows in this benchmark. Furthermore, Tsung does more work than the broadcast in Haskell, so it is expected that the load generator will give up long before the server. Again, measure the amount of memory allocated by the kernel and by the userland process in order to determine which of the two you hit first. Still, I would expect Tsung to be the culprit.
I don't have hard numbers on where our 40kb per client is being allocated, but to be fair, we are doing a lot more work than the Haskell example. Our "Channels" layer is a full-featured part of the framework. For each WebSocket connection, we start a WebSocket transport process that multiplexes channel servers that subscribe to different topics. Each of these lives underneath a supervisor process as well that monitors the servers. So out of the gate we are starting three processes per connection (one for the supervisor, one for the transport, and one for the single "rooms:lobby" channel). We set up monitors to detect the channel crashing/closing so we can notify the client that the channel went away. These things have overhead. I think we have room to optimize our conn size, but it's worth mentioning it's not a raw WS vs WS comparison.
In addition, the classic PubSub pattern here screams for the Disruptor pattern (which I first heard about from Trisha Gee and Martin Thompson). The Erlang runtime has no good direct support for this kind of pattern, so you have to opt for things such as ETS to simulate it. It'll work, albeit with some overhead.
We're using ets for our PubSub layer (which sits under Channels). I'm not familiar with the Disruptor pattern, but we've been extremely happy with ets. The latest optimizations to come out of our benchmarks have us sharding PubSub subscribers by Pid into ets tables managed by pooled PubSub servers. Our PubSub layer is also distributed out of the box. So the flow is ets tables for "local" subscribers on each node, then we use a pg2 group to bridge the broadcasts across the cluster. The pg2 bridge is our next area for stress testing.
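Not Phoenix's actual implementation, but the sharding idea can be sketched in a few lines of Python, with plain dicts standing in for the per-shard ets tables (names and the shard count are made up):

```python
NUM_SHARDS = 8  # hypothetical pool size

# One subscriber table per shard; in the real system each would be an ets
# table owned by a pooled PubSub server. Here: topic -> set of subscriber pids.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(pid):
    # Shard subscribers by pid so no single table becomes a hot spot.
    return shards[hash(pid) % NUM_SHARDS]

def subscribe(pid, topic):
    shard_for(pid).setdefault(topic, set()).add(pid)

def broadcast(topic, message, deliver):
    # A local broadcast walks the topic's entry in every shard.
    for shard in shards:
        for pid in shard.get(topic, ()):
            deliver(pid, message)

inboxes = {}
subscribe("pid_1", "rooms:lobby")
subscribe("pid_2", "rooms:lobby")
broadcast("rooms:lobby", "hello",
          lambda pid, msg: inboxes.setdefault(pid, []).append(msg))
```

The point of the sharding is that subscribes from different processes contend on different tables, while a broadcast pays a fixed cost of one lookup per shard.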
Relatively high compared to what? With GHC we have a single-word header on objects, which compares favorably to C#, which usually has two-word headers, or Java, which similarly uses one word. Of course, GC-less languages like Rust or C++ usually have no tags at all, but I think it makes more sense to compare among GC'd languages.
Why is this there? Is it to facilitate things like Typeable? I believe that there's no language-level way to do things like runtime type reflection. And even if there were, how would one express a complex type like (Vector (forall a. MyTypeClass a => a -> Int, String))?
I'm also curious how dependently typed languages like Idris, which presumably must have runtime access to type information, handle this stuff.
For values, laziness means there is a tag bit for whether a value is a thunk or evaluated. Sum types use tags to determine which variant is active.
For functions, because a function that takes two arguments and returns a value (a -> a -> a) has the same type as a function that takes one argument and returns a function from the remaining argument to the result (a -> (a -> a)), the arity of a function is stored in its tag.
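The arity point can be made concrete outside Haskell: in Python the two shapes are visibly different objects because application is explicit, whereas in Haskell both are written `a -> a -> a`, so the runtime has to record the arity somewhere (an illustrative sketch, not GHC's actual closure representation):

```python
# In Haskell, `add2` and `add_curried` would have the identical type
# a -> a -> a; the runtime tells them apart by an arity tag on the closure.

def add2(a, b):          # "arity 2": needs both arguments at once
    return a + b

def add_curried(a):      # "arity 1": returns another function
    def inner(b):
        return a + b
    return inner

print(add2(1, 2))          # 3
print(add_curried(1)(2))   # 3
# At a call site like `f 1 2`, GHC must check which shape f has,
# which is exactly the arity tag check described above.
```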
Some of these tags are eliminated by inlining, but if you sit down and read some typical Haskell output you'll see a _whole lot_ of tag checks.
Source: spent a lot of time reading GHC output and writing high-performance Haskell code.
What is the state-of-the-art POSIX bench tool, in your opinion? I'm aware of things like Gatling but not sure where on the spectrum it sits in terms of features or popularity.
I don't think anything has the flexibility of Tsung, to be honest. It can already test many different protocols. A better approach would probably be to optimize it a bit for lower memory usage.
For web server benchmarking, only wrk2 by Gil Tene does things correctly. Everything else usually suffers from coordinated omission:
Imagine you have 10,000 connections, each doing 3 req/s. Say one connection blocks for 1 second, which means that 2 requests should have fired on that connection "in between". wrk2 will count those two as being "late", whereas most other load generators won't count them at all. This means a framework can opt to "stall" some connections in order to get better performance and fewer bad results in the upper latencies.
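A toy model of the accounting difference (synthetic numbers, not wrk2's actual algorithm):

```python
INTERVAL = 1 / 3        # one connection scheduled at 3 req/s
STALL = 1.0             # the first request blocks for one second
FAST = 0.001            # every other request is serviced quickly
N = 9                   # three seconds of intended schedule

# Closed-loop ("naive") generator: it only measures requests it actually
# sent. The two requests due during the stall are never sent, so they
# never appear in the latency distribution at all.
naive = [STALL] + [FAST] * (N - 3)

# wrk2-style accounting: every request keeps its *intended* send time,
# and latency is measured from that intended time, so queueing behind
# the stall counts against the server.
corrected = []
free_at = 0.0
for i in range(N):
    intended = i * INTERVAL
    start = max(intended, free_at)            # wait for the connection
    svc = STALL if i == 0 else FAST
    corrected.append(start - intended + svc)  # queueing delay + service
    free_at = start + svc

slow = lambda xs: sum(1 for x in xs if x > 0.1)
print(slow(naive), slow(corrected))  # the two "in between" requests reappear
```

The naive generator reports one slow request; the corrected accounting reports three, because the two requests that should have fired during the stall are charged the time they spent queued behind it.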
As an example, here are the Erlang/Cowboy numbers for such a test in wrk2:
Note how the median latency and the 75th percentile are better for Haskell, but that it occasionally stalls requests for quite some time, probably due to a GC pause or some other deferred cleanup that then, at an unfortunate moment, all has to happen at the same point in time.
If you go look at typical benchmarks their latency reporting is way off compared to this, which is a surefire way of knowing they did not account for coordinated omission.
Mind, when benchmarks disagree, the trick is to explain why. It often leads to an insight into a design difference.