
Author here. The benchmark part could be clearer; I acknowledge that.

Interestingly, when working with a limited number of threads, the thread approach is actually faster in that benchmark. So in practical applications, the differences are marginal and likely lean towards threads.

But even if this weren't the case, context matters. A 10ms discrepancy in a web request might be acceptable. However, in a high-performance networking application - which, let's be honest, isn't a common project for most - it could be significant.



If you measure the pure latency of a single request (HTTP, RPC, whatever), the latency difference between any async and non-async implementation should be microseconds at most, never milliseconds. If it's more, then something in the implementation is off. And as you mentioned, threads might even be faster, because there is no need to switch between threads (as in a multithreaded async runtime), nor are additional syscalls needed (just read, not epoll_wait plus read).
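A minimal sketch of that syscall difference (not from the thread, names illustrative): a blocking socket read is a single recv(), while a readiness-based read needs an epoll-style wait (here via Python's selectors module) *plus* the recv().

```python
# Sketch: one syscall for a blocking read vs. two for a
# readiness-based read (wait + read). Illustrative only.
import selectors
import socket

a, b = socket.socketpair()

# Path 1: plain blocking read -- a single recv() syscall.
a.sendall(b"ping")
data_blocking = b.recv(4)

# Path 2: readiness-based read -- epoll_wait (via selectors), then recv().
sel = selectors.DefaultSelector()
b.setblocking(False)
sel.register(b, selectors.EVENT_READ)
a.sendall(b"pong")
events = sel.select(timeout=1)   # syscall 1: wait until readable
data_async = b.recv(4)           # syscall 2: the actual read

print(data_blocking, data_async)
sel.close(); a.close(); b.close()
```

Per request that second syscall costs on the order of microseconds, which is why the single-request latency gap stays far below milliseconds.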

Async runtimes can perform better at scale, or reduce resource usage at scale, where "at scale" means a concurrency level of >= 10k in the last benchmarks I did on this.
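To make the "at scale" point concrete, here is a toy sketch (my own, not from the benchmarks above): an async runtime multiplexing 10k concurrent "connections", modelled as tasks that yield to the event loop, all on a single OS thread.

```python
# Sketch: 10k concurrent tasks on one event loop.
# fake_connection stands in for a real socket handler.
import asyncio

async def fake_connection(i: int) -> int:
    await asyncio.sleep(0)  # yield, as an await on a socket would
    return i

async def main() -> int:
    results = await asyncio.gather(
        *(fake_connection(i) for i in range(10_000))
    )
    return len(results)

handled = asyncio.run(main())
print(handled)
```

Each task costs a small heap object rather than a kernel thread with its own stack, which is where the resource savings at high concurrency come from.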


Concurrency (async) is rarely faster than parallelism (OS threads), across almost any language that supports both. If you know you don't need obscene scalability (1000 connections is pushing the edge of what's reasonable with parallelism), then stick with parallelism. If you overuse parallelism, expect your entire system (OS and all) to grind to a halt through context switching.


You'd be surprised. 10k threads is more than manageable on Linux.
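A quick sketch of the thread-per-connection shape (mine, not the commenter's): park many OS threads on a blocking wait, then wake and join them all. The count is scaled down to 1,000 here to keep the demo quick; on a typical Linux box the same code runs fine with N = 10_000.

```python
# Sketch: N OS threads each blocking in the kernel on one event.
import threading

N = 1_000  # scaled down for the demo; try 10_000 on Linux
go = threading.Event()
done = []
lock = threading.Lock()

def worker() -> None:
    go.wait()          # each thread blocks until released
    with lock:
        done.append(1)

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads:
    t.start()
go.set()               # release every thread at once
for t in threads:
    t.join()
print(len(done))
```

The kernel scheduler, not userspace, carries the multiplexing load here, and modern Linux handles these counts comfortably.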


paging Dan Kegel…


Lol, fair enough, but the C10k problem these days does have a "just use OS threads" solution. It wasn't free; it took a lot of work across the industry. Computers have gotten both faster per core and wider in core counts. And kernels have spent the last couple of decades really working hard on their schedulers and kernel/user sync primitives to handle those high thread counts.

The native model falls apart under C10M, but to be fair, so does the traditional epoll/kqueue/IOCP-dispatching-coroutines model of solving C10k. That's where you have to start keeping the network stack and the application data plane colocated in the same context as much as possible. That can be done with something like DPDK to keep both in user space, or the way Netflix is known for their FreeBSD work making kTLS and sendfile work together to keep the data plane entirely in the kernel.



