No, it won't. This isn't user land we're talking about, and in general the idea that multiple, isolated processes can do better on the same CPU than a monolithic process doing shared-memory concurrency is ... a myth ;-)
For throughput, separate processes on separate cores with loose synchronisation will do better than a monolith. You don't want to share memory, you want to hand it off to different stages of work.
Consider showing a webpage. You have a network stack, a graphics driver, and the threads of the actual browser process itself. It's substantially easier to avoid bottlenecking on one or more locks (for, say, an open file table, or path lookup, etc.) when the parts of the pipeline are more separated than in a monolithic kernel.
> Lock free concurrency is typically via spinning and retrying, suboptimal when you have real contention.
Lock-free concurrency is typically done by distributing the contention across multiple memory locations / actors, such that at least the happy path is wait-free. The simple compare-and-set schemes have limited utility.
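For illustration, here's a minimal sketch of distributing contention, assuming C11 atomics: a sharded counter in the spirit of Java's LongAdder. Each thread increments its own cache-line-padded slot, so the happy path is a single uncontended atomic add. All names are illustrative.

    #include <stdatomic.h>

    #define NSHARDS 16

    /* Pad each slot to a cache line so shards don't false-share. */
    struct shard {
        _Atomic long value;
        char pad[64 - sizeof(_Atomic long)];
    };

    static struct shard shards[NSHARDS];
    static _Atomic unsigned next_shard;
    static _Thread_local int my_shard = -1;

    /* Happy path: one relaxed atomic add on a shard other threads
     * rarely touch. */
    void counter_inc(void) {
        if (my_shard < 0)
            my_shard = atomic_fetch_add(&next_shard, 1) % NSHARDS;
        atomic_fetch_add_explicit(&shards[my_shard].value, 1,
                                  memory_order_relaxed);
    }

    /* Reads pay the cost instead: they sum all shards. */
    long counter_read(void) {
        long sum = 0;
        for (int i = 0; i < NSHARDS; i++)
            sum += atomic_load_explicit(&shards[i].value,
                                        memory_order_relaxed);
        return sum;
    }

Increments scale with core count because threads don't fight over one cache line; the trade-off is that reads are O(NSHARDS) and only eventually consistent.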
Also, actual lock implementations at the very least start by spinning and retrying, falling back to a scheme where the thread gets put to sleep after a number of failed retries. More advanced schemes that do "optimistic locking" are available for cases in which you have no contention, but those perform worse under contention.
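That spin-then-sleep pattern looks roughly like this minimal Linux-specific sketch (the spin limit and the two-state lock word are simplifications; production locks such as glibc's track waiters so unlock can skip the wake syscall when nobody is sleeping):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100

    static _Atomic int lock_word;  /* 0 = free, 1 = held */

    void lock(void) {
        /* Fast path: spin and retry for a while. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(&lock_word, &expected, 1))
                return;
        }
        /* Slow path: sleep in the kernel until the holder wakes us.
         * FUTEX_WAIT returns immediately if lock_word is no longer 1,
         * so a release between the CAS and the wait is not missed. */
        int expected = 0;
        while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        }
    }

    void unlock(void) {
        atomic_store(&lock_word, 0);
        /* Simplification: always wake one potential waiter. */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }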
> Handing off means to stop using it and letting someone else use it. Only copy in rare cases.
You can't just let "someone else use it", because blocks of memory are usually managed by a single process. Transferring control of a block of memory to another process is a recipe for disaster.
Of course there are copy-on-write schemes, but note that they are managed by the kernel, and they don't work in the presence of garbage collectors or more complicated memory pools. In essence, the problem is that if you're not in charge of a memory location for its entire lifetime, you can't optimize access to it.
In other words, if you want to share data between processes, you have to stream it. And if those processes have to cooperate, then data has to be streamed via pipes.
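As a minimal sketch of that (the message and buffer sizes are illustrative), here is one process handing work to another over a pipe instead of sharing memory:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) return 1;

        if (fork() == 0) {           /* child: the next pipeline stage */
            close(fds[1]);
            char buf[64];
            ssize_t n = read(fds[0], buf, sizeof buf - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("stage 2 received: %s\n", buf);
            }
            return 0;
        }

        close(fds[0]);               /* parent: produce, then hand off */
        const char *msg = "block of work";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        wait(NULL);
        return 0;
    }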
> High performance applications get the kernel out of the way because it slows things down.
Not because the kernel itself is slow, but because system calls are. System calls are expensive because they cause context switches, thrash caches, and introduce latency by blocking on I/O. So the performance of the kernel has nothing to do with it.
You know what else introduces unnecessary context switches? Having multiple processes running in parallel. In the context of a single process making use of multiple threads, you can introduce scheduling schemes (a.k.a. cooperative multi-threading) that are optimal for your process.
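A minimal sketch of such cooperative scheduling inside one process, assuming POSIX ucontext (marked obsolescent, but still widely available, and a fair stand-in for the coroutine machinery real runtimes use): two contexts hand control to each other explicitly, with no kernel scheduler involved at the switch points.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task(void) {
        for (int i = 0; i < 3; i++) {
            printf("task step %d\n", i);
            swapcontext(&task_ctx, &main_ctx);  /* cooperative yield */
        }
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;  /* return here when task ends */
        makecontext(&task_ctx, task, 0);

        for (int i = 0; i < 3; i++) {
            swapcontext(&main_ctx, &task_ctx);  /* resume the task */
            printf("scheduler resumed\n");
        }
        return 0;
    }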
System calls are not the reason the kernel is bypassed. The cost of system calls is fixable: for example, it is possible to batch them into a single system call at the end of an event loop iteration, or even to share a ring buffer with the kernel and talk to it the same way high-performance apps talk to the NIC.

The problem is that the kernel itself doesn't have a high-performance architecture (subsystems, drivers, I/O stacks, etc.), so you can't get far using it, and there is no point investing time into it. And it is this way because a monolithic kernel doesn't push developers into designing architecture and subsystems that talk to each other purely asynchronously with batching; instead, crappy shared-memory designs get adopted because they feel easier to monolithic developers, while in fact being both harder and slower for everyone.
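For concreteness, that ring-buffer approach is essentially what Linux's io_uring provides. A minimal sketch using liburing (kernel 5.6+ for this opcode; compile with -luring; error handling trimmed and the file name is illustrative): several reads are staged in the submission ring shared with the kernel, then handed over with a single syscall.

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define NREQS 4

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(NREQS, &ring, 0);

        int fd = open("/etc/hostname", O_RDONLY);
        static char bufs[NREQS][64];

        /* Stage several requests in the shared submission ring... */
        for (int i = 0; i < NREQS; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], 0);
        }

        /* ...and hand them all to the kernel with one system call. */
        io_uring_submit(&ring);

        for (int i = 0; i < NREQS; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("read %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }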