No, it won't. This isn't user land we're talking about, and in general the idea that multiple, isolated processes can do better on the same CPU than a monolithic process doing shared-memory concurrency is ... a myth ;-)
For throughput, separate processes on separate cores with loose synchronisation will do better than a monolith. You don't want to share memory, you want to hand it off to different stages of work.
Consider showing a webpage. You have a network stack, a graphics driver, and the threads of the actual browser process itself. It's substantially easier to avoid bottlenecking on one or more locks (for, say, an open file table, or path lookup, etc.) when the parts of the pipeline are more separated than in a monolithic kernel.
> Lock free concurrency is typically via spinning and retrying, suboptimal when you have real contention.
Lock-free concurrency is typically done by distributing the contention across multiple memory locations / actors, such that at least the happy path is wait-free. The simple compare-and-set schemes have limited utility.
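For illustration, here's a minimal sketch of distributing contention, assuming C11 atomics: a sharded counter in the spirit of Java's LongAdder. Each thread increments its own cache-line-padded slot, so the happy path is a single uncontended atomic add. All names are illustrative.

    #include <stdatomic.h>

    #define NSHARDS 16

    /* Pad each slot to a cache line so shards don't false-share. */
    struct shard {
        _Atomic long value;
        char pad[64 - sizeof(_Atomic long)];
    };

    static struct shard shards[NSHARDS];
    static _Atomic unsigned next_shard;
    static _Thread_local int my_shard = -1;

    /* Happy path: one relaxed atomic add on a shard other threads
     * rarely touch. */
    void counter_inc(void) {
        if (my_shard < 0)
            my_shard = atomic_fetch_add(&next_shard, 1) % NSHARDS;
        atomic_fetch_add_explicit(&shards[my_shard].value, 1,
                                  memory_order_relaxed);
    }

    /* Reads pay the cost instead: they sum all shards. */
    long counter_read(void) {
        long sum = 0;
        for (int i = 0; i < NSHARDS; i++)
            sum += atomic_load_explicit(&shards[i].value,
                                        memory_order_relaxed);
        return sum;
    }

Increments scale with core count because threads don't fight over one cache line; the trade-off is that reads are O(NSHARDS) and only eventually consistent.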
Also, actual lock implementations at the very least start by spinning and retrying, falling back to a scheme where the thread gets put to sleep after a number of failed retries. More advanced schemes that do "optimistic locking" are available for cases in which you have no contention, but those perform worse under contention.
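That spin-then-sleep pattern looks roughly like this minimal Linux-specific sketch (the spin limit and the two-state lock word are simplifications; production locks such as glibc's track waiters so unlock can skip the wake syscall when nobody is sleeping):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100

    static _Atomic int lock_word;  /* 0 = free, 1 = held */

    void lock(void) {
        /* Fast path: spin and retry for a while. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(&lock_word, &expected, 1))
                return;
        }
        /* Slow path: sleep in the kernel until the holder wakes us.
         * FUTEX_WAIT returns immediately if lock_word is no longer 1,
         * so a release between the CAS and the wait is not missed. */
        int expected = 0;
        while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        }
    }

    void unlock(void) {
        atomic_store(&lock_word, 0);
        /* Simplification: always wake one potential waiter. */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }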
> Handing off means to stop using it and letting someone else use it. Only copy in rare cases.
You can't just let "someone else use it", because blocks of memory are usually managed by a single process. Transferring control of a block of memory to another process is a recipe for disaster.
Of course there are copy-on-write schemes, but note that they are managed by the kernel, and they don't work in the presence of garbage collectors or more complicated memory pools. In essence, the problem is that if you're not in charge of a memory location for its entire lifetime, you can't optimize access to it.
In other words, if you want to share data between processes, you have to stream it. And if those processes have to cooperate, then data has to be streamed via pipes.
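As a minimal sketch of that (the message and buffer sizes are illustrative), here is one process handing work to another over a pipe instead of sharing memory:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) return 1;

        if (fork() == 0) {           /* child: the next pipeline stage */
            close(fds[1]);
            char buf[64];
            ssize_t n = read(fds[0], buf, sizeof buf - 1);
            if (n > 0) {
                buf[n] = '\0';
                printf("stage 2 received: %s\n", buf);
            }
            return 0;
        }

        close(fds[0]);               /* parent: produce, then hand off */
        const char *msg = "block of work";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        wait(NULL);
        return 0;
    }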
> High performance applications get the kernel out of the way because it slows things down.
Not because the kernel itself is slow, but because system calls are. System calls are expensive because they cause context switches, thrash caches, and introduce latency by blocking on I/O. So the performance of the kernel has nothing to do with it.
You know what else introduces unnecessary context switches? Having multiple processes running in parallel. In the context of a single process making use of multiple threads, you can introduce scheduling schemes (a.k.a. cooperative multi-threading) that are optimal for your process.
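A minimal sketch of such cooperative scheduling inside one process, assuming POSIX ucontext (marked obsolescent, but still widely available, and a fair stand-in for the coroutine machinery real runtimes use): two contexts hand control to each other explicitly, with no kernel scheduler involved at the switch points.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];

    static void task(void) {
        for (int i = 0; i < 3; i++) {
            printf("task step %d\n", i);
            swapcontext(&task_ctx, &main_ctx);  /* cooperative yield */
        }
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;  /* return here when task ends */
        makecontext(&task_ctx, task, 0);

        for (int i = 0; i < 3; i++) {
            swapcontext(&main_ctx, &task_ctx);  /* resume the task */
            printf("scheduler resumed\n");
        }
        return 0;
    }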
System calls are not the reason the kernel is bypassed. The cost of system calls is fixable: for example, it is possible to batch them into a single system call at the end of an event loop iteration, or even to share a ring buffer with the kernel and talk to it the same way high-performance apps talk to the NIC.

The problem is that the kernel itself doesn't have a high-performance architecture (subsystems, drivers, I/O stacks, etc.), so you can't get far using it, and there is no point investing time into it. And it is this way because a monolithic kernel doesn't push developers into designing architecture and subsystems that talk to each other purely asynchronously with batching; instead, crappy shared-memory designs get adopted because they feel easier to monolithic developers, while in fact being both harder and slower for everyone.
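For concreteness, that ring-buffer approach is essentially what Linux's io_uring provides. A minimal sketch using liburing (kernel 5.6+ for this opcode; compile with -luring; error handling trimmed and the file name is illustrative): several reads are staged in the submission ring shared with the kernel, then handed over with a single syscall.

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define NREQS 4

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(NREQS, &ring, 0);

        int fd = open("/etc/hostname", O_RDONLY);
        static char bufs[NREQS][64];

        /* Stage several requests in the shared submission ring... */
        for (int i = 0; i < NREQS; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], 0);
        }

        /* ...and hand them all to the kernel with one system call. */
        io_uring_submit(&ring);

        for (int i = 0; i < NREQS; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("read %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }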