
The Oracle database adopted process-level parallelism, using System V IPC. Threading is used on Windows for performance reasons, but on UNIX each client gets its own server process by default.

This architecture reflects the original design intent of "Columbus UNIX":

"CB UNIX was developed to address deficiencies inherent in Research Unix, notably the lack of interprocess communication (IPC) and file locking, considered essential for a database management system... The interprocess communication features developed for CB UNIX were message queues, semaphores and shared memory support. These eventually appeared in mainstream Unix systems starting with System V in 1983, and are now collectively known as System V IPC."
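To make the quote concrete, here's a minimal sketch (mine, not Oracle's code) of two of those primitives — a shared-memory segment plus a semaphore used as a mutex — via the classic `sys/ipc.h` APIs; the keys are `IPC_PRIVATE` just to keep the demo self-contained:

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>

/* Create, use, and tear down a private shared-memory segment guarded
 * by a SysV semaphore -- the primitives an SGA-style design relies on. */
int sysv_demo(void)
{
    /* Shared memory: any process that knows the id can attach it. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) return -1;
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (void *)-1) return -1;
    strcpy(mem, "hello from shared memory");

    /* Semaphore set with one semaphore, initialized to 1 (a mutex). */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) return -1;
    union semun { int val; } arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);

    struct sembuf down = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct sembuf up   = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };
    semop(semid, &down, 1);             /* enter critical section */
    int ok = (strcmp(mem, "hello from shared memory") == 0);
    semop(semid, &up, 1);               /* leave critical section */

    /* Tear down: detach and mark the ids for removal. */
    shmdt(mem);
    shmctl(shmid, IPC_RMID, NULL);
    semctl(semid, 0, IPC_RMID);
    return ok ? 0 : -1;
}
```

(Message queues, the third primitive, follow the same `msgget`/`msgsnd`/`msgrcv` pattern.)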

This approach has realized some degree of success.

https://en.m.wikipedia.org/wiki/CB_UNIX



Postgres also uses a multi-process architecture. But I think that turned out to be a mistake for something like a database on modern systems.

There are other reasons, but the biggest problem is that inter-process context switches are considerably more expensive than intra-process ones, with far less efficient use of the TLB being a big part of that. It used to be worse before things like process-context identifiers (PCIDs), but even with them you waste a large portion of the TLB storing redundant information.


Oracle still managed to be the TPC-C performance leader from 12/2010 until OceanBase took the crown in 2019 (I don't think Oracle cares anymore).

https://www.tpc.org/tpcc/results/tpcc_results5.asp?print=fal...

https://www.alibabacloud.com/blog/oceanbase-breaks-tpc-c-rec...

They did this with an SGA (Shared Global Area) that (I'm assuming) pollutes the Translation Lookaside Buffer (TLB) with different addresses for this shared memory in every process.


Yeah, it's not something that's going to stop you dead in your tracks, but it does show up noticeably in profiles. There are things you can do to reduce the pain, like using gigantic pages for the shared memory and remapping your executable's read-only sections at runtime so that your code is backed by huge pages. But even after that the cost is noticeable.


Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.


> Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.

Yep. Lots of architectural designs out there are based on I/O latency being an order of magnitude or two higher than it is now, while memory latency has only shrunk modestly.

To be clear, using one process with loads of threads has its own set of issues. It's much easier to hit contention in the kernel, e.g. the infamous mmap_sem in Linux.


Wouldn't you just have a process per core and avoid context switches that way? Or does it require a process pool with more processes than that because everything is I/O bound?


That's not a panacea either. You either need to get the client connection's file descriptor into another process, or you incur the overhead of marshalling the query and query results over IPC.
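For the first option, the standard UNIX mechanism is passing the descriptor as `SCM_RIGHTS` ancillary data over a Unix-domain socket. A sketch (my helper names, not from any particular database):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Ship one open file descriptor to another process over a
 * Unix-domain socket; the kernel installs a duplicate of the
 * descriptor in the receiver. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent with send_fd(); returns it, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof u.buf };
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

It works, but it's a syscall round-trip per hand-off plus extra descriptor table entries, which is part of the cost being weighed here.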


Interesting. Can't the different processes all listen for connections on the same socket (e.g. using EPOLLEXCLUSIVE or somesuch)? I kind of agree with you, though, that if you need some kind of fine-grained coordination between processes using shared memory, then the failure of one process could still leave the shared memory in a bad state. I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance. But I am really keen to understand if you actually lose anything in terms of performance.
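For reference, the shared-listener idea is a one-flag change at registration time — a sketch assuming Linux 4.5+ (where `EPOLLEXCLUSIVE` was added):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd in a fresh epoll instance with EPOLLEXCLUSIVE, so that
 * when several processes watch the same listening socket the kernel
 * wakes only one waiter per incoming event instead of all of them
 * (avoiding the thundering herd).  Returns the epoll fd, or -1. */
int make_exclusive_epoll(int fd)
{
    int ep = epoll_create1(0);
    if (ep < 0) return -1;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }
    return ep;
}
```

Note the flag is only valid with `EPOLL_CTL_ADD`, not `EPOLL_CTL_MOD`. This gets connections distributed across processes; it doesn't address the shared-memory coordination question.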


> I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance.

That's an often-cited advantage, and I think it's largely bogus. In Postgres the only process that benefits from it is the "supervisor" process ("postmaster"). If any of the other processes crashes or exits unexpectedly, all the others are restarted as well, and the database comes back up via crash recovery. The supervisor could stay a distinct process even when otherwise using threads. (And no, restarting via systemd etc. doesn't achieve the same thing; we continue to hold onto sockets and shared memory across such crash restarts.)
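The supervision pattern described there boils down to something like this sketch (mine, greatly simplified from what postmaster actually does):

```c
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

/* Postmaster-style supervision sketch: wait for any worker to exit.
 * If it died abnormally, take down the remaining workers too and tell
 * the caller to re-fork everything after crash recovery.  The
 * supervisor itself keeps running -- holding its listen sockets and
 * shared memory -- which is why it can stay a separate process even
 * in an otherwise threaded design.  Returns 1 if recovery is needed. */
int supervise(pid_t *workers, int n)
{
    int status;
    pid_t dead = wait(&status);          /* block until a worker exits */
    int crashed = !WIFEXITED(status) || WEXITSTATUS(status) != 0;
    if (!crashed)
        return 0;                        /* clean exit: nothing to do  */
    for (int i = 0; i < n; i++)          /* crash: stop the siblings   */
        if (workers[i] != dead)
            kill(workers[i], SIGQUIT);
    while (wait(NULL) > 0)               /* reap them all              */
        ;
    return 1;                            /* caller re-forks + recovers */
}
```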

> But I am really keen to understand if you actually lose anything in terms of performance.

Much higher TLB hit rate - separate processes will have separate entries in the TLB, leading to lower hit rates.

Even aside from TLB handling, cross-thread context switches are cheaper than cross-process ones.

The ability to dynamically change the amount of shared memory, without needing to play tricks with pre-reserving memory ranges or dealing with relative pointers, makes it easier to improve performance in a lot of areas.

Not needing file descriptors open N times makes things cheaper.

The ability to easily hand off file descriptors from one thread to another makes it easier to achieve higher CPU utilization.


> utilizing System V IPC

Hmm, that's a bit more complex than what I'd put at #1. I'd probably put System V IPC closer to #2 ("use a mutex") levels of complications.

System V Shared memory + Semaphores is definitely "as complicated" as pthread mutexes and semaphores.

But messages, signals, pipes, and other process-level IPC are much simpler. I guess System V IPC exists for that shady region "between" the simple high-level stuff and the complex low-level mutexes / semaphores.

Maybe "1.75", if I were to put it in my list above somewhere. Closer to mutexes in complexity, but still simpler in some respects. It depends on which bits of System V IPC you use; some are easier than others.

---------

The main benefit of processes is that startup and shutdown behavior is very well defined. Something like a pipe, an mmap, or other I/O has a defined beginning and end; all sockets get closed properly, and so forth.

System V throws a monkey wrench into that, because the semaphore or shared memory is "owned by the kernel", so to speak. A sem_wait() is not necessarily going to be followed by a matching sem_post(), especially if a process dies in a critical region.
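SysV semaphores actually have a built-in answer to that exact failure mode: the `SEM_UNDO` flag, which makes the kernel reverse a process's pending semaphore operations when it dies. A sketch (my demo code, Linux SysV semaphores assumed):

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demonstrate SEM_UNDO: a child takes the semaphore and dies inside
 * the "critical region" without releasing it.  Because the decrement
 * was done with SEM_UNDO, the kernel rolls it back at process exit,
 * so the semaphore does not stay stuck at 0.  Returns the final
 * semaphore value (1 if the undo worked). */
int undo_demo(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) return -1;
    union semun { int val; } arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);

    pid_t pid = fork();
    if (pid == 0) {
        struct sembuf down = { .sem_num = 0, .sem_op = -1,
                               .sem_flg = SEM_UNDO };
        semop(semid, &down, 1);   /* enter critical region...      */
        _exit(1);                 /* ...and die without releasing  */
    }
    waitpid(pid, NULL, 0);

    int val = semctl(semid, 0, GETVAL);   /* kernel undid the -1 */
    semctl(semid, 0, IPC_RMID);
    return val;
}
```

It only restores the counter, of course; whatever half-finished state the dead process left in the shared memory it was protecting is still your problem.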



