
1. Try process-level I/O, such as pipes, sockets, and the like. Have Linux deal with the concurrency problem, not you. (Note: the Bash & background job works in so many cases it ain't funny.) Also try fork/join parallelism models like OpenMP. These are all far easier than dipping down to a lower level.

2. Try a mutex.

3. If that doesn't work, try adding a condition variable (see the mutex + condition variable sketch after this list).

4. If that still doesn't work, try an atomic in default sequentially consistent mode or equivalent (ex: Java volatile, InterlockedAdd, and the like). Warning: atomics are very subtle. Definitely have a review with an expert if you are here.

5. If that still doesn't work, consider lock free paradigms. That is, combinations of atomics and memory barriers.

6. If that still doesn't work, publish a paper on your problem lol.
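
To make #2 and #3 concrete, here is a minimal sketch (C++, std::mutex + std::condition_variable; the producer/consumer names are made up, not from any particular project):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::mutex m;                       // step 2: a mutex protects the shared queue
    std::condition_variable cv;         // step 3: a condition variable signals "new data"
    std::queue<int> work;
    bool done = false;

    void producer() {
        for (int i = 0; i < 10; ++i) {
            { std::lock_guard<std::mutex> lock(m); work.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }

    void consumer() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !work.empty() || done; });
            if (work.empty()) return;          // done and drained
            int item = work.front();
            work.pop();
            lock.unlock();
            std::printf("got %d\n", item);     // process outside the lock
        }
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join();
        c.join();
    }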

---------

#1 is my most important piece of advice. There was a Blender render I was doing a few years ago, on something old like 2.6. Blender's parallelism wasn't too good and only utilized 25% of my computer.

So I ran 4 instances of headless Blender. Bam, 100% utilization. Done.

Don't overthink parallelism. It's stupid easy sometimes, as easy as a & on the end of your shell command.



The Oracle database has adopted process-level parallelism, utilizing System V IPC. Threading is used on Windows for performance reasons, but each client gets its own server pid by default on UNIX.

This architecture expresses the original design intentions of "Columbus UNIX."

"CB UNIX was developed to address deficiencies inherent in Research Unix, notably the lack of interprocess communication (IPC) and file locking, considered essential for a database management system... The interprocess communication features developed for CB UNIX were message queues, semaphores and shared memory support. These eventually appeared in mainstream Unix systems starting with System V in 1983, and are now collectively known as System V IPC."

This approach has realized some degree of success.

https://en.m.wikipedia.org/wiki/CB_UNIX


Postgres also uses a multi-process architecture. But I think that turned out to be a mistake for something like a database, on modern systems.

There are other reasons, but the biggest problem is that inter process context switches are considerably more expensive than intra process ones. Far less efficient use of the TLB being a big part of that. It used to be worse before things like process context identifiers, but even with them you're wasting a large portion of the TLB by storing redundant information.


Oracle still managed to be the TPC-C performance leader from 12/2010 until Oceanbase took the crown in 2019 (I don't think that Oracle cares anymore).

https://www.tpc.org/tpcc/results/tpcc_results5.asp?print=fal...

https://www.alibabacloud.com/blog/oceanbase-breaks-tpc-c-rec...

They did this with an SGA (Shared Global Area) that (I'm assuming) pollutes the Translation Lookaside Buffer (TLB) with different addresses for this shared memory in every process.


Yea, it's not something that's going to stop you dead in your tracks. But it does show up noticeably in profiles. There is stuff that you can do to reduce the pain, like using gigantic pages for the shared memory and remapping your executable's read-only sections, at runtime, so that you're using huge pages for your code. But even after that you can see the cost noticeably.
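
For what it's worth, a hedged sketch of the huge-page part on Linux (assumes the admin has already reserved huge pages; this is illustrative, not any particular database's code):

    // Minimal sketch: back a shared-memory region with huge pages (Linux),
    // so the many processes mapping it burn far fewer TLB entries.
    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        const size_t len = 1ull << 30;  // 1 GiB region (assumes huge pages are configured)
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        // ... place the shared buffer cache / SGA-like structures here ...
        munmap(p, len);
        return 0;
    }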


Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.


> Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.

Yep. Lots of architectural design out there based on IO latency being higher by an order of magnitude or two than now, while memory latency only shrank modestly.

To be clear, using one process with loads of threads has its own set of issues. Much easier to hit contention in the kernel, e.g. the infamous mmap_sem in linux.


Wouldn't you just have a process per core and avoid context switches that way? Or does it require a process pool with more processes than that because everything is I/O bound?


That's not a panacea either. You either need the client connection file descriptor in another process, or incur the overhead of marshalling the query + query results over IPC.


Interesting. Can't the different processes all listen for connections on the same socket (e.g. using EPOLL_EXCLUSIVE or somesuch)? I kind of agree with you though that if you need to have some kind of fine-grained coordination between processes using shared memory then the failure of one process could still leave the shared memory in a bad state. I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance. But I am really keen to understand if you actually lose anything in terms of performance.


> I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance.

That's an often cited advantage, and I think it's largely bogus. In postgres the only process that benefits from that is the "supervisor" process ("postmaster"). If any of the other processes crash / exit unexpectedly, all others are restarted as well, and the database is restarted via crash recovery. The supervisor could stay a distinct process even when otherwise using threads. (And no, restarting in systemd etc doesn't achieve the same, we continue to hold onto sockets and shared memory across such crash restarts)

> But I am really keen to understand if you actually lose anything in terms of performance.

Much higher TLB hit rate - separate processes will have separate entries in the TLB, leading to lower hit rates.

Even aside from TLB handling, cross thread context switches are cheaper than cross process ones.

Ability to dynamically change the amount of shared memory, without needing to play tricks with pre-reserving memory ranges, or dealing with relative pointers, makes it easier to improve performance in a lot of areas.

Not needing file descriptors open N times makes things cheaper.

Ability to easily hand off file descriptors from one thread to another makes it easier to achieve higher CPU utilization.


> utilizing System V IPC

Hmm, that's a bit more complex than what I'd put at #1. I'd probably put System V IPC closer to #2 ("use a mutex") levels of complications.

System V Shared memory + Semaphores is definitely "as complicated" as pthread mutexes and semaphores.

But messages, signals, pipes, and other process-level IPC is much simpler. I guess SystemV IPC exists for that shady region "between" the high level stuff, and the complex low-level mutexes / semaphores.

Maybe "1.75", if I were to put it in my list above somewhere. Closer to Mutexes in complexity, but still simpler in some respects. Depends on what bits of System V IPC you use, some bits are easier than others.

---------

The main benefit of processes is that startup and shutdown behavior is very well defined. So something like a pipe, mmap, and other I/O has a defined beginning and end. All sockets are closed() properly, and so forth.

SystemV throws a monkey wrench into that, because the semaphore or shared memory is "owned by Linux", so to speak. So a semaphore taken with sem_wait() is not necessarily going to be released by a matching sem_post(), especially if a process dies in a critical region.
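
(For completeness, a hedged sketch of the SEM_UNDO escape hatch, which is System V's answer to exactly that: the kernel applies the semaphore adjustment even if the process is killed inside the critical region. The key and path here are made up.)

    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <cstdio>

    union semun { int val; struct semid_ds* buf; unsigned short* array; }; // caller-defined on Linux

    int main() {
        key_t key = ftok("/tmp", 'S');                      // hypothetical key
        int semid = semget(key, 1, IPC_CREAT | 0600);
        if (semid < 0) { perror("semget"); return 1; }

        union semun arg; arg.val = 1;                       // binary semaphore, initially free
        semctl(semid, 0, SETVAL, arg);

        struct sembuf take = {0, -1, SEM_UNDO};             // "lock"; undone automatically on exit
        struct sembuf give = {0, +1, SEM_UNDO};             // "unlock"

        semop(semid, &take, 1);
        // ... critical region: even SIGKILL here won't leave the semaphore stuck ...
        semop(semid, &give, 1);
        return 0;
    }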


#4 is a mistake. The fundamental primitive for multiprocessing is message passing, and release/acquire is just that: basically, release is send and acquire is receive. If you have to go lock-free, there are well-known patterns to communicate from one thread to another, and you should use those instead of just a sequentially consistent atomic.

The best solution, however, is just to split your data and use coarse-grained mutexes.
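
For the curious, a minimal sketch of the "release is send, acquire is receive" idea (C++, single producer and single consumer; purely illustrative):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int payload = 0;                       // plain data, not atomic
    std::atomic<bool> ready{false};        // the "mailbox flag"

    void sender() {
        payload = 42;                                  // write the message
        ready.store(true, std::memory_order_release);  // "send": publishes payload
    }

    void receiver() {
        while (!ready.load(std::memory_order_acquire)) // "receive": waits for the publish
            ;                                          // spin (fine for a sketch)
        std::printf("%d\n", payload);                  // guaranteed to see 42
    }

    int main() {
        std::thread t1(sender), t2(receiver);
        t1.join();
        t2.join();
    }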


> and you should use those instead of just a sequentially consistent atomic.

Ehhh... sometimes the best solution to the "bank account parallelism" problem is just:

    #include <stdatomic.h> // C11 atomics; in C++ this would be std::atomic<int>

    atomic_int bobs_bank_account_balance;

    // Thread#1
    bobs_bank_account_balance += 100; // Depositing $100 in a sequentially consistent way.

    // In Thread#2
    bobs_bank_account_balance -= 100; // Withdrawing $100 in a sequentially consistent way.
No reason to bring in acquire vs release barriers or anything more complex. Just... atomically add and atomically subtract as needed. Not all cases are this simple, but many cases are. So you might as well try this and see if it is good enough.

If not, then yeah, you move onto more complex paradigms. But always try the dumb and simple solutions first, before trying the harder stuff.

----------

This case is so common that it's even optimized in GPU programming. I've seen atomics like this become optimized into a prefix-sum routine by the compiler.

Yes, this means you can have thousands of GPU-threads / shaders performing atomic adds / subtracts in GPU-space, and the atomic will be surprisingly efficient.

The problem is that this paradigm doesn't always work. It takes skill to know when paradigms fail or succeed, and it's sometimes very subtle. (That's why I say: try this, but... speak with an expert when doing so.) There might be a subtle race condition. But in the cases where this works, absolutely program in this way.


The question is what the invariants are around those operations. It is rarely the case that you can get away with simple RMW operations, because they don't guarantee any invariants. Also, sequentially consistent RMW atomic operations don't order with non-sequentially consistent atomics (the exception being the seqcst fence) so it's hard to construct send/receive operations using seqcst atomics—if you can use them, chances are that even relaxed could be enough!

Going deeper into the atomic add example, are you sure that the cache line bouncing will not be an issue? can you perhaps make the code just update something that you already have exclusive access to, and sum multiple values when you do a read (hopefully it's rare, e.g. reading a statistic once a second)? So again the solution could be to use a mutex and split the data so that the mutex is mostly uncontended.


> Also, sequentially consistent RMW atomic operations don't order with non-sequentially consistent atomics

So just use sequentially-consistent atomics everywhere, unless otherwise needed.

_No one_ should be itching to touch that acquire/release paradigm unless you really have to. It's grossly more complex, and very few programmers understand it.

Acquire/release exists because it's necessary (ex: implementation of spinlocks/mutexes). But it's a tool no one should feel good about using; it's very low level, very subtle, and full of potential traps.

A good acquire/release lock-free algorithm or data structure is still Ph.D. thesis material these days. It's obscure, uncommon, and difficult to write. Don't do it unless you have to. And if you have to, try all the patterns that have been figured out already before innovating.

> Going deeper into the atomic add example, are you sure that the cache line bouncing will not be an issue?

Do you mean false sharing?

False sharing is a performance issue. Your code will be correct, albeit slower. That's fine. Furthermore, acquire/release doesn't do anything to solve false sharing, you need to change your memory layout so that objects are on different cache lines.

> So again the solution could be to use a mutex and split the data so that the mutex is mostly uncontended.

We're only at "#4" because "#1, #2, and #3 have failed". If you can solve things with a mutex, slap down a mutex and call it done. Don't reach for the more complex tools unless necessary.


Yes, if you can make do with a single atomic-sized object, you can perform any RMW on it with either a CAS loop or a special-cased atomic operation (like add or subtract) and not need any further synchronization. What the GP commenter described as being potentially dangerous and needing expert knowledge is going in any way beyond that. It's really easy to e.g. trigger ABA problems and other issues without realizing it. So just use a mutex to synchronize access instead.


OpenMP is so dead simple it's insane. Had a class on Parallel Computing (mainly for supercomputers / scientific computing) and while at the beginning I thought it'd be super hard, in the end it was just slapping #pragma omp parallel on everything.
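
Roughly the whole trick, as a minimal sketch (compile with -fopenmp; the loop body is made up):

    // Minimal OpenMP sketch: one pragma parallelizes an independent loop.
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000000;
        std::vector<double> a(N, 1.0), b(N);

        #pragma omp parallel for            // each iteration may run on a different thread
        for (int i = 0; i < N; ++i)
            b[i] = 2.0 * a[i] + 1.0;        // iterations are independent; no overlapping writes

        std::printf("%f\n", b[N - 1]);
        return 0;
    }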


There is also: Try factoring out pure functions and run those in parallel.


This is my favourite technique.

Keep state in messages. Keep functions pure.

There can be drawbacks but for most types of tasks this works very well.


This sounds interesting. Can you share an example piece of code?


The general idea is to have a worker/handler which has one or more pure functions to perform some piece of work.

This worker should be stateless. Or if it does need state, you should put it into a centralized location which supports concurrent access (e.g. a DB, though technically no longer pure).

Your orchestration script gives tasks to your workers and collects the results. The key idea is you can horizontally scale the number of workers because they are stateless.

A simple example of this might be an orchestrator script that creates a bunch of tasks (e.g. adding two numbers) and pushes them to an input queue. The workers take a task off the queue and push the result onto the result queue. The orchestrator takes the results off and aggregates them.

A more complex example is a webserver communicating through RPCs.
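
Not the literal code, but a minimal C++ sketch of that orchestrator/worker shape with an in-process queue (all names and types here are made up for illustration):

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    template <typename T>
    class Channel {                       // a tiny thread-safe queue
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> g(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> g(m_); closed_ = true; }
            cv_.notify_all();
        }
        std::optional<T> pop() {          // blocks; std::nullopt means "closed and drained"
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return std::nullopt;
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    struct Task { int a, b; };            // "add two numbers"

    int main() {
        Channel<Task> tasks;
        Channel<int> results;

        // Stateless workers: apply a pure function to each task.
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&] {
                while (auto t = tasks.pop()) results.push(t->a + t->b);
            });

        // Orchestrator: create tasks, then collect and aggregate the results.
        for (int i = 0; i < 100; ++i) tasks.push({i, i});
        tasks.close();

        long sum = 0;
        for (int i = 0; i < 100; ++i) sum += *results.pop();
        for (auto& w : workers) w.join();
        std::printf("sum = %ld\n", sum);
        return 0;
    }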


Not GP but I think the fundamental idea is to think about "who calls who" and have pure functions be called/scheduled by a stateful coordination mechanism.

Usually queues/channels/event loops are involved at the top level especially if you're doing async IO. If you're doing parallel computation then you'd probably use fan-in/fan-out/waitgroup logic at the top that calls into pure functions. The general point is "only this piece of code worries about coordinating state".

For async IO think Go's concurrency primitives, Clojure's core.async or Erlang/Elixir as examples of coordinating state through messages.

For parallelism specifically look at this pretty cool rust crate:

https://lib.rs/crates/rayon

Specifically at the discussion in the FAQ to get a sense of why it's useful to push mutability to the top level:

https://github.com/rayon-rs/rayon/blob/master/FAQ.md

As a counterpoint I'd like to link to a discussion about FP where Martin Thompson wrote:

https://groups.google.com/g/mechanical-sympathy/c/gSkbc3grzN...

> Most FP persistent data structures take the form of a tree and perform a path copy from leaf to root for the change, reusing as much of the tree as possible. What this results in is a single contention point at the root of the tree for CAS operations. Now expand this view to a larger domain model of many entities, e.g. Customers, Orders, Products, etc., all interrelated and that model needs one or more entry points. Each of these entry points become a contention hotspot as the model is mutated. Also with trees and linked lists being the underlying data structures, performance is limited due to indirection causing cache misses.

So really it matters a lot what you're doing. FP is not a panacea for concurrency and parallelism. As an application programmer that does mostly IO and very little heavy calculation etc. it's for example great.


> > Most FP persistent data structures take the form of a tree and perform a path copy from leaf to root for the change, reusing as much of the tree as possible. What this results in is a single contention point at the root of the tree for CAS operations. Now expand this view to a larger domain model of many entities, e.g. Customers, Orders, Products, etc., all interrelated and that model needs one or more entry points. Each of these entry points become a contention hotspot as the model is mutated. Also with trees and linked lists being the underlying data structures, performance is limited due to indirection causing cache misses.

But the root of a tree is not mutated. Not mutating anything in the whole tree is the whole point. It will be there, until garbage collected (since no one has a reference to it any longer, but only to other versions / other roots). Not sure I am understanding what he is saying correctly.

Maybe he is talking about the kind of "reduce step" which one has to deal with, when all the parallel running functions return results and how to use the results? They might make a new tree-like thing. Sometimes there is no way around a sequential section in the program, since the problem is inherently sequential. But I don't see the contention hotspots.


>Try process level I/O, such pipes, sockets, and the like.

This.

> Have Linux deal with the concurrency problem, not you.

Not just Linux. We did this with our Windows app rewrite. IPC with pipes is fast as hell, just works, and it greatly simplified parallelism for us.


But why do you need a separate process? You can do the same with threads and queues. The only advantage I can think of is sandboxing, i.e. preventing a misbehaving task from taking down your whole app.


Ease of development. Ease of maintenance. Separation of concerns. Just to name a few. For example, one service is for hardware communication. Its job is to communicate with all the various devices, and forward those messages out.

>You can do the same with threads and queues.

Yes, but as OP up the chain mentions, you can simplify by letting the OS handle some of that.


> Ease of development. Ease of maintenance. Separation of concerns.

I'm not convinced. Managing processes is much more complex than managing threads, particularly if the code should be cross-platform. You need a very good library to hide away all the nasty OS-specific details. Same goes for pipes or sockets. Then there is the whole issue of message (de)serialization. I don't see how this can possibly be easier than starting a thread and communicating with a concurrent queue.

I mean, subprocesses certainly have their use cases, but I would never see them as a drop-in replacement for threads.

> you can simplify by letting the OS handle some of that.

Processes and threads are both OS resources. For the OS scheduler they are practically equivalent.


Consider: what is the easiest way to parallelize GCC to a 32 core system?

Answer: make -j128.

Way easier than trying to make every data structure inside of the compiler into a parallel programming model.

--------

Process level parallelism is a higher level of thinking. As long as you have spare RAM (and let's be frank, we all have 32GB+ sitting around these days), you can add more and more processes to solve your problem.

If you are talking data structures and mutexes, you are working at a far more complex layer than what I'm saying for #1.

----

Even completely single threaded code can be run in parallel in many cases in practice, because we have multiple files or other divisions of labor available at the process / user level.


> Consider: what is the easiest way to parallelize GCC to a 32 core system?

> Answer: make -j128.

> Way easier than trying to make every data structure inside of the compiler into a parallel programming model.

That's a false dichotomy. If the compiler program was implemented as a library, I would rather create 128 threads that each call compileSourceFile() than spawn 128 subprocesses that do the same thing.

The question you should be asking is: do I need my task to execute in a separate address space? If yes, spawn a subprocess; otherwise use a thread.


> I would rather create 128 threads that each call compileSourceFile() than spawn 128 subprocesses that do the same thing.

Would you rather write "compileSourceFile()" in a reentrant way (ie: no global variables, no static variables, guaranteed reentrancy, and other such requirements to work in a typical pthread manner)... or would you rather have processes where all those things are fine and not bugs?

The minute you start up threads with implicitly shared memory spaces is the minute the "singleton pattern" suddenly grows complex.

> The question you should be asking is: do I need my task to execute in a separate address space? If yes, spawn a subprocess; otherwise use a thread.

On the contrary. Separate address spaces by default is far easier. Thread#45 going crazy due to buffer-overflows will demolish thread#25.

But process#45 with a buffer-overflow will not affect process#25.

I/O is also grossly simplified inside the process model. Closing out a process closes() all sockets, pipes, and file I/O automatically, no matter how the process dies. (Ex: Segfaults, kill -9, etc. etc. are all handled gracefully).

If one thread dies, for whatever reason, your program is extremely hosed. It's very difficult to reason about what state your multithreaded data structures are in.

Threads are far more efficient, yes. So if you need efficiency, use them. But most people in my experience are Python or PHP programmers (or other such high level language), where it is clear that performance isn't an issue.


> Would you rather write "compileSourceFile()" in a reentrant way (ie: no global variables, no static variables, guaranteed reentrancy, and other such requirements to work in a typical pthread manner)... or would you rather have processes where all those things are fine and not bugs?

Certainly the former. There is a good reason why you should avoid global state (if possible). I never found it to be particularly hard...

> On the contrary. Separate address spaces by default is far easier. Thread#45 going crazy due to buffer-overflows will demolish thread#25.

As I noted, sandboxing is a valid use case for subprocesses. But this is completely orthogonal to the topic of threads! A buffer overflow can do all sorts of crazy things even in a single-threaded environment and you can totally use sandboxing in sequential code.

> If one thread dies, for whatever reason, your program is extremely hosed.

If one thread crashes, the whole process dies. There is no consistency problem here.

> But most people in my experience are Python or PHP programmers

Ok, I have been rather thinking about languages with first-class threading support (C, C++, Rust, Java, C#, etc.). Most scripting languages do have very limited multi-threading support (or none at all). In Python, for example, it often isn't even possible to achieve CPU level parallelism with threads because of the GIL, so you have to use subprocesses for that.


> Certainly the former. There is a good reason why you should avoid global state (if possible). I never found it to be particularly hard...

It only takes one library call into a non-reentrant function to mess everything up in a threading environment. Things are mostly thread-safe today, but I still fall into the trap. It's not like we're double-checking our 3rd party libraries all the time.

In any case, the "Multithreaded Singleton" problem is devilishly difficult to write and full of subtleties. I disagree very strongly about threads making things easier. The singleton pattern is written incorrectly in almost every instance I've seen it in the wild. Global state is important in many cases and cannot be avoided, and threads absolutely complicate it even more than usual.

----------

But let's reverse things for a second. I've listed off multiple kinds of code that work in a process environment but not in a threading environment.

You haven't listed off what makes threads actually easier yet. You're saying "mutexes and queues" are easier, and I disagree. At a minimum, a pipe / FIFO performs a similar role as a queue and even has atomic-level guarantees (on read/writes of PIPE_BUF or smaller). Sockets provide client/server model as well.

What I can say for sure is that pthreads / mutexes / queues are _more efficient_ than pipes / FIFOs / etc. But "simplicity"? It's really not so difficult to read() or write() from a pipe.

> If one thread crashes, the whole process dies. There is no consistency problem here.

Are you sure?

Let's say I pthread_cancel() one of your threads. Is that cool?

Let's compare / contrast with kill(SIGKILL), which is still dangerous, but... the state of a process is far more consistent. All fds are closed (including pipes, files, and sockets). This has a side effect of cleaning up flock().

There's a couple of complications involving SystemV semaphores, but even this has been figured out with semaphore adjustment values (which are automatically applied upon process exit, even from a kill).

---------

If thread #45 accepts() a connection, then gets pthread_cancel()d, that socket will effectively be leaked and live forever (because there's no way for anyone else to close() that socket correctly). Especially if you have a PTHREAD_CANCEL_ASYNCHRONOUS flag, these sorts of things can be devilishly hard to debug.


> In any case, the "Multithreaded Singleton" problem is devilishly difficult to write and full of subtleties.

First off, global state does not necessarily require the singleton pattern.

But let's assume that we really need it, e.g. to create a global resource only on demand. This is how it's done in C++:

  class Foo {
  public:
      static Foo& getInstance() {
          // Since C++11, local static variable initialization is thread-safe!
          static Foo instance;
          return instance;
      }
  private:    
      Foo() {
          // expensive constructor
      }
  };
I mostly use C++ and various scripting languages, but judging from the examples in https://en.wikipedia.org/wiki/Double-checked_locking it seems like C#, Java and Go all have at least one simple and safe method to achieve this.

> Global state is important in many cases and cannot be avoided, and threads absolutely complicate it even more than usual.

If you depend on global state in the parent process, how can the subprocesses even operate, since they do not have access to that state? Yes, certain resources, such as loggers, need to be global, but these should really be thread-safe anyway.

> You haven't listed off what makes threads actually easier yet.

* creating and joining threads (in a portable way!) is trivial in most programming languages; creating and joining subprocesses not so much
* exchanging data much easier, no marshalling needed
* logging is much easier
* error handling is much easier
* debugging is much easier. (How do you debug a short lived subprocess? The process terminates before the debugger even has a chance to attach to it.)

> You're saying "mutexes and queues" are easier, and I disagree.

Maybe I was not clear, I was really talking about concurrent queues. The producer can push messages, the consumer waits on messages and processes them. It works basically like a pipe/FIFO, but without the pitfalls.

> > If one thread crashes, the whole process dies. There is no consistency problem here.
>
> Are you sure?
>
> Let's say I pthread_cancel() one of your threads. Is that cool?

I was talking about threads crashing.

> If thread #45 accepts() a connection, then gets pthread_cancel()d,

pthread_cancel() is dangerous, I agree. Generally, you should not use it. (I have never needed it.) There are much saner ways to "cancel" a thread, depending on your language. Often it's enough to periodically check a boolean flag.

> Lets compare / contrast with kill(SIGKILL), which is still dangerous but... the state of a process is far more consistent.

Yeah, it is easy to kill a subprocess. But let's consider the opposite: what happens if the parent process dies? How do you make sure that a long running subprocess automatically terminates? It is definitely not trivial.

---

It's fine if you like subprocesses. I just wanted to challenge the notion that they are somehow easier to use than threads - which just doesn't match my experience. We are probably working in entirely different domains, so it's natural that our experiences differ.


> it seems like C#, Java and Go all have at least one simple and safe method to achieve this.

It's only simple after you've studied double-checked locking. Initial attempts often lead to failure.

> If you depend on global state in the parent process, how can the subprocesses even operate, since they do not have access to that state?

Plenty of ways to get access. The #1 way is probably to use a database to share that state in a concurrency-safe way. sqlite3 works, though postgresql is more scalable.

There are also solutions that require fewer resources: flock() a file and then read the shared state in a manner that's cohesive across processes. If your process dies while flock()ing something, the flock() automatically releases itself (unlike a mutex, where if a thread is pthread_cancel()ed you could very well have a permanently locked mutex).

That's why so many systems have a database + dedicated process that handles this kind of shared global state between processes.
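
A minimal sketch of the flock() approach, assuming the shared state is just a small counter in a file (path and format are made up):

    // Shared state lives in a small file; the kernel releases the lock
    // automatically if the process dies while holding it.
    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        int fd = open("/tmp/shared_counter", O_RDWR | O_CREAT, 0644);  // hypothetical path
        if (fd < 0) { perror("open"); return 1; }

        if (flock(fd, LOCK_EX) != 0) { perror("flock"); return 1; }    // blocks until exclusive

        char buf[32] = {0};
        long counter = 0;
        if (pread(fd, buf, sizeof buf - 1, 0) > 0) counter = atol(buf);
        counter += 1;                                                  // the "critical section"
        int n = snprintf(buf, sizeof buf, "%ld\n", counter);
        pwrite(fd, buf, n, 0);

        flock(fd, LOCK_UN);   // optional: close()/exit releases it anyway, even on SIGKILL
        close(fd);
        return 0;
    }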

> Yes, certain resources, such as loggers, need to be global, but these should really be thread-safe anyway.

Loggers are an excellent example of where opening up a pipe or socket to syslogd is far easier than trying to shoehorn in a mutex+queue across threads.

> Yeah, it is easy to kill a subprocess. But let's consider the opposite: what happens if the parent process dies? How do you make sure that a long running subprocess automatically terminates? It is definitely not trivial.

When the session leader of a process group terminates, SIGHUP is sent to the foreground process group. Catch the signal, then terminate.

So once again: processes handle both situations (parent dies, kill children. Or children die, notify parent), with SIGHUP and SIGCHLD respectively.

No such signaling exists in pthread_blah world. You're (trying to) argue about the "superiority" of threads when processes have all of these issues 100% figured out, while the pthread-world is completely ignorant to these issues.
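
For concreteness, a hedged sketch of the SIGCHLD half (fork a child, reap it from a handler; error handling omitted):

    #include <sys/wait.h>
    #include <unistd.h>
    #include <csignal>
    #include <cstdio>

    static void on_sigchld(int) {
        int status;
        while (waitpid(-1, &status, WNOHANG) > 0)   // reap every finished child
            ;                                       // (no printf in a signal handler)
    }

    int main() {
        std::signal(SIGCHLD, on_sigchld);

        pid_t child = fork();
        if (child == 0) {            // child: do some work, then exit
            sleep(1);
            _exit(0);                // all of its fds are closed by the kernel
        }
        pause();                     // parent: interrupted when SIGCHLD arrives
        std::puts("child exited; resources reclaimed");
        return 0;
    }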


> The #1 way is probably to use a database to share that state in a concurrency-safe way. sqlite3 works, though postgresql is more scalable.

> There are also solutions that require less resources: flock() a file and then read the shared state in a manner that's cohesive across processes.

Wow. And that is somehow easier than using, say, a concurrent collection? (I have never used a database in my life, so we obviously come from very different angles :-)

> (unlike mutexes where if pthread_cancelled() you could very well have a permanently locked mutex).

Again, there is almost never a good reason to use pthread_cancel() in the first place.

> opening up a pipe or socket to syslogd

In my projects, logging means "print to stderr or write to a file" :-)

> If the parent of a process group is terminated, SIGHUP is sent to its children. Catch the signal then terminate.

So you need to make a process group... What if your code should be cross platform? Do you know how to do this on Windows?

> No such signaling exists in pthread_blah world

This kind of signaling does not exist because it is not necessary. Tasks either periodically check a boolean flag or get notified via the queue itself. Here is a random simple example: https://openframeworks.cc/documentation/utils/ofThreadChanne....

Also, I'm not sure why you keep talking about pthreads... Modern programming languages have their own (portable) threading abstractions.
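
To make the flag pattern concrete, a minimal C++ sketch (std::atomic<bool> + std::thread; names are made up):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> stop_requested{false};

    void worker() {
        while (!stop_requested.load(std::memory_order_relaxed)) {
            // ... do one unit of work ...
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        std::puts("worker: cleaned up and exiting");   // destructors run normally
    }

    int main() {
        std::thread t(worker);
        std::this_thread::sleep_for(std::chrono::seconds(1));
        stop_requested.store(true);   // cooperative "cancel"
        t.join();                     // graceful shutdown, no pthread_cancel needed
    }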


> Wow. And that is somehow easier than using, say, a concurrent collection? (I have never used a database in my life, so we obviously come from very different angles :-)

When it comes to understanding locks, concurrency, and parallelism with shared data between threads... yeah. In my experience, the database is heavyweight, but it absolutely ensures that most issues are taken care of.

But as I stated earlier: lighter-weight solutions, such as flock(), exist for a reason. If the database is too heavy (note: sqlite3 is extremely lightweight, so I bet it works for most cases), then flock()ing a file and reading/writing to it works too.

What flock() gets you is 100% certainty about cleanup in strange exit cases, which threads do not give you. A flock() always cleans itself up on process termination. No such guarantees about mutexes on thread cancellations or other thread-related issues (I dunno, the OOM killer).

> Again, there is almost never a good reason to use pthread_cancel() in the first place.

On the contrary. The pthread community knows that pthread_cancel() is poorly behaved and constantly tells beginner programmers not to use it.

"There's no good reason" because everyone knows that the number of traps in using that function are legion. Its never worthwhile to use that function because it just leads to severely buggy behavior in practice.

> So you need to make a process group... What if your code should be cross platform? Do you know how to do this on Windows?

... you know that Win32 doesn't support pthreads, right? And C++ std::thread doesn't support anything that we've talked about either.

To answer your question: Win32 job objects. Every reasonable modern OS supports the concept of sessions (Windows just calls them job objects instead). The OS-level (be it session leaders / session groups in Linux, or Job objects in Windows) is the correct solution to this problem.

EDIT: Found the function name for ya: https://learn.microsoft.com/en-us/windows/win32/api/jobapi2/...
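
A hedged sketch of that pattern (kill the children when the last handle to the job goes away; "child.exe" is a placeholder):

    #include <windows.h>

    int main() {
        HANDLE job = CreateJobObjectW(nullptr, nullptr);

        JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {};
        info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
        SetInformationJobObject(job, JobObjectExtendedLimitInformation, &info, sizeof(info));

        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi = {};
        wchar_t cmd[] = L"child.exe";                  // hypothetical child program
        CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE,
                       CREATE_SUSPENDED, nullptr, nullptr, &si, &pi);
        AssignProcessToJobObject(job, pi.hProcess);    // child now dies with the job
        ResumeThread(pi.hThread);

        WaitForSingleObject(pi.hProcess, INFINITE);
        return 0;
    }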

> This kind of signaling does not exist because it is not necessary. Tasks either periodically check a boolean flag or get notified via the queue itself. Here is a random simple example: https://openframeworks.cc/documentation/utils/ofThreadChanne....

> Also, I'm not sure why you keep talking about pthreads... Modern programming languages have their own (portable) threading abstractions.

And you damn well know that under a thread-kill or thread-cancel scenario, this code stops functioning. Meanwhile, all the code I talked about above will function correctly even in the worst-case "kill -9" SIGKILL.

Whatever underlying synchronization that channel is using to synchronize thread access will stop working if a pthread_mutex_lock() is called, but its corresponding pthread_mutex_unlock() fails to be called due to cancellation or other issue.


> On the contrary. The pthread community knows that pthread_cancel() is poorly behaved and constantly tells beginner programmers not to use it.

Now tell me: how should it behave?

> Its never worthwhile to use that function because it just leads to severely buggy behavior in practice.

Well yeah. That's why we don't use it. There is no possible sane way to implement this safely. But that's not the point. Cancelling a thread is like pulling the plug on your PC. It's obviously not the right way to stop a thread. What you should do - and everybody is doing in practice - is telling the code in the thread to return. Just like you ask your OS to gracefully shutdown your PC. In the example I gave above (ofThreadChannel) this is as simple as calling the close() method.

(In the same way, sending SIGKILL isn't the proper way to stop a subprocess either. It is the last resort.)

> ... you know that Win32 doesn't support pthreads, right?

There are in fact pthread implementations/wrappers for Windows, e.g. libwinpthread from the mingw64 project. pthreads is short for POSIX threads, it is not tied to a particular OS.

> And C++ std::thread doesn't support anything that we've talked about either.

You mean something like pthread_cancel? Of course it does not. We do not need it.

> To answer your question: Win32 job objects.

The question was more rhetorical... but thanks :-)

> And you damn well know that under a thread-kill or thread-cancel scenario, this code stops functioning. While all the code I talked above will function correctly even in the worst-case "kill -9" SIGKILL.

You are right that interrupting a thread can be disastrous. However, I never ever needed to do this... and I have written a lot of multi-threaded code.

> Whatever underlying synchronization that channel is using to synchronize thread access will stop working if a pthread_mutex_lock() is called, but its corresponding pthread_mutex_unlock() fails to be called due to cancellation or other issue.

What other issues? Never had this problem... neither do all the heavily multi-threaded programs I use daily.

You somehow try to paint threads as fragile based on some obscure pthreads feature that almost nobody uses in practice...


> There are in fact pthread implementations/wrappers for Windows, e.g. libwinpthread from the mingw64 project. pthreads is short for POSIX threads, it is not tied to a particular OS.

I've had poor experience with MinGW's implementation of pthreads. I've always preferred Windows' native threads instead if I were on Win32. Yes, it means rewriting pthread_create code from Linux into CreateThread in Win32, but it's better than the alternative.

There's just a whole bunch of "Win32-isms" that don't really make sense with how Linux assumes pthreads to work.

I'm glad that C++ std::thread exists now and is my preference these days.

> You somehow try to paint threads as fragile based on some obscure pthreads feature that almost nobody uses in practice...

Tell me. Do you use RAII?

All I'm trying to point out is that processes are RAII for _almost every known resource_ in your program.

No need to "pthread_cleanup_push" or pop those cleanup handlers. No need to figure out where pthreads could fail under cancellation points or other such obscure error conditions.

When a process exits, all FDs are closed, SIGHUP / SIGCHLD are sent to the awaiting processes as expected, and all sorts of well specified cleanup occurs.

In pthread-land, it's 100% manual. You have to identify every single case and properly handle them (and properly pthread_cleanup_push) to RAII the codebase into a clean state.

-------

If you've never had such cleanup issues in multithreaded code... then I hope you never come across it. But in my experience, the additional "free RAII" factor of Linux (or Windows) that is given to a full process (instead of a Thread) really improves the reliability of my programs.

So unless I have to use Threads (and I'm very well acquainted with the tools available in Thread space), I prefer using those RAII-like process cleanup functions.

God forbid an exception is thrown while another exception is already in flight in one of your threads and causes a termination condition you weren't expecting (std::terminate), or some other obscure case happens. The "free cleanup" on processes handles these sorts of obscure deaths much better than threads do.


> No need to "pthread_cleanup_push" or pop those cleanup handlers. No need to figure out where pthreads could fail under cancellation points or other such obscure error conditions.

Well, I haven't needed any of these, simply because nobody cancels my threads :-)

> When a process exits, all FDs are closed, SIGHUP / SIGCHLD are sent to the awaiting processes as expected, and all sorts of well specified cleanup occurs.

By default, the subprocess just terminates. Yes, any OS resources are eventually returned, but my C++ destructors do not run. I wouldn't call this "specified cleanup". For proper cleanup you would first have to install a signal handler and figure out a way how to make the code in the main thread return gracefully. A bit similar to how we stop threads, but much more complicated and non-portable. (A better way to stop subprocesses is to send a message, assuming we have a pipe or socket.)

> So unless I have to use Threads (and I'm very well acquainted with the tools available in Thread space), I prefer using those RAII-like process cleanup functions.

And I'm perfectly fine with standard C++ RAII classes. std::ostream doesn't care whether it is created and destroyed on the main thread or some auxiliary thread.

> God forbid an exception has an exception inside of itself

What?

Also, what does this have to do with threads? Yes, programs can crash in various ways, but it does not matter in which thread this happens. If any thread crashes, the whole program crashes.


When I call "wait" on a process, what do I know about it? I know that all fds have closed, all sockets have closed, etc. etc. A hell of a lot more than the thread example.

When I call pthread_join() on a thread, what do I know about it? Nothing. You have to assume it cleaned itself up correctly.

That's all I'm trying to point out. There are literally no assurances on pthread_join() or pthread terminations or pthread cancellations. None at all.

----------

Whether or not you wish to take advantage of this is your choice. If you want to just presume that pthreads always clean themselves up without any issues or subtle threading bugs affecting the other threads... sure. Or that we're perfect programmers who never make such mistakes...

But I have learned time and time again: if it's not me who makes such a mistake, then maybe it's a coworker touching the codebase. Having 100% assurances (like the fd closures) on processes is something I can absolutely build code around, without assuming the internal state or "correctness" of other people's code (including my own).

> And I'm perfectly fine with standard C++ RAII classes.

Then you should be comfortable with process cleanups and how to code around it to form strong guarantees of correctness. Enforced by Linux (even stronger than the compiler).

kill and pthread_cancel may prevent C++ destructors from running. But even kill -9 will correctly clean up processes (including the SIGCHLD signals and other such details).

Is it perfect? Of course not. But it's one more assurance you lose when you go for threads instead of processes.


Sorry for being so insistent.

> When I call pthread_join() on a thread, what do I know about it? Nothing. You have to assume it cleaned itself up correctly.

I don't really understand your point. A thread just runs a function. If that function leaks resources, that's a problem with that particular function, not with threads in general. It would be just as problematic when running sequentially.

If you are worried about a particular function or module, sure, you can run it in isolation in a subprocess. But this is completely orthogonal to the topic of parallelism or threads. Just because something runs as a subprocess does not mean it will execute in parallel, it depends on how the subprocess communicates with the parent process.

I mean, if you prefer subprocesses, that's fine, but don't sell them as some kind of silver bullet.

> kill and pthread_cancel may prevent C++ destructors from running.

Sigh. As I said again and again, you wouldn't use pthread_cancel() in the first place. It's like complaining that setjmp() breaks your C++ code. Also, SIGKILL will terminate the whole process, unless caught by a signal handler.

> But even kill -9 will correctly clean up processes

Yes, it will release any OS resources, but that alone is not "correct clean up". For example, C++ destructors won't run. Sending SIGKILL is like pulling the plug on your desktop, you only do it if nothing else works.


> I mean, if you prefer subprocesses, that's fine, but don't sell them as some kind of silver bullet.

I've listed 5 different multithreaded techniques at the beginning of this discussion.

1. Processes
2. Mutexes
3. Condition Variables
4. Seq-Cst Atomics
5. Atomics + Memory Barriers

If you've thought I'm listing silver bullets, you're severely mistaken. I'm pointing out that #1 is easier than #2, #2 is easier than #3, and #3 tends to be easier than #4. Prefer easier solutions over complex solutions.

That's all I'm trying to point out. Of course there's no silver bullets. But there are "preferred" solutions. Generally speaking, easier solutions vs harder solutions.

---------

Processes are easier because they're (partially) isolated. In contrast, threads have no isolation. Heck, one can argue that VMs and Docker are also responses to this isolation issue.

If you're unable to see why that's easier, I dunno how else to explain it. I've given it a bunch of posts.


> I've listed 5 different multithreaded techniques at the beginning of this discussion.

Unsurprisingly, I have a problem with that list as well. It mixes things that belong to different categories. A process is an execution context while the rest are synchronization mechanisms. For example, it is possible to synchronize two processes with atomic variables (in shared memory). Also, the list misses any kind of higher level threading primitive (concurrent queues, channels, futures, etc.). There is a whole world above explicit mutex locking.

> I'm pointing out that #1 is easier than #2.

I totally believe you when you say that subprocesses are the preferred solution for your particular use cases. It just does not make sense as general advice. It may be practical that subprocesses are isolated, but you cannot just downplay all the downsides.


> I'm not convinced.

Great! Because that's not what I'm trying to do. Figure out what works best for you. We did.

> particularly if the code should be cross-platform.

Ours, currently, does not.

> Then there is the whole issue of message (de)serialization. I don't see how this can possibly be easier than starting a thread and communicating with concurrent queue.

We develop in C#. All of this is available from MS.


> Figure out what works best for you. We did.

I did as well :-)


Level 0: use infra like Kafka, and eventing to replicas.


I bet someone else has that link at hand where someone does parallel processing in shell with a fraction of the memory and CPU.



