Is parallel programming hard, and, if so, what can you do about it? [pdf] (kernel.org)
146 points by eric_khun on Feb 19, 2023 | 91 comments


1. Try process-level I/O, such as pipes, sockets, and the like. Have Linux deal with the concurrency problem, not you. (Note: the Bash & background job works in so many cases it ain't funny). Also try fork/join parallelism models like OpenMP. These are all far easier than dipping down to a lower level.

2. Try a mutex.

3. If that doesn't work, try adding a condition variable. (A sketch of steps 2 and 3 follows this list.)

4. If that still doesn't work, try an atomic in default sequentially consistent mode or equivalent (ex: Java volatile, InterlockedAdd, and the like). Warning: atomics are very subtle. Definitely have a review with an expert if you are here.

5. If that still doesn't work, consider lock free paradigms. That is, combinations of atomics and memory barriers.

6. If that still doesn't work, publish a paper on your problem lol.
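
For steps 2 and 3, here is a minimal sketch in C++ (all names hypothetical): a mutex guards a shared counter, and a condition variable lets the main thread sleep until the workers are done.

    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex m;                  // step 2: protects 'pending'
    std::condition_variable cv;    // step 3: signals changes to 'pending'
    int pending = 2;               // shared state: workers still running

    void worker() {
        // ... do the actual work here ...
        std::lock_guard<std::mutex> lock(m);
        --pending;
        cv.notify_one();           // wake anyone waiting on 'pending'
    }

    int main() {
        std::thread t1(worker), t2(worker);
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return pending == 0; });  // sleeps, no busy-wait
        lock.unlock();
        t1.join();
        t2.join();
    }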

---------

#1 is my most important piece of advice. There was a Blender render I was doing a few years ago, on 2.6 or something similarly old. Blender's parallelism wasn't very good and only utilized 25% of my computer.

So I ran 4 instances of headless Blender. Bam, 100% utilization. Done.

Don't overthink parallelism. It's stupid easy sometimes, as easy as an & on the end of your shell command.


The Oracle database has adopted process-level parallelism, utilizing System V IPC. Threading is used on Windows for performance reasons, but each client gets its own server pid by default on UNIX.

This architecture expresses the original design intentions of "Columbus UNIX."

"CB UNIX was developed to address deficiencies inherent in Research Unix, notably the lack of interprocess communication (IPC) and file locking, considered essential for a database management system... The interprocess communication features developed for CB UNIX were message queues, semaphores and shared memory support. These eventually appeared in mainstream Unix systems starting with System V in 1983, and are now collectively known as System V IPC."

This approach has realized some degree of success.

https://en.m.wikipedia.org/wiki/CB_UNIX


Postgres also uses a multi-process architecture. But I think that turned out to be a mistake for something like a database on modern systems.

There are other reasons, but the biggest problem is that inter-process context switches are considerably more expensive than intra-process ones. Far less efficient use of the TLB is a big part of that. It used to be worse before things like process context identifiers, but even with them you're wasting a large portion of the TLB by storing redundant information.


Oracle still managed to be the TPC-C performance leader from 12/2010 until Oceanbase took the crown in 2019 (I don't think that Oracle cares anymore).

https://www.tpc.org/tpcc/results/tpcc_results5.asp?print=fal...

https://www.alibabacloud.com/blog/oceanbase-breaks-tpc-c-rec...

They did this with an SGA (System Global Area) that (I'm assuming) pollutes the Translation Lookaside Buffer (TLB) with different addresses for this shared memory in every process.


Yea, it's not something that's going to stop you dead in your tracks. But it does show up noticeably in profiles. There is stuff that you can do to reduce the pain, like using gigantic pages for the shared memory and remapping your executable's read-only sections at runtime so that you're using huge pages for your code. But even after that, the cost is noticeable.


Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.


> Right. One of several bits of architectural friction that didn’t matter while your large DBMS was I/O constrained.

Yep. Lots of architectural design out there based on IO latency being higher by an order of magnitude or two than now, while memory latency only shrank modestly.

To be clear, using one process with loads of threads has its own set of issues. It's much easier to hit contention in the kernel, e.g. the infamous mmap_sem in Linux.


Wouldn't you just have a process per core and avoid context switches that way? Or does it require a process pool with more processes than that because everything is I/O bound?


That's not a panacea either. You either need to get the client connection's file descriptor into another process, or incur the overhead of marshalling the query + query results over IPC.


Interesting. Can't the different processes all listen for connections on the same socket (e.g. using EPOLL_EXCLUSIVE or somesuch)? I kind of agree with you though that if you need to have some kind of fine-grained coordination between processes using shared memory then the failure of one process could still leave the shared memory in a bad state. I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance. But I am really keen to understand if you actually lose anything in terms of performance.


> I suppose you reduce the blast radius a little for memory corruption issues when programming in something like C, but these days if you are using a type safe language I'm not convinced it buys you much in terms of fault tolerance.

That's an often-cited advantage, and I think it's largely bogus. In postgres the only process that benefits from that is the "supervisor" process ("postmaster"). If any of the other processes crash / exit unexpectedly, all others are restarted as well, and the database is restarted via crash recovery. The supervisor could stay a distinct process even when otherwise using threads. (And no, restarting via systemd etc. doesn't achieve the same; we continue to hold onto sockets and shared memory across such crash restarts.)

> But I am really keen to understand if you actually lose anything in terms of performance.

Much higher TLB hit rate: separate processes will have separate entries in the TLB, leading to lower hit rates.

Even aside from TLB handling, cross-thread context switches are cheaper than cross-process ones.

Ability to dynamically change the amount of shared memory, without needing to play tricks with pre-reserving memory ranges, or dealing with relative pointers, makes it easier to improve performance in a lot of areas.

Not needing file descriptors open N times makes things cheaper.

Ability to easily hand off file descriptors from one thread to another makes it easier to achieve higher CPU utilization.


> utilizing System V IPC

Hmm, that's a bit more complex than what I'd put at #1. I'd probably put System V IPC closer to #2 ("use a mutex") levels of complications.

System V Shared memory + Semaphores is definitely "as complicated" as pthread mutexes and semaphores.

But messages, signals, pipes, and other process-level IPC is much simpler. I guess SystemV IPC exists for that shady region "between" the high level stuff, and the complex low-level mutexes / semaphores.

Maybe "1.75", if I were to put it in my list above somewhere. Closer to Mutexes in complexity, but still simpler in some respects. Depends on what bits of System V IPC you use, some bits are easier than others.

---------

The main benefit of processes is that startup and shutdown behavior is very well defined. So something like a pipe, mmap, and other I/O has a defined beginning and end. All sockets are closed() properly, and so forth.

SystemV throws a monkey wrench into that, because the semaphore or shared memory is "owned by Linux", so to speak. So a sem_wait() is not necessarily going to be matched by its sem_post(), especially if a process dies in a critical region.
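
To be fair, System V semaphores have a partial answer to exactly this: the SEM_UNDO flag (the "semaphore adjustment values" mentioned elsewhere in this thread), which makes the kernel reverse a process's pending adjustments when it exits, however it exits. A minimal sketch, error handling omitted:

    #include <sys/ipc.h>
    #include <sys/sem.h>

    union semun { int val; };  // Linux requires the caller to define this

    int main(void) {
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        union semun arg;
        arg.val = 1;
        semctl(semid, 0, SETVAL, arg);           // binary semaphore, starts at 1

        struct sembuf down = {0, -1, SEM_UNDO};  // "P": enter the critical region
        semop(semid, &down, 1);
        /* ... critical region: if we die right here, the kernel
           applies the undo adjustment and releases the semaphore ... */
        struct sembuf up = {0, +1, SEM_UNDO};    // "V": leave the critical region
        semop(semid, &up, 1);

        semctl(semid, 0, IPC_RMID);              // it's "owned by Linux": remove it
        return 0;
    }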


4 is a mistake. The fundamental primitive for multiprocessing is message passing, and release/acquire is just that: basically, release is send and acquire is receive. If you have to go lock-free, there are well-known patterns to communicate from one thread to another, and you should use those instead of just a sequentially consistent atomic.

The best solution, however, is just to split your data and use coarse-grained mutexes.


> and you should use those instead of just a sequentially consistent atomic.

Ehhh... sometimes the best solution to the "bank account parallelism" problem is just:

    #include <stdatomic.h> // C11 atomics; in C++ this would be std::atomic<int>

    atomic_int bobs_bank_account_balance;

    // Thread#1
    bobs_bank_account_balance += 100; // Depositing $100 in a sequentially consistent way.

    // In Thread#2
    bobs_bank_account_balance -= 100; // Withdrawing $100 in a sequentially consistent way.

No reason to bring in acquire vs release barriers or anything more complex. Just... atomically add and atomically subtract as needed. Not all cases are this simple, but many cases are. So you might as well try this and see if it is good enough.

If not, then yeah, you move onto more complex paradigms. But always try the dumb and simple solutions first, before trying the harder stuff.

----------

This case is so common that it's even optimized in GPU programming. I've seen atomics like this become optimized into a prefix-sum routine by the compiler.

Yes, this means you can have thousands of GPU-threads / shaders performing atomic adds / subtracts in GPU-space, and the atomic will be surprisingly efficient.

The problem is that this paradigm doesn't always work. It takes skill to know when paradigms fail or succeed, and it's sometimes very subtle. (That's why I say: try this, but... speak with an expert when doing so.) There might be a subtle race condition. But in the cases where this works, absolutely program in this way.


The question is what the invariants are around those operations. It is rarely the case that you can get away with simple RMW operations, because they don't guarantee any invariants. Also, sequentially consistent RMW atomic operations don't order with non-sequentially consistent atomics (the exception being the seqcst fence) so it's hard to construct send/receive operations using seqcst atomics—if you can use them, chances are that even relaxed could be enough!

Going deeper into the atomic add example, are you sure that the cache line bouncing will not be an issue? Can you perhaps make the code just update something that you already have exclusive access to, and sum multiple values when you do a read (hopefully it's rare, e.g. reading a statistic once a second)? So again the solution could be to use a mutex and split the data so that the mutex is mostly uncontended.
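
To make the "split the data" idea concrete, here is a minimal sketch (all names hypothetical): each thread updates its own cache-line-sized shard, and only the rare reader pays to sum them.

    #include <atomic>
    #include <cstdint>

    constexpr int kShards = 16;          // assumption: >= number of threads

    struct alignas(64) Shard {           // 64 = typical cache line size
        std::atomic<int64_t> value{0};
    };
    Shard shards[kShards];

    void add(int tid, int64_t n) {       // hot path: no cache-line bouncing
        shards[tid % kShards].value.fetch_add(n, std::memory_order_relaxed);
    }

    int64_t total() {                    // cold path: e.g. a once-a-second stat
        int64_t sum = 0;
        for (const Shard& s : shards)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }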


> Also, sequentially consistent RMW atomic operations don't order with non-sequentially consistent atomics

So just use sequentially-consistent atomics everywhere, unless otherwise needed.

_No one_ should be itching to touch that acquire/release paradigm unless you really have to. It's grossly more complex, and very few programmers understand it.

Acquire/release exists because it's necessary (ex: the implementation of spinlocks/mutexes). But it's a tool no one should feel good about using; it's very low level, very subtle, and full of potential traps.

A good acquire/release lock-free algorithm or data structure is still Ph.D. thesis-level material these days. It's obscure, uncommon, and difficult to write. Don't do it unless you have to. And if you have to, try all the patterns that have been figured out already before innovating.

> Going deeper into the atomic add example, are you sure that the cache line bouncing will not be an issue?

Do you mean false sharing?

False sharing is a performance issue. Your code will be correct, albeit slower. That's fine. Furthermore, acquire/release doesn't do anything to solve false sharing, you need to change your memory layout so that objects are on different cache lines.

> So again the solution could be to use a mutex and split the data so that the mutex is mostly uncontended.

We're only at "#4" because "#1, #2, and #3 have failed". If you can solve things with a mutex, slap down a mutex and call it done. Don't reach for the more complex tools unless necessary.


Yes, if you can make do with a single atomic-sized object, you can perform any RMW on it with either a CAS loop or a special-cased atomic operation (like add or subtract) and not need any further synchronization. What the GP commenter described as being potentially dangerous and needing expert knowledge is going in any way beyond that. It's really easy to e.g. trigger ABA problems and other issues without realizing it. So just use a mutex to synchronize access instead.
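
For illustration, a minimal CAS-loop sketch on a single atomic (a hypothetical "never go below zero" withdrawal), which needs no synchronization beyond the atomic itself:

    #include <atomic>

    std::atomic<int> balance{100};

    bool withdraw(int amount) {
        int old = balance.load();
        do {
            if (old < amount)
                return false;  // would overdraw: give up
            // compare_exchange_weak reloads 'old' on failure, so we retry
        } while (!balance.compare_exchange_weak(old, old - amount));
        return true;
    }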


OpenMP is so dead simple it's insane. Had a class on Parallel Computing (mainly for super computers / scientific computing) and while at the beginning I thought it'd be super hard, in the end it was just slapping #pragma omp parallel on everything
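
For anyone who hasn't seen it, a minimal sketch of that experience (assuming a compiler flag like -fopenmp):

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 1000000;
        static double a[1000000];
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  // the magic line
        for (int i = 0; i < n; ++i) {
            a[i] = i * 0.5;  // iterations are spread across all cores
            sum += a[i];
        }
        std::printf("sum=%f, max threads=%d\n", sum, omp_get_max_threads());
    }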


There is also: Try factoring out pure functions and run those in parallel.


This is my favourite technique.

Keep state in messages. Keep functions pure.

There can be drawbacks but for most types of tasks this works very well.


This sounds interesting. Can you share an example piece of code?


The general idea is to have a worker/handler which has one or more pure functions to perform some piece of work.

This worker should be stateless. Or, if it does need state, you should put it into a centralized location which supports concurrent access (e.g. a DB; technically no longer pure).

Your orchestration script gives tasks to your workers and collects the results. The key idea is that you can horizontally scale the number of workers because they are stateless.

A simple example of this might be an orchestrator script that creates a bunch of tasks (e.g. adding two numbers) and pushes them to an input queue. The workers take a task off the queue and push the result onto a result queue. The orchestrator takes the results off and aggregates them.
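
A minimal in-process sketch of that pattern (all names hypothetical; the same shape works across machines with a message broker or library queue instead of this hand-rolled one):

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>
    #include <vector>

    template <typename T>
    class Channel {                      // minimal concurrent queue
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
            cv.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> l(m); closed = true; }
            cv.notify_all();
        }
        std::optional<T> pop() {         // blocks; empty optional = closed+drained
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !q.empty() || closed; });
            if (q.empty()) return std::nullopt;
            T v = std::move(q.front()); q.pop();
            return v;
        }
    };

    int main() {
        Channel<std::pair<int, int>> tasks;
        Channel<int> results;
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&] {   // stateless worker: a pure function
                while (auto t = tasks.pop()) results.push(t->first + t->second);
            });
        for (int i = 0; i < 100; ++i) tasks.push({i, i});  // orchestrator: fan out
        tasks.close();
        long total = 0;
        for (int i = 0; i < 100; ++i) total += *results.pop();  // fan in
        for (auto& w : workers) w.join();
        std::cout << total << "\n";      // 9900
    }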

A more complex example is a webserver communicating through RPCs.


Not GP but I think the fundamental idea is to think about "who calls who" and have pure functions be called/scheduled by a stateful coordination mechanism.

Usually queues/channels/event loops are involved at the top level especially if you're doing async IO. If you're doing parallel computation then you'd probably use fan-in/fan-out/waitgroup logic at the top that calls into pure functions. The general point is "only this piece of code worries about coordinating state".

For async IO think Go's concurrency primitives, Clojure's core.async or Erlang/Elixir as examples of coordinating state through messages.

For parallelism specifically look at this pretty cool rust crate:

https://lib.rs/crates/rayon

Specifically at the discussion in the FAQ to get a sense of why it's useful to push mutability to the top level:

https://github.com/rayon-rs/rayon/blob/master/FAQ.md

As a counterpoint I'd like to link to a discussion about FP where Martin Thompson wrote:

https://groups.google.com/g/mechanical-sympathy/c/gSkbc3grzN...

> Most FP persistent data structures take the form of a tree and perform a path copy from leaf to root for the change, reusing as much of the tree as possible. What this results in is a single contention point at the root of the tree for CAS operations. Now expand this view to a larger domain model of many entities, e.g. Customers, Orders, Products, etc., all interrelated and that model needs one or more entry points. Each of these entry points become a contention hotspot as the model is mutated. Also with trees and linked lists being the underlying data structures, performance is limited due to indirection causing cache misses.

So really it matters a lot what you're doing. FP is not a panacea for concurrency and parallelism. As an application programmer that does mostly IO and very little heavy calculation etc. it's for example great.


> > Most FP persistent data structures take the form of a tree and perform a path copy from leaf to root for the change, reusing as much of the tree as possible. What this results in is a single contention point at the root of the tree for CAS operations. Now expand this view to a larger domain model of many entities, e.g. Customers, Orders, Products, etc., all interrelated and that model needs one or more entry points. Each of these entry points become a contention hotspot as the model is mutated. Also with trees and linked lists being the underlying data structures, performance is limited due to indirection causing cache misses.

But the root of a tree is not mutated. Not mutating anything in the whole tree is the whole point. It will be there, until garbage collected (since no one has a reference to it any longer, but only to other versions / other roots). Not sure I am understanding what he is saying correctly.

Maybe he is talking about the kind of "reduce step" which one has to deal with, when all the parallel running functions return results and how to use the results? They might make a new tree-like thing. Sometimes there is no way around a sequential section in the program, since the problem is inherently sequential. But I don't see the contention hotspots.


>Try process-level I/O, such as pipes, sockets, and the like.

This.

> Have Linux deal with the concurrency problem, not you.

Not just Linux. We did this with our Windows app rewrite. IPC with pipes is fast as hell, just works, and it greatly simplified parallelism for us.


But why do you need a separate process? You can do the same with threads and queues. The only advantage I can think of is sandboxing, i.e. preventing a misbehaving task from taking down your whole app.


Ease of development. Ease of maintenance. Separation of concerns. Just to name a few. For example, one service is for hardware communication. Its job is to communicate with all the various devices, and forward those messages out.

>You can do the same with threads and queues.

Yes, but as OP up the chain mentions, you can simplify by letting the OS handle some of that.


> Ease of development. Ease of maintenance. Separation of concerns.

I'm not convinced. Managing processes is much more complex than managing threads, particularly if the code should be cross-platform. You need a very good library to hide away all the nasty OS-specific details. Same goes for pipes or sockets. Then there is the whole issue of message (de)serialization. I don't see how this can possibly be easier than starting a thread and communicating with a concurrent queue.

I mean, subprocesses certainly have their use cases, but I would never see them as a drop-in replacement for threads.

> you can simplify by letting the OS handle some of that.

Processes and threads are both OS resources. For the OS scheduler they are practically equivalent.


Consider: what is the easiest way to parallelize GCC to a 32 core system?

Answer: make -j128.

Way easier than trying to make every data structure inside of the compiler into a parallel programming model.

--------

Process level parallelism is a higher level of thinking. As long as you have spare RAM (and let's be frank, we all have 32GB+ sitting around these days), you can add more and more processes to solve your problem.

If you are talking data structures and mutexes, you are working at a far more complex layer than what I'm saying for #1.

----

Even completely single threaded code can be run in parallel in many cases in practice, because we have multiple files or other divisions of labor available at the process / user level.


> Consider: what is the easiest way to parallelize GCC to a 32 core system?

> Answer: make -j128.

> Way easier than trying to make every data structure inside of the compiler into a parallel programming model.

That's a false dichotomy. If the compiler program was implemented as a library, I would rather create 128 threads that each call compileSourceFile() than spawn 128 subprocesses that do the same thing.

The question you should be asking is: do I need my task to execute in a separate address space? If yes, spawn a subprocess; otherwise use a thread.


> I would rather create 128 threads that each call compileSourceFile() than spawn 128 subprocesses that do the same thing.

Would you rather write "compileSourceFile()" in a reentrant way (ie: no global variables, no static variables, guaranteed reentrancy, and other such requirements to work in a typical pthread manner)... or would you rather have processes where all those things are fine and not bugs?

The minute you start up threads with implicitly shared memory spaces is the minute the "singleton pattern" suddenly grows complex.

> The question you should be asking is: do I need my task to execute in a separate address space? If yes, spawn a subprocess; otherwise use a thread.

On the contrary. Separate address spaces by default is far easier. Thread#45 going crazy due to buffer-overflows will demolish thread#25.

But process#45 with a buffer-overflow will not affect process#25.

I/O is also grossly simplified inside the process model. Closing out a process closes() all sockets, pipes, and file I/O automatically, no matter how the process dies. (Ex: Segfaults, kill -9, etc. etc. are all handled gracefully).

If one thread dies, for whatever reason, your program is extremely hosed. It's very difficult to reason about where the legitimate state of your multithreaded data structures is.

Threads are far more efficient, yes. So if you need efficiency, use them. But most people in my experience are Python or PHP programmers (or other such high level language), where it is clear that performance isn't an issue.


> Would you rather write "compileSourceFile()" in a reentrant way (ie: no global variables, no static variables, guaranteed reentrancy, and other such requirements to work in a typical pthread manner)... or would you rather have processes where all those things are fine and not bugs?

Certainly the former. There is a good reason why you should avoid global state (if possible). I never found it to be particularly hard...

> On the contrary. Separate address spaces by default is far easier. Thread#45 going crazy due to buffer-overflows will demolish thread#25.

As I noted, sandboxing is a valid use case for subprocesses. But this is completely orthogonal to the topic of threads! A buffer overflow can do all sorts of crazy things even in a single-threaded environment and you can totally use sandboxing in sequential code.

> If one thread dies, for whatever reason, your program is extremely hosed.

If one thread crashes, the whole process dies. There is no consistency problem here.

> But most people in my experience are Python or PHP programmers

Ok, I have been rather thinking about languages with first-class threading support (C, C++, Rust, Java, C#, etc.). Most scripting languages do have very limited multi-threading support (or none at all). In Python, for example, it often isn't even possible to achieve CPU level parallelism with threads because of the GIL, so you have to use subprocesses for that.


> Certainly the former. There is a good reason why you should avoid global state (if possible). I never found it to be particularly hard...

It only takes one library call into a non-reentrant function to mess everything up in a threading environment. Things are mostly thread-safe today, but I still fall into the trap. It's not like we're double-checking our 3rd-party libraries all the time.

In any case, the "Multithreaded Singleton" problem is devilishly difficult to write and full of subtleties. I disagree very strongly about threads making things easier. The singleton pattern is written incorrectly in almost every instance I've seen it in the wild. Global state is important in many cases and cannot be avoided, and threads absolutely complicate it even more than usual.

----------

But let's reverse things for a second. I've listed off multiple kinds of code that work in a process environment but not in a threading environment.

You haven't listed off what makes threads actually easier yet. You're saying "mutexes and queues" are easier, and I disagree. At a minimum, a pipe / FIFO performs a similar role to a queue and even has atomicity guarantees (writes of PIPE_BUF bytes or smaller are atomic). Sockets provide a client/server model as well.

What I can say for sure is that pthreads / mutexes / queues are _more efficient_ than pipes / FIFOs / etc. But "simplicity"? It's really not so difficult to read() or write() from a pipe.
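
To make the pipe-as-queue point concrete, a minimal sketch (the Task record is hypothetical; since sizeof(Task) is well under PIPE_BUF, each write is atomic and records never interleave, even with several writers):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct Task { int id; double payload; };  // well under PIPE_BUF bytes

    int main(void) {
        int fds[2];
        pipe(fds);
        if (fork() == 0) {                    // child: the consumer
            close(fds[1]);
            struct Task t;
            while (read(fds[0], &t, sizeof t) == (ssize_t)sizeof t)
                printf("got task %d\n", t.id);
            return 0;                         // EOF: producer closed its end
        }
        close(fds[0]);                        // parent: the producer
        for (int i = 0; i < 3; ++i) {
            struct Task t = { i, i * 1.5 };
            write(fds[1], &t, sizeof t);      // atomic: sizeof t <= PIPE_BUF
        }
        close(fds[1]);                        // signals EOF to the consumer
        wait(NULL);
        return 0;
    }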

> If one thread crashes, the whole process dies. There is no consistency problem here.

Are you sure?

Let's say I pthread_cancel() one of your threads. Is that cool?

Let's compare / contrast with kill(SIGKILL), which is still dangerous but... the state of a process is far more consistent. All fds are closed (including pipes, files, and sockets). This has a side effect of cleaning up flock().

There's a couple of complications involving SystemV semaphores, but even this has been figured out with semaphore adjustment values (which are automatically applied upon process exit, even from a kill).

---------

If thread #45 accepts() a connection, then gets pthread_cancel()d, that socket will effectively be leaked and live forever (because there's no way for anyone else to close() that socket correctly). Especially if you have a PTHREAD_CANCEL_ASYNCHRONOUS flag, these sorts of things can be devilishly hard to debug.


> In any case, the "Multithreaded Singleton" problem is devilishly difficult to write and full of subtleties.

First off, global state does not necessarily require the singleton pattern.

But let's assume that we really need it, e.g. to create a global resource only on demand. This is how it's done in C++:

  class Foo {
  public:
      static Foo& getInstance() {
          // Since C++11, local static variable initialization is thread-safe!
          static Foo instance;
          return instance;
      }
  private:    
      Foo() {
          // expensive constructor
      }
  };

I mostly use C++ and various scripting languages, but judging from the examples in https://en.wikipedia.org/wiki/Double-checked_locking it seems like C#, Java and Go all have at least one simple and safe method to achieve this.

> Global state is important in many cases and cannot be avoided, and threads absolutely complicate it even more than usual.

If you depend on global state in the parent process, how can the subprocesses even operate, since they do not have access to that state? Yes, certain resources, such as loggers, need to be global, but these should really be thread-safe anyway.

> You haven't listed off what makes threads actually easier yet.

* creating and joining threads (in a portable way!) is trivial in most programming languages; creating and joining subprocesses not so much

* exchanging data is much easier, no marshalling needed

* logging is much easier

* error handling is much easier

* debugging is much easier. (How do you debug a short-lived subprocess? The process terminates before the debugger even has a chance to attach to it.)

> You're saying "mutexes and queues" are easier, and I disagree.

Maybe I was not clear, I was really talking about concurrent queues. The producer can push messages, the consumer waits on messages and processes them. It works basically like a pipe/FIFO, but without the pitfalls.

> > If one thread crashes, the whole process dies. There is no consistency problem here.

> Are you sure?

> Let's say I pthread_cancel() one of your threads. Is that cool?

I was talking about threads crashing.

> If thread #45 accepts() a connection, then gets pthread_cancel()d,

pthread_cancel() is dangerous, I agree. Generally, you should not use it. (I have never needed it.) There are much saner ways to "cancel" a thread, depending on your language. Often it's enough to periodically check a boolean flag.
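
For example, a minimal cooperative-cancellation sketch (C++11; C++20's std::jthread and std::stop_token package up the same idea):

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<bool> stop_requested{false};

    void worker() {
        while (!stop_requested.load()) {
            // ... one bounded chunk of work ...
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        // normal return: destructors and cleanup run, unlike pthread_cancel()
    }

    int main() {
        std::thread t(worker);
        std::this_thread::sleep_for(std::chrono::seconds(1));
        stop_requested = true;  // ask politely...
        t.join();               // ...and wait for the clean exit
    }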

> Let's compare / contrast with kill(SIGKILL), which is still dangerous but... the state of a process is far more consistent.

Yeah, it is easy to kill a subprocess. But let's consider the opposite: what happens if the parent process dies? How do you make sure that a long running subprocess automatically terminates? It is definitely not trivial.

---

It's fine if you like subprocesses. I just wanted to challenge the notion that they are somehow easier to use than threads - which just doesn't match my experience. We are probably working in entirely different domains, so it's natural that our experiences differ.


> it seems like C#, Java and Go all have at least one simple and safe method to achieve this.

It's only simple after you've studied double-checked locking. Initial attempts often lead to failure.

> If you depend on global state in the parent process, how can the subprocesses even operate, since they do not have access to that state?

Plenty of ways to get access. The #1 way is probably to use a database to share that state in a concurrency-safe way. sqlite3 works, though postgresql is more scalable.

There are also solutions that require less resources: flock() a file and then read the shared state in a manner that's cohesive across processes. If your process dies while flock()ing something, the flock() automatically releases (unlike mutexes, where a pthread_cancel()ed thread could very well leave a permanently locked mutex).

That's why so many systems have a database + dedicated process that handles this kind of shared global state between processes.
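
A minimal flock() sketch (the lock-file path is hypothetical); the point is that the kernel drops the lock on process exit, however the process exits:

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/myapp.lock", O_CREAT | O_RDWR, 0600);
        flock(fd, LOCK_EX);   // blocks until we own the lock
        /* ... read/modify the shared state; if we are SIGKILLed here,
           the kernel releases the lock for us, no cleanup code needed ... */
        flock(fd, LOCK_UN);   // explicit unlock (exit/close would also do it)
        close(fd);
        return 0;
    }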

> Yes, certain resources, such as loggers, need to be global, but these should really be thread-safe anyway.

Loggers are an excellent example of where opening up a pipe or socket to syslogd is far easier than trying to shoehorn in a mutex+queue across threads.

> Yeah, it is easy to kill a subprocess. But let's consider the opposite: what happens if the parent process dies? How do you make sure that a long running subprocess automatically terminates? It is definitely not trivial.

If the parent of a process group is terminated, SIGHUP is sent to its children. Catch the signal then terminate.

So once again: processes handle both situations (parent dies, kill children. Or children die, notify parent), with SIGHUP and SIGCHLD respectively.
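
(On Linux specifically, a child can also ask the kernel directly for a signal on parent death via prctl; this isn't the SIGHUP machinery described above, just an extra belt-and-braces option. A minimal sketch:)

    #include <signal.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(void) {
        if (fork() == 0) {                     // child
            prctl(PR_SET_PDEATHSIG, SIGTERM);  // deliver SIGTERM when parent dies
            if (getppid() == 1)                // parent already gone? (race guard)
                _exit(0);
            for (;;) pause();                  // long-running child work
        }
        sleep(1);                              // parent exits here;
        return 0;                              // the child then gets SIGTERM
    }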

No such signaling exists in pthread_blah world. You're (trying to) argue for the "superiority" of threads when processes have all of these issues 100% figured out, while the pthread world is completely ignorant of these issues.


> The #1 way is probably to use a database to share that state in a concurrency-safe way. sqlite3 works, though postgresql is more scalable.

> There are also solutions that require less resources: flock() a file and then read the shared state in a manner that's cohesive across processes.

Wow. And that is somehow easier than using, say, a concurrent collection? (I have never used a database in my life, so we obviously come from very different angles :-)

> (unlike mutexes, where a pthread_cancel()ed thread could very well leave a permanently locked mutex).

Again, there is almost never a good reason to use pthread_cancel() in the first place.

> opening up a pipe or socket to syslogd

In my projects, logging means "print to stderr or write to a file" :-)

> If the parent of a process group is terminated, SIGHUP is sent to its children. Catch the signal then terminate.

So you need to make a process group... What if your code should be cross platform? Do you know how to do this on Windows?

> No such signaling exists in pthread_blah world

This kind of signaling does not exist because it is not necessary. Tasks either periodically check a boolean flag or get notified via the queue itself. Here is a random simple example: https://openframeworks.cc/documentation/utils/ofThreadChanne....

Also, I'm not sure why you keep talking about pthreads... Modern programming languages have their own (portable) threading abstractions.


> Wow. And that is somehow easier than using, say, a concurrent collection? (I have never used a database in my life, so we obviously come from very different angles :-)

When it comes to understanding locks, concurrency, and parallelism with shared data between threads... yeah. In my experience the database is heavyweight, but it absolutely ensures that most issues are taken care of.

But as I stated earlier, lighter-weight solutions such as flock() exist for a reason. If the database is too heavy (note: sqlite3 is extremely lightweight, so I bet it works for most cases), then flock()ing a file and reading/writing to it works too.

What flock() gets you is 100% certainty about cleanup in strange exit cases, which threads do not give you. A flock() always cleans itself up on process termination. There are no such guarantees about mutexes (or other issues) on thread cancellations or other thread-related issues. (I dunno, the OOM killer.)

> Again, there is almost never a good reason to use pthread_cancel() in the first place.

On the contrary. The pthread community knows that pthread_cancel() is poorly behaved and constantly tells beginner programmers not to use it.

"There's no good reason" because everyone knows that the number of traps in using that function are legion. Its never worthwhile to use that function because it just leads to severely buggy behavior in practice.

> So you need to make a process group... What if your code should be cross platform? Do you know how to do this on Windows?

... you know that Win32 doesn't support pthreads, right? And C++ std::thread doesn't support anything that we've talked about either.

To answer your question: Win32 job objects. Every reasonable modern OS supports the concept of sessions (Windows just calls them job objects instead). The OS-level (be it session leaders / session groups in Linux, or Job objects in Windows) is the correct solution to this problem.

EDIT: Found the function name for ya: https://learn.microsoft.com/en-us/windows/win32/api/jobapi2/...

> This kind of signaling does not exist because it is not necessary. Tasks either periodically check a boolean flag or get notified via the queue itself. Here is a random simple example: https://openframeworks.cc/documentation/utils/ofThreadChanne....

> Also, I'm not sure why you keep talking about pthreads... Modern programming languages have their own (portable) threading abstractions.

And you damn well know that under a thread-kill or thread-cancel scenario, this code stops functioning. While all the code I talked above will function correctly even in the worst-case "kill -9" SIGKILL.

Whatever underlying synchronization that channel is using to synchronize thread access will stop working if a pthread_mutex_lock() is called, but its corresponding pthread_mutex_unlock() fails to be called due to cancellation or other issue.


> On the contrary. The pthread community knows that pthread_cancel() is poorly behaved and constantly tells beginner programmers not to use it.

Now tell me: how should it behave?

> It's never worthwhile to use that function because it just leads to severely buggy behavior in practice.

Well yeah. That's why we don't use it. There is no possible sane way to implement this safely. But that's not the point. Cancelling a thread is like pulling the plug on your PC. It's obviously not the right way to stop a thread. What you should do - and what everybody does in practice - is tell the code in the thread to return. Just like you ask your OS to gracefully shut down your PC. In the example I gave above (ofThreadChannel) this is as simple as calling the close() method.

(In the same way, sending SIGKILL isn't the proper way to stop a subprocess either. It is the last resort.)

> ... you know that Win32 doesn't support pthreads, right?

There are in fact pthread implementations/wrappers for Windows, e.g. libwinpthread from the mingw64 project. pthreads is short for POSIX threads, it is not tied to a particular OS.

> And C++ std::thread doesn't support anything that we've talked about either.

You mean something like pthread_cancel? Of course it does not. We do not need it.

> To answer your question: Win32 job objects.

The question was more rhetorical... but thanks :-)

> And you damn well know that under a thread-kill or thread-cancel scenario, this code stops functioning. While all the code I talked above will function correctly even in the worst-case "kill -9" SIGKILL.

You are right that interrupting a thread can be disastrous. However, I never ever needed to do this... and I have written a lot of multi-threaded code.

> Whatever underlying synchronization that channel is using to synchronize thread access will stop working if a pthread_mutex_lock() is called, but its corresponding pthread_mutex_unlock() fails to be called due to cancellation or other issue.

What other issues? Never had this problem... neither do all the heavily multi-threaded programs I use daily.

You somehow try to paint threads as fragile based on some obscure pthreads feature that almost nobody uses in practice...


> There are in fact pthread implementations/wrappers for Windows, e.g. libwinpthread from the mingw64 project. pthreads is short for POSIX threads, it is not tied to a particular OS.

I've had poor experience with MinGW's implementation of pthreads. I've always preferred Windows' native threads instead when I was on Win32. Yes, it means rewriting pthread_create code from Linux into CreateThread on Win32, but it's better than the alternative.

There's just a whole bunch of "Win32-isms" that don't really make sense with how Linux assumes pthreads to work.

I'm glad that C++ std::thread exists now and is my preference these days.

> You somehow try to paint threads as fragile based on some obscure pthreads feature that almost nobody uses in practice...

Tell me. Do you use RAII?

All I'm trying to point out is that processes are RAII for _almost every known resource_ in your program.

No need to "pthread_cleanup_push" or pop those cleanup handlers. No need to figure out where pthreads could fail under cancellation points or other such obscure error conditions.

When a process exits, all FDs are closed, SIGHUP / SIGCHLD are sent to the awaiting processes as expected, and all sorts of well specified cleanup occurs.

In pthread-land, it's 100% manual. You have to identify every single case and handle it properly (and properly pthread_cleanup_push) to RAII the codebase into a clean state.

-------

If you've never had such cleanup issues in multithreaded code... then I hope you never come across it. But in my experience, the additional "free RAII" factor of Linux (or Windows) that is given to a full process (instead of a Thread) really improves the reliability of my programs.

So unless I have to use Threads (and I'm very well acquainted with the tools available in Thread space), I prefer using those RAII-like process cleanup functions.

God forbid an exception has an exception inside of itself in one of your threads and causes a termination condition you weren't expecting (std::terminate), or some other obscure case happens. The "free cleanup" on processes handle these sorts of obscure deaths much better than threads handle it.


> No need to "pthread_cleanup_push" or pop those cleanup handlers. No need to figure out where pthreads could fail under cancellation points or other such obscure error conditions.

Well, I haven't needed any of these, simply because nobody cancels my threads :-)

> When a process exits, all FDs are closed, SIGHUP / SIGCHLD are sent to the awaiting processes as expected, and all sorts of well specified cleanup occurs.

By default, the subprocess just terminates. Yes, any OS resources are eventually returned, but my C++ destructors do not run. I wouldn't call this "specified cleanup". For proper cleanup you would first have to install a signal handler and figure out a way how to make the code in the main thread return gracefully. A bit similar to how we stop threads, but much more complicated and non-portable. (A better way to stop subprocesses is to send a message, assuming we have a pipe or socket.)

> So unless I have to use Threads (and I'm very well acquainted with the tools available in Thread space), I prefer using those RAII-like process cleanup functions.

And I'm perfectly fine with standard C++ RAII classes. std::ostream doesn't care whether it is created and destroyed on the main thread or some auxiliary thread.

> God forbid an exception has an exception inside of itself

What?

Also, what does this have to do with threads? Yes, programs can crash in various ways, but it does not matter in which thread this happens. If any thread crashes, the whole program crashes.


When I call "wait" on a process, what do I know about it? I know that all fds have closed, all sockets have closed, etc. etc. A hell of a lot more than the thread example.

When I call pthread_join() on a thread, what do I know about it? Nothing. You have to assume it cleaned itself up correctly.

That's all I'm trying to point out. There's literally no assurances on pthread_join() or pthread terminations or pthread cancellations. None at all.

----------

Whether or not you wish to take advantage of this is your choice. If you want to just presume that pthreads always clean themselves up without any issues or subtle threading bugs affecting the other threads... sure. Or that we're perfect programmers who never make such mistakes...

But I have learned time and time again: if it's not me who makes such a mistake, then maybe it's a coworker who is touching the codebase. Having 100% assurances (like the fd closures) on processes is something I can absolutely build code around, without assuming the internal state or "correctness" of other people's code (including my own).

> And I'm perfectly fine with standard C++ RAII classes.

Then you should be comfortable with process cleanups and how to code around it to form strong guarantees of correctness. Enforced by Linux (even stronger than the compiler).

kill and pthread_cancel may prevent C++ destructors from running. But even kill -9 will correctly clean up processes (including the SIGCHLD signals and other such details).

Is it perfect? Of course not. But it's one more assurance you lose when you go for threads instead of processes.


Sorry for being so insistent.

> When I call pthread_join() on a thread, what do I know about it? Nothing. You have to assume it cleaned itself up correctly.

I don't really understand your point. A thread just runs a function. If that function leaks resources, that's a problem with that particular function, not with threads in general. It would be just as problematic when running sequentially.

If you are worried about a particular function or module, sure, you can run it in isolation in a subprocess. But this is completely orthogonal to the topic of parallelism or threads. Just because something runs as a subprocess does not mean it will execute in parallel, it depends on how the subprocess communicates with the parent process.

I mean, if you prefer subprocesses, that's fine, but don't sell them as some kind of silver bullet.

> kill and pthread_cancel may prevent C++ destructors from running.

Sigh. As I said again and again, you wouldn't use pthread_cancel() in the first place. It's like complaining that setjmp() breaks your C++ code. Also, SIGKILL will terminate the whole process (it cannot even be caught by a signal handler).

> But even kill -9 will correctly clean up processes

Yes, it will release any OS resources, but that alone is not "correct clean up". For example, C++ destructors won't run. Sending SIGKILL is like pulling the plug on your desktop, you only do it if nothing else works.


> I mean, if you prefer subprocesses, that's fine, but don't sell them as some kind of silver bullet.

I've listed 5 different multithreaded techniques at the beginning of this discussion.

1. Processes

2. Mutexes

3. Condition Variables

4. Seq-Cst Atomics

5. Atomics + Memory Barriers

If you thought I was listing silver bullets, you're severely mistaken. I'm pointing out that #1 is easier than #2. And #2 is easier than #3, and #3 tends to be easier than #4. Prefer easier solutions over complex solutions.

That's all I'm trying to point out. Of course there's no silver bullets. But there are "preferred" solutions. Generally speaking, easier solutions vs harder solutions.

---------

Processes are easier because they're (partially) isolated. In contrast, threads have no isolation. Heck, one can argue that VMs and Docker are also responses to this isolation issue.

If you're unable to see why that's easier, I dunno how else to explain it. I've given it a bunch of posts.


> I've listed 5 different multithreaded techniques at the beginning of this discussion.

Unsurprisingly, I have a problem with that list as well. It mixes things that belong to different categories. A process is an execution context while the rest are synchronization mechanisms. For example, it is possible to synchronize two processes with atomic variables (in shared memory). Also, the list misses any kind of higher level threading primitive (concurrent queues, channels, futures, etc.). There is a whole world above explicit mutex locking.

> I'm pointing out that #1 is easier than #2.

I totally believe you when you say that subprocesses are the preferred solution for your particular use cases. It just does not make sense as general advice. It may be practical that subprocesses are isolated, but you cannot just downplay all the downsides.


> I'm not convinced.

Great! Because that's not what I'm trying to do. Figure out what works best for you. We did.

> particularly if the code should be cross-platform.

Ours, currently, does not.

> Then there is the whole issue of message (de)serialization. I don't see how this can possibly be easier than starting a thread and communicating with concurrent queue.

We develop in C#. All of this is available from MS.


> Figure out what works best for you. We did.

I did as well :-)


Level 0: Use infra like Kafka, and eventing to replicas.


I bet someone else has that link at hand where someone does parallel processing in shell with a fraction of the memory and CPU


Related:

“Is Parallel Programming Hard, and, If So, What Can You Do About It?” v2 Is Out - https://news.ycombinator.com/item?id=26537298 - March 2021 (75 comments)

Is parallel programming hard, and, if so, what can you do about it? - https://news.ycombinator.com/item?id=22030928 - Jan 2020 (85 comments)

Is Parallel Programming Hard, and, If So, What Can You Do About It? [pdf] - https://news.ycombinator.com/item?id=9315152 - April 2015 (31 comments)

Is Parallel Programming Hard, And, If So, What Can You Do About It? - https://news.ycombinator.com/item?id=7381877 - March 2014 (26 comments)

Is Parallel Programming Hard, And, If So, What Can You Do About It? - https://news.ycombinator.com/item?id=2784515 - July 2011 (39 comments)


Multi-threaded programming has been of particular interest to me for decades (since my early years programming for OS/2). Whenever I write code, I look for ways to do things in parallel.

My new data management system is highly parallel. I am always finding tasks that take minutes to complete and getting them down to just seconds (when running on multi-core CPUs) by getting multiple threads working together on the same problem.

Just yesterday, I found a task that was taking over 12 minutes to finish (inserting 125 million key/value pairs into a data store) and was able to get it to do the same task in just 37 seconds (running on my 16 core/32 thread CPU) by spinning off multiple threads.


Use an actor model language -- by far the sanest way. Message passing is intuitive to human experience.

1. Elixir (Erlang)

2. Scala/Akka

3. Pony


I am biased because this is my research area, but I have to respectfully disagree. Actor models are awful, and the only reason it's not obvious is because everything else is even more awful.

But if you look at, e.g., the recent work on task-based models, you'll see that you can have literally sequential programs that parallelize automatically. No message passing, no synchronization, no data races, no deadlocks. Read your programs as if they're sequential, and you immediately understand their semantics. Some of these systems are able to scale to thousands of nodes.

An interesting example of this is cuNumeric, which allows you to take sequential Python programs that use NumPy, and by changing one line (the import statement), run automatically on clusters of GPUs. It is 100% pure awesomeness.

https://github.com/nv-legate/cunumeric

(I don't work on cuNumeric, but I do work on the runtime framework that cuNumeric uses.)


>the recent work on task-based models, you'll see that you can have literally sequential programs that parallelize automatically

Can you provide some details please? I am not quite clear what you mean.


If you really want to dig into it you can read up on the tutorials and/or papers from the Legion project: https://legion.stanford.edu/

But briefly, these task-based programs preserve sequential semantics. That means (whatever the system actually does when running your program), as long as you follow the rules, the parallelism should be invisible to the execution of the program.


While not for clusters AFAICT, Rayon (a Rust lib) auto-parallelizes loops. It is used in a great example in Programming Rust by O'Reilly. Forgive the RESF response please.


It is a pity Concurrent ML didn't take off. F# has a great library called Hopac that implements it, but it is 50 times less popular than its closest competitor Rx.

Also seconding that other post. Actor models and async concurrency are only useful if you need to send messages between machines, but otherwise you want to use synchronous concurrency as it is easier to deal with.


It’s not always easy, things can get confusing when you start sending messages to yourself


Came here to say just this. In particular, _immutable_ message passing.


4. D


Section 2.3.3 is definitely worth reading carefully.

I've both increased efficiency and removed bugs by rewriting a system whose authors thought the only way to make it faster was to add more threads, when optimising algorithms and data layout resulted in much greater gains.


There are three ways to 'scale' a computer program. The first is optimizing the algorithms and data structures which is discussed in the section you noted. Make your code run as fast as possible using a single thread even when the amount of data you are processing gets extremely large. The second is taking advantage of better hardware (more cores, bigger L2 and L3 caches, SSDs, etc.) which includes spinning off threads to do single tasks in parallel. The third is a 'scale out' architecture where the application is spread out across multiple machines.

With all the cloud infrastructure available today, too many programmers focus on the third option (which can mask inefficiencies in the first two) when they could get as good or better performance at a cheaper price by focusing on the other options instead.


It's interesting that, while Moore's law saturated many years ago, there is still no parallel programming style that hits a sweet spot between productivity and performance for multicore CPUs (and thus gets adopted for mainstream development).

It's not clear if this means there is no such "optimum", or simply that it is not something anybody cares about.

People have focused a lot on GPUs, but that's not easy either.


That's because there are really two conflicting goals in parallel programming.

1. Maximizing utilization of CPUs / GPUs / compute resources -- the "obvious" goal. You want as much code running in parallel as possible to accomplish some task faster.

2. Maximizing the utilization of SSDs / hard drives / Ethernet / I/O as much as possible -- less obvious, but in I/O-constrained problems it's not so much the CPU you're focused on as the I/O you're trying to maximize.

Processes and threads are classically designed to solve #2, _not_ #1. Yes, we abuse processes and threads to make #1 go faster, but it really wasn't their original point.

When you perform a read() on a socket / Hard drive / whatever, it makes sense to "swap out" the process and find something else to do. This is optimizing #2, trying to run as many processes as possible to maximize the number of requests going to your I/O centers.

In contrast, if you're trying to perform a dense matrix multiplication in AVX512 or GPU space or whatever, all this task-switching is completely useless, and processes are detrimental to your goal, not beneficial. It's completely the wrong tool to use.

Bonus points: 4x GPUs working with a CPU (say, 64-core CPU) will run into both #1 and #2 problems simultaneously. Hurrah!

------------

Of course, today there's event-driven code, coroutines, Golang threads, fibers, epoll... lots of tools to help you out on these tasks. But as the computer world grows more nuanced, it grows more complex. It's harder to figure out which tool to reach for in your toolbox.


Maybe there should be an IDE plugin that after you run your code once it makes a list of recommendations for parallelisation, including the possibilities "please rewrite this in language X and style Y" or "forget about it"


Intel VTune


> Processes and threads are classically designed to solve #2, _not_ #1.

Do you have a source for this?


Sure. The 1990s.

Single-core processors like the 486, Pentium, Pentium2 and Pentium4 had processes and threads, and many many programmers found them useful in the 1980s and 1990s.

Multicore computers weren't popular until the mid 00s, long after processes and threads were implemented in modern OSes.

------------

That is, for the entire time in the 1980s and 1990s, on single-core processors... processes and threads would _NOT_ speed up your programs (as per #1), because you only had a singular core. Task switching doesn't help at all on a 486 or a Pentium in terms of #1.

But task switching / threads helps in terms of Hard Drives (elevator algorithm), networking, and other I/O issues.

Even today: OSes like FreeRTOS offer you threads and/or processes on single-core systems like STM32G0 and other small microcontrollers.

----------

Windows 95 was when multiprocessing became popular in the Microsoft world. It would take another decade before typical home users had multicores. 99% of the time, a program sat there, waiting for the mouse to move or some other event to happen.


But there is a dominant parallel programming style. In fact, it actually boils down to one of two styles:

* Here's a list of things. Run the same bit of code for every item in the list of things. (A slight adjustment is necessary if you need to do something like a reduction tree.) A sketch of this style follows the list.

* Here's a graph of tasks, with dependencies expressed as edges. Run as much as you can in parallel.
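
A minimal sketch of the first style with C++17 parallel algorithms (on GCC this typically needs TBB installed and -ltbb at link time):

    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> xs(1000000, 1.0);
        // Same code for every item, no visible threads or locks:
        std::for_each(std::execution::par, xs.begin(), xs.end(),
                      [](double& x) { x = x * x + 1.0; });
        // And a reduction tree for the "slight adjustment" case:
        double sum = std::reduce(std::execution::par, xs.begin(), xs.end(), 0.0);
        (void)sum;  // 2000000.0 here
    }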

What makes parallel programming difficult is two main things. First, the way to achieve parallelism is highly dependent on the size of the tasks, with designs for one scale being horribly bad ideas at different scales. Second, there's a pretty severe penalty when communication between tasks is involved (and, notably, two tasks both wanting to read the same data can cause pain, not just read/write or write/write conflicts).


The first type is what people used to call "embarrassingly parallel". While it should be easy to have this solved by now across the board (after all, most CPUs are multicore now), arguably it is still not quite trivial or uniform, depending on which language or stack one works with.

The second case, where there is data exchange between tasks, is indeed the real challenge, as the problem is basically open-ended. The MPI approach can conceptually handle many cases but carries maybe too much overhead to be the default programming paradigm. Which brings us back to the question of low-hanging opportunities. E.g. he mentions SQL in the book, and I think inner-loop vectorisation is another example. But those are rather special 'graphs'.


This is exactly what the BEAM and OTP do for Erlang/Elixir IMO.

In order to have parallel programming work effectively, you have to enforce a set of rules that ensures it always works reliably. You can’t add it on after the fact and that’s why it’s such a hard problem outside of the BEAM.


I don't know much at all about Erlang, but if it cracked this, shouldn't it be more prominent in HPC-type applications? Is there some other tradeoff?


There were some tradeoffs that caused heavy computations to be a concern for many years. Elixir Nx recently tackled the problem and is probably a worthwhile read if you're curious.

https://github.com/elixir-nx

https://dashbit.co/blog/elixir-and-machine-learning-nx-v0.1


It's slow, so while concurrency is great, it will never replace the core libraries written in C, C++, or FORTRAN.


I think it's because shared memory concurrency is easy to start on in lots of popular languages. But it doesn't take long until you're in a tricky mess of locks.

Actor style based on explicit communication and no shared memory is a lot easier to work with (IMHO), but it's not as easy to get started on because it's not as simple as pthread_create and go.


No shared memory will also mean that actor style is slower-than-single-thread-slow, when the "message-size to processing-per-actor ratio" is not right.

Actors and message passing is great for problems where it fits and worse than useless where it doesn't


Parallel programs are often way less efficient. Sure, communication is expensive and parallelism requires communication. On the other hand, they're often 2x less efficient or more. Poor scaling means you go beyond X-number of core/nodes (often a lower number than you'd like) and 90, 95, 99 % of the extra CPU power you throw at the application is burnt up in pure overhead.


I have a small project in Ruby to explore different concurrency and parallelism strategies: https://github.com/rickhull/miner_mover


I use OS threads + non-blocking IO, with the concurrent package for shared data, in Java. The performance is incredible.

If I wanted to get a little more performance per watt I would probably rewrite it in C with arrays of atomic variables.

But you need a VM with GC to be able to be productive during the day and sleep at night, so probably not...


Now, if you could reduce the overhead of OS threads by using lightweight processes instead ... Wait a moment, isn't that what Erlang does? Well, project Lumen might help you out on the JVM at some point.


With NIO there is no OS overhead like context switches because you only need one thread per core as long as everything is async.

Erlang is single threaded unless you copy memory between threads.

The only step left for Java to implement (after io_uring file stuff) is user-space networking.


The Beam VM handles memory copying, afaik. It is very much multi-process. It has very lightweight processes, of which you can spawn thousands. Those are then run in as many OS threads as makes sense and are available to the virtual machine. It is unclear what you mean by Erlang (the language) being single threaded. The VM easily runs on as many cores as you give it, probably by using as many or more OS threads.


It isn't hard, but you need to change your programming standpoint. To write parallel code you need to think more about data alignment, dependency, and flow. It's quite different from typical object/behaviour-oriented programming.


Funny. I've been reading this for the past six months and just finished today.

Life is weird.


And, did you like it? I guess you did, otherwise you wouldn't waste six months on it. But can you share in a few words what you think of this book?


Silly me for thinking that people wouldn't care about my opinion.

It's a great book. Some things could be better, of course, but it's a free book, so the quality to price ratio is off the charts, and not just because it's free. I would have happily paid $75 for it, maybe more.

That said, it's a book that requires you to care about the Linux kernel, where the author's experience is. If that's a problem, then you won't get as much out of it.

The book is best used as a reference after reading through it once. I would do a medium-deep read the first time. This will tell you the concepts you need to look up later, what techniques exist, etc.

This is because the book delves into detail about various techniques to get concurrency. The philosophy is to do the easiest thing that works, which is great, but it does mean it talks about details. Thus, it's best as a reference later after absorbing the surface level.

However, that medium-deep read is still necessary for you to know what you need to look for later when you need details.

I hope that helps.


I read it a few years ago too. Paul McKenney is the author of RCU in the Linux kernel, so a lot of the book is centered around it.


It does. I appreciate the write-up.


I use ZIO (http://zio.dev) for Scala which makes parallel programming trivial.

Wraps different styles of asynchronicity, e.g. callbacks, futures, fibers, into one coherent model. And it has excellent resource management, so you can be sure that when you fork a task it will always clean up after itself.

Have yet to see anything that comes close whilst still being practical i.e. you can leverage the very large ecosystem of Java libraries.


Try using libthread on plan9, no locks.


My solution is to only solve problems that are embarrassingly parallel, like graphics (one pixel = one thread) or physics simulations (one object = one thread), and escape the pain of synchronisation.


I use channels to send messages between threads in Nim. It works quite well.



