Parallelising Python with Threading and Multiprocessing (quantstart.com)
94 points by shogunmike on May 3, 2014 | 37 comments



I'd like to point out that the Python standard library offers an abstraction over threads and processes that simplifies the kind of concurrent work described in the article: https://docs.python.org/dev/library/concurrent.futures.html

You can write the threaded example as:

  import concurrent.futures
  import random

  def generate_random(count):
    return [random.random() for _ in range(count)]

  if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
      executor.submit(generate_random, 10000000)
      executor.submit(generate_random, 10000000)
    # I guess we don't care about the results...
Changing this to use multiple processes instead of multiple threads is just a matter of s/ThreadPoolExecutor/ProcessPoolExecutor/.

You can also write this more idiomatically (and collect the combined results) as:

  if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
      out_list = list(
        executor.map(lambda _: random.random(), range(20000000)))
In this example, this will be quite a bit slower, because each work item (generating a single random number) is trivial compared to the overhead of maintaining a work queue of 20,000,000 items. But in the more typical case where each work item takes more than a millisecond, it is better to let the executor manage the division of labour.


To take advantage of process-level parallelism you still have to have a picklable function, i.e. one defined at the top level of a module.

  In [1]: import concurrent.futures

  In [2]: with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
     ...:     out_list = list(executor.map(lambda _: random.random(), range(1000000)))
     ...:
  Traceback (most recent call last):
    File "/usr/local/Cellar/python/2.7.6_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
      send(obj)
  PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed


Good point. Just a couple of points on futures: 1) they're backported to Python 2 [1], and 2) as you say, to make the example work you need a picklable function. For example, if you have IPython running in a virtualenv:

    import pip
    pip.main(["install", "futures"])

    import random

    def l(_):
        return random.random()

    with f.ProcessPoolExecutor(max_workers=4) as ex:
        out_list = list(ex.map(l, range(1000)))

    len(out_list)
    #> 1000
[1] https://pypi.python.org/pypi/futures


Whoops, I forgot to add a line to import futures:

    import futures as f  # include after pip.main(...)


Wow - that is significantly more elegant than what I discussed in the article!

I wasn't aware of the concurrent.futures library, thanks for pointing it out.


concurrent.futures is nice, but it's a real shame that ThreadPoolExecutor doesn't take an initializer argument like multiprocessing.Pool does; e.g., if you want a bunch of processes to work on a big data file, it's convenient to have all the workers load that file at initialization. See https://code.google.com/p/pythonfutures/issues/detail?id=11
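
For comparison, a minimal sketch of the multiprocessing.Pool initializer pattern being described (the file path and worker function are made up for illustration):

    import multiprocessing

    _data = None  # populated once per worker process

    def init_worker(path):
        # Runs once in each worker at pool startup, so every worker
        # loads the big file a single time rather than per task.
        global _data
        with open(path) as f:
            _data = f.read()

    def work(i):
        # Uses the preloaded data instead of reloading it.
        return len(_data) + i

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=4,
                                    initializer=init_worker,
                                    initargs=("big_data.txt",))
        results = pool.map(work, range(10))
        pool.close()
        pool.join()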


You should probably file your bugs here: http://bugs.python.org/ with the expectation that fixes will be backported.



This example is not very realistic; it narrows things down to the case where a job can be divided into isolated tasks with no shared data or state.

Often, threads need to update a shared dict, list, etc. With multiprocessing this cannot be done. You can use a Queue for this, but it's horribly inefficient.

Generally speaking if you need performance and Python is not meeting the requirements then you are better off using another language.


Multiprocessing supports "managed objects", shared across multiple processes -- example: http://johntellsall.blogspot.com/2014/05/code-multiprocessin...

I'm unsure of the performance ramifications vs. using concurrent.futures.
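
For reference, a minimal sketch of a managed dict shared across processes (every access is proxied through the manager process over IPC, which is the likely performance cost):

    import multiprocessing

    def record(shared, key):
        # Writes go through the manager process, not shared memory.
        shared[key] = key * key

    if __name__ == "__main__":
        manager = multiprocessing.Manager()
        shared = manager.dict()
        procs = [multiprocessing.Process(target=record, args=(shared, i))
                 for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(shared))  # {0: 0, 1: 1, 2: 4, 3: 9}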


I agree, the scope of the article is somewhat specific to the "toy" example presented.

Generally I would use C++ or (gasp!) Fortran with either MPI or CUDA for these sorts of tasks if performance was the most critical factor.

I'm excited by the Julia language though!


It's Python 3, but there is a backport of it for Python 2 that works wonderfully. I've recently begun using it and will never look back to multiprocessing (on which it is built).


You probably meant to reply to the thread about concurrent.futures.


> With multiprocessing this cannot be done. You can use a Queue for this but it's horribly inefficient.

So document that this is not just a problem with some program you wrote, but with multiprocessing itself.


For the everyday cases where I want to make embarrassingly parallel operations in Python go fast, I find joblib to be a pretty good solution. It doesn't work for everything, but it's quick and simple where it does work.

https://pythonhosted.org/joblib/
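
A minimal sketch of the joblib pattern (math.sqrt is just a stand-in for real work):

    import math
    from joblib import Parallel, delayed

    # delayed() captures the function and its arguments as a job;
    # Parallel fans the jobs out across 4 worker processes.
    results = Parallel(n_jobs=4)(delayed(math.sqrt)(i) for i in range(10))
    print(results)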


I was going to discuss Parallel Python (http://www.parallelpython.com/) in the next article - have you used that? How does it compare to joblib?


I haven't used that, but it looks interesting. After a brief look it seems like they both submit jobs to Python interpreters started up in other processes.

Parallel Python (PP) seems to have a clunkier API, but also more functionality. I think the biggest advantage is that it can distribute jobs over a cluster instead of just different cores on the same machine. I might look into PP if I need to do things on a cluster, but I think I'll still stick with joblib when I'm on one machine.

That's just my first impression. I'd be interested to read your blog post.


Feel free to check out http://docs.openstack.org/developer/taskflow/ (it has a similar set of concepts to both of the libraries mentioned here, joblib and Parallel Python).


I've had good success using Celery to parallelize tasks/jobs in python.

www.celeryproject.org

Also, it has a very nice concept called canvas that allows you to chain/combine the data/results of different tasks together.

It also allows you to switch out different implementations of the communication infrastructure that Celery uses to communicate and dish out tasks.
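
A minimal canvas sketch (the broker URL is a placeholder and assumes a worker is running) of chaining tasks so one result feeds the next:

    from celery import Celery, chain

    app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder

    @app.task
    def add(x, y):
        return x + y

    @app.task
    def mul(x, y):
        return x * y

    # add(2, 2) runs first; its result (4) is passed as the first
    # argument to mul, so the chain computes mul(4, 3) == 12.
    result = chain(add.s(2, 2), mul.s(3))()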


For python developers who dislike the continued existence of the GIL in a multicore world, and who feel that multiprocessing is a poor response given the existence proofs of IronPython and Jython as non-GIL interpreter implementations, please consider moving to Julia.

Julia addresses nearly all the problems I've found with Python over the years, including poor performance, poor threading support on multicore machines, integration with C libraries, etc. I was a big adherent of Python, but as machines got more capable, the ongoing resistance to solving the GIL problem (which IronPython demonstrated can be done with reasonable impact on serial performance) meant I could not continue using the language except for legacy applications.


This comment overstates the current power of Julia's parallel programming model — as of now Julia has no real tools for shared-memory parallelism and probably will not for another few versions or so. For distributed memory Julia is great, but please do not use Julia if you are being hindered by the GIL.

(NB I say this as a big Julia evangelist. It has a lot of potential but is not really there yet on a number of things, this being one of them.)


This is not strictly accurate. Julia does not support multi-threaded parallelism, but there is decent (if, yes, still immature) support for multi-process shared memory parallelism - similar to Python's multiprocessing library. Not an alternative to the GIL as such, but definitely more than nothing.

One nice example using this is a shared memory, parallel sparse matrix multiplication implementation:

https://github.com/madeleineudell/ParallelSparseMatMul.jl


I take back my claim. Thanks for pointing this out!

If they can't support true multithreading without having to pack messages or use /dev/shm, fuck em.


I don't know what you are talking about. The GIL has never bothered me. I have been using Python together with multiprocessing and threads via concurrent.futures. For integration with C libraries I use Cython; generally, interfacing with C is one of Python's strong points, so I don't know where you got that from. Have you actually looked into why Python has a GIL? It's a pretty clear trade-off, I think. It seems intuitive to me that requiring lots of small locks to avoid a global lock might not be beneficial, and attempts to get rid of it, such as PyPy's work on software transactional memory, involve big changes, so it's not like you can decide overnight "let's get rid of the GIL".

Julia looks nice but comes with its own set of problems: no inheritance, 1-based indexing, fewer libraries, less maturity.


Yes, I have looked into why Python has a GIL. I've even written C interface code which released the GIL and then reacquired it when necessary (I know a ton about this, having spent too many of the last 20 years integrating C and Python). Yeah, I actually know what the tradeoffs are and can evaluate them (I used to work with the author of IronPython).

You have several choices for C integration in Python: SWIG, which is now generally considered a huge mess; hand-wrapping, which is a tedious pain; and dlopen/dlsym methods that talk to the C API directly (which require something like GCCXML to handle type recognition for complicated APIs).

I don't think PyPy's approach to transactional memory is the right direction either.

In short: multithreading on multicore machines is how you write performant software in industry. The hardware is designed for it, the compilers are designed for it, and if you don't take advantage of it, you're just wasting machines.

Now people could argue that multiprocessing addresses it, but it's just message passing between different process spaces, which, while a wonderful and powerful tool, is ultimately more cumbersome (hey, I used to write big MPI/OpenMP apps that used both models at the same time).

Anyway, the ultimate existence proof is that IronPython was faster than CPython both serially and in parallel, without the GIL. So basically we know it's possible. The Python developers have no will, inclination, or ability to make it so.


It would be interesting if you could give some arguments for your positions. Why is STM not the way? Why, if IronPython is as good as you say it is, doesn't it see greater adoption, and why don't other implementations use its strategies for removing the GIL? Wikipedia says that IronPython scores worse on PyStone benchmarks than CPython, and it's likely that this is a consequence of IronPython's fine-grained locking, which is required in the absence of a GIL.

As for interfacing with C, like I said, Cython really makes this a lot easier than the approaches you mention. You mention IronPython as not having a GIL, but then IronPython doesn't allow easy interfacing with C code, e.g., it's not compatible with numpy ...


Python threads, aren't they just single-threaded execution?!


Yes, they are concurrent but not parallel.


See other replies to the post you've replied to: threads in Python can be "parallel" if one of the threads releases the GIL. This can happen during calls to I/O, or more generally, during any C call that decides to release the GIL. Most of the time you're doing I/O anyway, so it suffices. If you're not (you're truly doing computation), then there is multiprocessing.
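
As a minimal sketch (the URL is a placeholder), I/O-bound fetches overlap across threads because the blocking socket calls release the GIL:

    import concurrent.futures
    import urllib.request

    URLS = ["http://example.com/"] * 4  # placeholder workload

    def fetch(url):
        # The GIL is released while this thread blocks on the network.
        with urllib.request.urlopen(url) as resp:
            return len(resp.read())

    if __name__ == "__main__":
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            print(list(executor.map(fetch, URLS)))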


Yea - decent approach if you're doing a lot of IO but for computation you're limited by the GIL.


People don't seem to understand this. If you are doing lots of IO, Python threads are fully performant and the GIL isn't an issue at all.


Have you seen there is an error in the code for the threading part?

The right way, if you want to use a thread, is thread = threading.Thread(target=CALLABLE, args=ARGS)

and not

thread = threading.Thread(target=CALLABLE(ARGS))
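
Concretely, with a worker like the article's (the function here is just for illustration):

    import random
    import threading

    def generate_random(count):
        return [random.random() for _ in range(count)]

    # Right: the new thread itself calls generate_random(1000000).
    thread = threading.Thread(target=generate_random, args=(1000000,))
    thread.start()
    thread.join()

    # Wrong: generate_random(1000000) runs immediately in the main
    # thread, and its return value (a list) is passed as target.
    # thread = threading.Thread(target=generate_random(1000000))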


For the example task we could use the multiprocessing Pool and (the undocumented) ThreadPool.

This implements the worker pool logic already so we don't have to.
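
A minimal sketch of both pools on the article's kind of task (the sizes are arbitrary):

    import random
    from multiprocessing import Pool
    from multiprocessing.pool import ThreadPool  # same API, thread-backed

    def generate_random(count):
        return [random.random() for _ in range(count)]

    if __name__ == "__main__":
        # Process-backed pool: real parallelism for CPU-bound work.
        pool = Pool(processes=2)
        out = pool.map(generate_random, [5000000, 5000000])
        pool.close()
        pool.join()

        # Thread-backed pool with the identical interface.
        tpool = ThreadPool(processes=2)
        out = tpool.map(generate_random, [5000000, 5000000])
        tpool.close()
        tpool.join()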


I was considering adding this but I wasn't fully sure that it would be good content for a "first intro to parallel programming" article. Perhaps a good candidate for the next one?

Thanks for mentioning it though.


For network-bound operations, Twisted's cooperate/coiterate come in handy.
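
A minimal sketch of task.cooperate (the work loop is a placeholder); the generator yields between steps so the reactor keeps servicing network events:

    from twisted.internet import reactor, task

    def slow_work():
        for i in range(5):
            print("step", i)
            # Each yield hands control back to the reactor so other
            # events (e.g. network traffic) are serviced between steps.
            yield None

    d = task.cooperate(slow_work()).whenDone()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()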


Or just use a better language. One that is actually compiled and fast?


Speed is just one of the many things that people consider when evaluating what language/stack to use for a particular job/task.



