
Why should FP be slower?

The article emphasises that one of the main arguments in favour of FP is that it's easy to see which parts of the recipe can be done in parallel. Why can't the runtime automatically leverage multiple cores and speed up the process?



Immutable persistent data structures used in FP tend to perform slightly worse than their "conventional" counterparts. So if you can manage to run your non-FP program in parallel with a shared mutable data structure and no synchronisation you should see better performance. In practice, however, you'll almost certainly need synchronisation on mutable shared state and it might very well be slower. So neither approach is per se faster than the other.
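To make the synchronisation point concrete, here's a toy Python sketch (the worker function and thread counts are made up for illustration): the shared counter is mutable state, so every update has to take a lock, and that lock serialises the "parallel" work.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        # Synchronisation point: only one thread can update at a time,
        # so the "parallel" increments are effectively serialised here.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: correct, but each increment paid for the lock
```

Drop the lock and the increments race; keep it and you pay for it on every update. That's the trade-off the comment above is describing.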


There are ways to encode imperative mutable updates to data structures in a functional style with tail calls and sufficient compiler smartness. I don't think you really come out ahead if you go down that route though.


To sum it up, it's a kind of chicken-and-egg problem. Imperative/OO code is tied to the physical implementation of processors, whereas FP is tied to a mathematical construct for which conventional CPUs are not optimized.

Related:

https://github.com/tommythorn/Reduceron

https://stackoverflow.com/questions/15005670/why-is-there-no...


Modern hardware struggles to impersonate a C/C++ abstract machine too. Machines with multiple cores and distributed memory hierarchies are difficult to program effectively with old OO/imperative languages.


> Why should FP be slower?

It's easier to move an object 1 cm to the left than it is to bootstrap a universe where the object happens to be 1 cm to the left.

> Why can't the runtime automatically leverage multiple cores and speed up the process?

That is more related to proper dataflow analysis than immutability, and this is much more mature in VM runtimes that are actually used (hundreds of man-centuries vs. the odd PhD thesis).


> It's easier to move an object 1 cm to the left than it is to bootstrap a universe where the object happens to be 1 cm to the left.

Structure sharing solves this, if my understanding of what you're implying is correct.
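A minimal Python sketch of structure sharing (names are illustrative): "changing" a persistent list builds one new cell and reuses the old list as the tail, so nothing gets copied.

```python
from typing import Any, NamedTuple, Optional

class Cell(NamedTuple):
    """One immutable cell of a persistent singly linked list."""
    head: Any
    tail: Optional["Cell"]

def prepend(x, lst):
    # O(1): allocate one new cell; the old list becomes the shared tail.
    return Cell(x, lst)

old = prepend(2, prepend(3, None))   # the list [2, 3]
new = prepend(1, old)                # the list [1, 2, 3]

# The "new universe" shares almost everything with the old one:
assert new.tail is old               # same object, not a copy
```

So the new list costs one allocation regardless of how long the old one is; that's the trick behind Clojure's and Haskell's persistent collections.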


> It's easier to move an object 1 cm to the left than it is to bootstrap a universe

That's like saying sorting can never be fast when the only algorithm you have is bubble sort.


> Why can't the runtime automatically leverage multiple cores and speed up the process?

Do any runtimes actually do this? (Honest question)


Erlang/Elixir's runtime, OTP (Open Telecom Platform), will do this. It can scale to multiple cores, or across multiple machines in a network, almost effortlessly.

https://en.wikipedia.org/wiki/Erlang_(programming_language)


It doesn't happen automatically on BEAM - you still need to explicitly spawn your processes.


No. The reason? In the nanosecond-to-microsecond range your CPU already parallelises your code automatically via out-of-order execution. Unless you're wastefully busy-waiting, the latency of issuing a new job to a second thread is in the tens of microseconds. You can't just issue a new job without first knowing how long it will take and whether it's even worth parallelising. Profiling every line of code at runtime would introduce even more overhead. It's far simpler to let the developer annotate the parts of the codebase that should run in parallel.

Example: r = a/b + c*d can in theory be parallelised, but you'd pay ~10 microseconds in overhead, which is likely more than 5000 times slower than just running it on a single core.

TL;DR: automatically parallelising a program is trivial; making a parallelised program run faster than a non-parallelised one is not.
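You can see this overhead from Python (a sketch; exact timings vary by machine, and Python adds its own interpreter cost on top): computing r = a/b + c*d inline versus shipping the c*d half to another thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

a, b, c, d = 3.0, 2.0, 5.0, 7.0

# Inline: the CPU's out-of-order engine already overlaps the divide
# and the multiply for free.
t0 = time.perf_counter()
r_inline = a / b + c * d
t_inline = time.perf_counter() - t0

# "Parallel": dispatch c*d to a worker thread and wait for the result.
with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    fut = pool.submit(lambda: c * d)
    r_parallel = a / b + fut.result()
    t_parallel = time.perf_counter() - t0

assert r_inline == r_parallel
print(t_parallel / t_inline)  # dispatch overhead dwarfs the arithmetic
```

Same answer both ways, but the dispatch-and-wait path pays thread-handoff latency that the arithmetic itself can never amortise.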


AFAIK this is still an active area of research.

In principle pure functional programs can be freely parallelized, but in practice it's hard to do it completely automatically without losing more in overheads than you gain through parallelism.

In Haskell I think the standard approach is still to annotate your code, indicating which parts should be evaluated in parallel. See https://wiki.haskell.org/Parallelism
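The same "annotate, don't infer" idea has a Python analogue (a sketch, not Haskell's actual API; `slow_square` is a stand-in for real work): the programmer explicitly picks a parallel map where it's worth it, and the semantics don't change.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    return x * x          # stand-in for genuinely expensive work

xs = range(8)

# Ordinary sequential map.
sequential = list(map(slow_square, xs))

# Explicitly chosen parallel map -- the annotation is the choice of API.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(slow_square, xs))

assert sequential == parallel   # same result, by design
```

As with Haskell's `par` annotations, swapping one map for the other changes only the evaluation strategy, never the answer.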


Even if all of your functions are pure, they can still have data dependencies on one another, so you obviously can't parallelize foo(bar()) automatically. Other things can make a real mess of parallelizability too, like error checking.

Say you have some code like:

    for item in items:
        if not verify(item):
            raise RuntimeError("failed at " + str(name(item)))
The compiler can't auto-parallelize it because the exit condition is unknowable, at least if you want there to be only a single exit from the function.

You can get around this with even more strict requirements, but after enough of these restrictions even simple tasks start to get hard.


Edit: Apache Spark. See below.

Some do. But more importantly, if your runtime doesn't, then it's still more obvious how to add parallelism, and more self-contained when you do.

Let's say you were using asynchronous IO in a functional style; not even full-on monads, just async/await like in C#. Since you have explicit data dependencies, you can easily see where the `await` is unnecessary, or coalesce multiple calls into await Task.WhenAll(...).

    var slow = await conn.Query(@"slow query");
    var fast = await conn.Query(@"select top 1 * from some_table");
    var emailresult = await SendEmail(fast);
    return (slow, emailresult);
Instead of awaiting `slow` before continuing with `fast`, it's fairly obvious that you could start the task and not await it until the others have already kicked off:

    var slowTask = conn.Query(@"slow query");
    var fast = await conn.Query(@"select top 1 * from some_table");
    var emailresult = await SendEmail(fast);
    return (await slowTask, emailresult);
That way, your email sends almost immediately, since line 1 never relinquishes control flow and just continues straight on. Then your `await slowTask` only blocks the return, instead of the entire function. This is pretty big if you're lining up a hundred slow IO tasks.

Note: while async IO gives you the actual parallelism, it's only because of the explicit dependencies that we are able to see which parts can be done simultaneously. I've been coding like this for so long I can barely imagine how to do it imperatively, but if you exposed an interface like `async Task GetSlowQueryResults()` and `async Task SendEmail()` using internal fields for intermediate results, it's utterly opaque and you can't tell which has to complete first.

Edit to respond better to your actual question:

There's a kinda big problem with auto-parallelizing, which is that in FP, outside of IO, everything is parallelizable. Spinning up threads or even shuffling data within a thread pool is pretty costly, so most attempts at micro-parallelizing everything haven't gone anywhere. You can annotate things in Haskell to run in parallel, or simply choose a parallel map or parallel reduce instead of a sequential one.

On the other hand, when spinning up threads/processes/jobs is relatively inexpensive compared with the cost of actually computing things, the idea is very successful. Take Apache Spark. You write a program in functional Scala or Python, and the scheduler will automatically distribute work over your cluster if it's big enough to warrant it. The scheduler also makes note of where intermediate results are, and will happily cache, reuse, shuffle or partition them around to continue on with the work. The engine essentially evaluates a DAG, with some optimizations for transport/persistence costs.


Matlab.


It seems obvious to me. All the copying of immutable data.

To turn the question on its head you could ask, "why are GPUs or DSPs so blazingly fast?" Part of the answer is they do a lot of operations "in-place", treating data as mutable wherever possible, in parallel, using heavy pipeline optimization to trawl through and update memory in place, sometimes using "modulo addressing" modes to make hardware-supported FIFO structures.

These techniques are all aimed at reducing the number of memory fetches, cache misses, etc. I don't know how FP languages are implemented at the runtime, but I'm guessing they don't go to these extremes.
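As a hypothetical sketch of that "modulo addressing" trick in Python (DSPs do this in hardware; the class and sizes here are made up): a fixed buffer is updated in place, with the write index wrapping via modulo, so pushing never allocates.

```python
class RingBuffer:
    """Fixed-size FIFO storage updated strictly in place."""

    def __init__(self, size):
        self.buf = [0] * size
        self.i = 0

    def push(self, x):
        self.buf[self.i] = x                     # overwrite in place
        self.i = (self.i + 1) % len(self.buf)    # modulo addressing: wrap around

rb = RingBuffer(4)
for v in [1, 2, 3, 4, 5, 6]:
    rb.push(v)
print(rb.buf)  # [5, 6, 3, 4] -- the two oldest slots were reused
```

No cell is ever copied or reallocated; the oldest data is simply overwritten. That's the opposite philosophy from persistent structures, and it's why it's cache- and pipeline-friendly.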


Immutable data does not necessitate copying the entire structure when it is shared or changed. I encourage you to take a look at persistent data structures[0]. Okasaki's "Purely Functional Data Structures"[1] is a good resource if you'd like to look at this in more depth. Rich Hickey's talk about the implementation of persistent data structures in Clojure is also a nice introduction[2].

[0]: https://en.wikipedia.org/wiki/Persistent_data_structure

[1]: http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf

[2]: https://www.infoq.com/presentations/Value-Identity-State-Ric...


Only skim-read so far but that's interesting stuff. I had no idea FP data structures were implemented like this: I consider myself duly educated, thanks!


Because sometimes this is a win, but sometimes the overhead of forking off a separate thread and synchronising on the result costs more than the benefit. Compilers are very bad at figuring out which is which. Haskell lets you add parallelisation instructions to tell the compiler which is which without changing the semantics of the program.


Because many of the problems you run into when trying to make them fast turn out to be NP-complete, with slow or inferior heuristics.


Without persistent immutable data structures, FP would be much, much slower.


Yep. I wanted to try some immutable data work, so I looked into using an immutable library in Ruby (Hamster), but it was so much slower... so I went to Elixir and never looked back.



