What's interesting is the implicit assumption that mutating memory is O(1).
Unfortunately for real memory (caches, DIMMs, swap etc) that's not true. It's O(log n) for any non-trivial size of memory, and can have bad (constant?) factors for mutating memory that is shared between processors.
Of course functional languages have hidden and not-so-hidden costs too, starting with the garbage collector, but including the time taken to get the larger working set into cache.
I'm interested in any studies that look at these total costs in real systems.
You raise a valid point, and one that a lot of algorithm and data structure textbooks neglect to mention at all; seen purely from a practical perspective, things like locality of data with respect to CPU caches are probably more important than picking a slightly cleverer algorithm/data structure on any modern-day CPU.
This once again raises the more subdued or neglected aspect of "premature optimisation": blindly picking an algorithm that looks better on paper may in reality be much slower than a naive algorithm that better matches the characteristics of a computer.
Some algorithms are O(1), and you could easily be fooled into believing that they are, indeed, always better than counterparts that are asymptotically worse than constant time; that may not be so, though, if the "constant-time" work required for non-trivial inputs takes longer than, say, a linear scan would.
> You raise a valid point, and one that a lot of algorithm and data structure textbooks neglect to mention at all; seen purely from a practical perspective, things like locality of data with respect to CPU caches are probably more important than picking a slightly cleverer algorithm/data structure on any modern-day CPU.
It may not make it into the basics, but locality of data is absolutely considered in a lot of real work on algorithms and data structures. For example, the theoretical advantages of hopscotch hashing rest on data locality concerns. On a hypothetical machine for which memory access always takes the same amount of time, I believe - but have not bothered to prove - that hopscotch hashing and other open addressing techniques should actually be slower than other forms of collision resolution.
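To make the locality argument concrete, here's a hypothetical toy sketch in Haskell of the simplest open-addressing lookup (not hopscotch itself, just plain linear probing; hopscotch adds a bounded neighbourhood around the home slot, and the names here are mine, not from any library). The probes walk consecutive slots of one flat array, so after the first access the following reads tend to hit memory that's already in cache, unlike chasing the pointers of a chained bucket:

    import Data.Array (Array, bounds, (!))

    -- Toy linear-probing lookup over a flat array of slots. Consecutive
    -- probes touch adjacent memory, which is the cache-locality argument
    -- in a nutshell. (A real table also needs insertion, resizing,
    -- deletion tombstones, and a better probing strategy.)
    lookupLP :: Eq k => (k -> Int) -> k -> Array Int (Maybe (k, v)) -> Maybe v
    lookupLP hash key table = go size (hash key `mod` size)
      where
        (lo, hi) = bounds table
        size     = hi - lo + 1
        go 0 _ = Nothing                        -- probed every slot: give up
        go n i = case table ! (lo + i) of
          Nothing                 -> Nothing    -- empty slot: key is absent
          Just (k, v) | k == key  -> Just v
                      | otherwise -> go (n - 1) ((i + 1) `mod` size)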
As for how it affects the question of imperative vs. functional data structure performance, it should tend to favor the imperative ones. The issue is that purely functional data structures aren't allowed to just grab a block of memory all at once and insert data into it at leisure. (That would be a form of mutation.) Instead space has to be allocated as it is needed, in a piecemeal fashion. As a result, longer-lived and more active purely functional data structures will tend to scatter across the heap over time.
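As a minimal sketch of what "piecemeal" looks like (ordinary Haskell, my own toy example): inserting into a persistent binary search tree allocates a fresh node for every node on the path from the root, one small heap object at a time, while every subtree off that path is shared with the old version.

    data Tree a = Leaf | Node (Tree a) a (Tree a)

    -- Path copying: each Node constructor on the way down is a separate
    -- small allocation; the untouched subtrees are shared with the old tree.
    insert :: Ord a => a -> Tree a -> Tree a
    insert x Leaf = Node Leaf x Leaf
    insert x t@(Node l v r)
      | x < v     = Node (insert x l) v r   -- new spine node, right subtree shared
      | x > v     = Node l v (insert x r)   -- new spine node, left subtree shared
      | otherwise = t                       -- already present: share everything

Whether those freshly allocated spine nodes end up near one another in memory is exactly the allocator/GC question discussed below.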
Actually, the question then becomes one of trading off sharing vs cache locality. To wit, for many functional languages memory allocation is essentially just a pointer increment, so there is no a priori reason not to assume that a freshly allocated tree occupies a contiguous region of memory, or at least lies contiguously within each page where it's been allocated.
Claims about data structures in the context of how they're allocated in a modern machine architecture / memory model are not the same thing as claims regarding the semantics that are exposed to the user of a programming language.
I think you do raise an interesting question, namely what the in-memory layout of a tree-like data structure is in e.g. Scheme or Haskell, both when the tree is built from scratch and when it's been updated and a garbage collection run has compacted the data. Oooo, I totally want to investigate this now! (I don't have the time to, but I totally want to.)
Point being, depending on the memory management semantics for heap data in a programming language, any claims about locality for tree data may or may not hold.
This actually raises a really beautiful question: what is the complexity of any garbage collection algorithm that exactly or approximately maximizes the locality of tree-like or directed-graph-like data structures in memory? I want to speculate that perhaps the data structures which people find fast in practice are, in some sense, the ones that enable the gc to preserve locality. I wonder how optimal the layout can be when only linear or nearly-linear-time algorithms are considered!
Very true! Hence the part where I'm thinking out loud regarding how the gc's ordering of data after a copy/compact (in a one-generation gc) impacts locality. And I suppose this also applies to more modern gc setups like multiple generations or parallel collection. Is there some notion of data locality that we can formalize and analyze for at least simplified versions of a modern memory hierarchy? How well do standard gc algorithms perform when analyzed thusly? How efficient can each style of gc algorithm be if it's designed to optimize locality?
It clearly has some relation to optimal caching problems, but with more explicit interdependencies. It also seems to be both an online and an offline problem, depending on how you analyze it, but I need to think on it more.
> As for how it affects the question of imperative vs. functional data structure performance, it should tend to favor the imperative ones. The issue is that purely functional data structures aren't allowed to just grab a block of memory all at once and insert data into it at leisure. (That would be a form of mutation.) Instead space has to be allocated as it is needed, in a piecemeal fashion. As a result, longer-lived and more active purely functional data structures will tend to scatter across the heap over time.
It's a bit more complicated than that. To talk about runtime (and memory) of an algorithm or a data structure, you need to have a machine model. If you look at what happens in, say, Haskell: We write our algorithms in a purely functional way, but we are interested in their runtime after compilation into native code. The compiler is free to do any transformation that preserves correctness, so if it's smart enough, it might grab a block of memory all at once. Or if it can prove that data won't be used again, it can safely change variables in place. (The language Clean even has type system support for that kind of mutation. (See https://en.wikipedia.org/wiki/Linear_type_system))
You might be able to analyse functional algorithms with a machine model that doesn't involve compilation to some form of imperative code. Perhaps by counting reductions in lambda calculus. But then it's hard to compare that to asymptotic performance of imperative algorithms.
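Going back to the "safely change variables in place" point: GHC won't in general discover such rewrites by itself, but here's a small hand-written sketch of the idea using the ST monad (my own example, not from the thread). The function is pure from the outside while the implementation mutates a reference in place, which is safe precisely because nothing outside runST can observe the intermediate states.

    import Control.Monad.ST (runST)
    import Data.STRef (newSTRef, modifySTRef', readSTRef)

    -- Pure on the outside, imperative on the inside: the accumulator is
    -- allocated, mutated in place and read out entirely within runST,
    -- so no caller can ever tell that mutation happened.
    sumSquares :: [Int] -> Int
    sumSquares xs = runST $ do
      acc <- newSTRef 0
      mapM_ (\x -> modifySTRef' acc (+ x * x)) xs
      readSTRef acc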
Absolutely. But many of those hypothetical compiler transformations would only work if the code isn't really taking advantage of some of the most important advantages of purely functional data structures. Persistence, for example.
That, and in a sense it's side-stepping the theoretical issue more than anything else. A compiler that's smart enough to find spots where a performance improvement can safely be gained by replacing a pure structure with a mutable one in the background doesn't really counter the idea that impure structures can be faster so much as acknowledge it.
Yes. And instead of comparing purely functional vs imperative languages, we could sidestep those issues by comparing persistent vs ephemeral data structures, regardless of language.
(For an example of the distinction, straight out of Okasaki's book: the standard purely functional queue implemented as two linked lists has O(1) amortized runtime for adding and removing only if used `single-threaded', i.e. ephemerally. As a persistent data structure you get a worst case of O(n).)
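For reference, a minimal version of that queue (the structure is the standard one from the book; the names and types here are mine):

    -- Front list is dequeued from; back list collects enqueued elements in reverse.
    data Queue a = Queue [a] [a]

    emptyQ :: Queue a
    emptyQ = Queue [] []

    enqueue :: a -> Queue a -> Queue a
    enqueue x (Queue front back) = Queue front (x : back)

    dequeue :: Queue a -> Maybe (a, Queue a)
    dequeue (Queue [] [])            = Nothing
    dequeue (Queue [] back)          = dequeue (Queue (reverse back) [])  -- the O(n) step
    dequeue (Queue (x : front) back) = Just (x, Queue front back)

Used ephemerally, the O(n) reverse happens at most once per n enqueues, so it amortizes away; used persistently, you can keep dequeuing from the same pre-reverse version and pay O(n) every time.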
Sure. Honestly, that's how I had been interpreting it from the beginning. The SO question seems to be specifically talking about algorithms and not programming languages.
Even the "on-paper" analyses can account for some of this, with algorithm analysis models that incorporate cache and NUMA architectures. But people do tend to just look at the classic big-O analyses, which may not be right.
In a way that kind of dependence isn't new. If you look at old versions of Knuth's TAOCP, for many algorithms he has separate analyses for the in-RAM case and the on-tape case, using two different memory models, and in some cases which algorithm is better differs a lot between the two cases. But it may be that algorithms reference books are doing a worse job keeping up now than they used to.
Knuth is probably the only (major) author I can think of off-hand who took the application of his algorithms and data structures so seriously, going so far as to invent his own little instruction set, MIX (later the RISC-style MMIX), in a book that concerns itself with the theory as well as the practical aspects. Most books focus almost exclusively on one or the other.
I think that, aside from things like cache coherency, memory models, and SIMD, computer scientists will have to think about their algorithms (or at least cover it when they write books for undergrads or professionals) in terms of commutativity and associativity, so people can see how parallelisable an algorithm is or can be.
I can't upvote you enough. Too often programmers conflate "functional programming" with "availability of an O(1) array data type". The former does not imply lack of the latter, and (as you stated) lack of the latter does not imply the former.
The important takeaway I think, is that O(log n) approaches O(1) for all practical values of n.
To the downvoters: gee golly, would it hurt you to reply to my comment? If you think I'm wrong then say why!
Edit: O(log n) is clearly worse than O(1) if you just considered an O(log n) array-replacement data structure or hash table-replacement data structure and considered what its performance would be. I mean, how is this even a question?
Also if memory is O(log n) then arrays are O(log n) but BSTs are O((log n)^2).
(Edit II: Let's be clear that BSTs are already O((log n)^2) if you're counting CPU cycles with O(1) memory access. They're O(log n) if you count cache misses or comparisons; or you might say they're O((log n)^2), but the first log is base <2 and the second is a logarithm with base 2^256, so only one really counts. (The second manifests as a cache-line miss if your key comparison walks past the first 32 bytes of your key.))
(Actually memory access is at best O(n^(1/3)) thanks to space being three dimensional, but really worse thanks to cooling concerns, and uh, for sufficiently large n, worse because of gravity.)
> O(log n) is clearly worse than O(1) if you just considered an O(log n) array-replacement data structure or hash table-replacement data structure and considered what its performance would be. I mean, how is this even a question?
Why yes, it is worse in some theoretical manner. But if you think it is necessarily worse as a practical matter, you have failed to understand the math fully.
Suppose that n is 1000 in some toy problem. Then you scale it up to a huge data set, a billion items! The log(n) factor only got worse by a factor of 3. Maybe we didn't go high enough. Perhaps we can take a large data set. Let's see, how about all of the data that is produced at CERN's Large Hadron Collider in a year? That's 15 petabytes. Now the log(n) factor is a whopping 5.39.
In other words while log(n) represents unbounded possible growth, it gets worse in practice by at most a fairly small constant factor.
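A quick check of those numbers (the base of the logarithm cancels in the ratio, so it doesn't matter which one you pick):

    -- How much bigger is log n than it was at n = 1000?
    logFactor :: Double -> Double
    logFactor n = logBase 2 n / logBase 2 1000

    -- ghci> logFactor 1e9       -- a billion items
    -- ~3.0
    -- ghci> logFactor 1.5e16    -- ~15 petabytes, counted in bytes
    -- ~5.39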
How about O(1)? O(1) does not mean fast. It just means constant relative to the data size. It can be a bad constant, but as long as it is a constant it qualifies as O(1).
Let's take a practical example. Wavelets are cool in lots of ways, but one of the properties people noticed quickly is that doing a basic wavelet transform on a data set with n elements takes time O(n). Doing a FFT (Fast Fourier Transform) on the same data set takes time O(n log(n)). Yay, we're faster!
But hold on a minute. The interesting wavelets that we like to use are O(n) with a worse constant than the FFT. So the FFT tends to be faster on practical data sets. (There are lots of reasons why one would prefer wavelets over the FFT, but speed of calculation is not generally one of them.)
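To put hypothetical numbers on that (the constants below are invented purely for illustration, not measured): if a wavelet transform cost about 50n operations and an FFT about 5·n·log2(n), the "asymptotically slower" FFT would win for every n below about a thousand.

    -- Invented constants, for illustration only.
    waveletCost, fftCost :: Double -> Double
    waveletCost n = 50 * n
    fftCost     n = 5 * n * logBase 2 n

    -- fftCost n < waveletCost n  <=>  logBase 2 n < 10  <=>  n < 1024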
> But if you think it is necessarily worse as a practical matter, you have failed to understand the math fully.
Congratulations: I don't think one algorithm is necessarily worse, so your entire post was a waste of effort.
I mean, seriously, you just made a post teaching how logarithmic factors can be small enough that they don't matter, in reply to my post, which provided an example of that very thing.
I was just explaining the comment that O(log(n)) approaches O(1) for all practical values of n.
If you understood this, then why did you start off your post arguing that O(log(n)) was worse than O(1)?
Incidentally, comparing a logarithmic lookup to a hash lookup: a lot of NoSQL solutions use hash tables because O(1) is good, right? For instance, look at Apache Cassandra. By contrast, Google's BigTable uses an ordered data structure, which is logarithmic.
Guess what, if you've used both, BigTable is better. Because the extra log cost is irrelevant for the normal use case, and being ordered reduces the operational cost of things like range searches.
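To illustrate the range-search point in Haskell terms rather than BigTable's (my own sketch, using the ordinary Data.Map API): an ordered map answers a range query in time proportional to the size of the slice, while a hash table has no choice but to look at every key.

    import qualified Data.Map.Strict as M

    -- Half-open range query [lo, hi): two O(log n) splits plus O(k)
    -- to list the k results.
    rangeQuery :: Ord k => k -> k -> M.Map k v -> [(k, v)]
    rangeQuery lo hi m
      | hi <= lo  = []
      | otherwise =
          let (_, atLo, above) = M.splitLookup lo m     -- keys > lo, plus lo itself
              (between, _, _)  = M.splitLookup hi above -- keys strictly between lo and hi
          in maybe id (\v -> ((lo, v) :)) atLo (M.toAscList between)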
I found the non-argument that you two just had to be interesting, informative, and entertaining. So don't feel too bad about losing the argumentative-agreement battle - it was great while it lasted!
"I can't upvote you enough. Too often programmers conflate "functional programming" with "availability of an O(1) array data type". The former does not imply lack of the latter, and (as you stated) lack of the latter does not imply the former."
Yes, even pure Haskell has unboxed, mutable O(1) arrays. They're just a bit hidden from the beginner.
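For instance (a minimal sketch using the standard array package; the histogram itself is just my example): runSTUArray gives you a genuinely mutable, unboxed array on the inside and an ordinary pure UArray on the outside.

    import Data.Array.ST (newArray, readArray, writeArray, runSTUArray)
    import Data.Array.Unboxed (UArray)

    -- O(1) in-place reads and writes into an unboxed mutable array;
    -- runSTUArray freezes it into a pure, immutable UArray at the end.
    histogram :: Int -> [Int] -> UArray Int Int
    histogram buckets xs = runSTUArray $ do
      counts <- newArray (0, buckets - 1) 0
      let bump x = do
            let i = x `mod` buckets
            c <- readArray counts i
            writeArray counts i (c + 1)
      mapM_ bump xs
      return counts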
To my knowledge, mutating memory is asymptotically the same as just reading it. Hence any asymptotic slowdown purely functional languages have will translate, regardless of how long memory operations take.
Are you sure? Writing to memory that might be shared between processors certainly isn't the same as reading, and so a parallel algorithm (and what isn't, these days?) should do better sharing lots of read-only data, rather than mutating shared state.