Additionally, his premise that CPUs will continue to get more and more cores is probably also flawed. The diminishing returns from extra cores, thanks to Amdahl's law, will make them less and less economically viable.
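For reference, Amdahl's law puts the speedup on n cores at 1 / ((1 - p) + p/n), where p is the fraction of the work that parallelizes. A quick sketch of how fast that flattens out (plain Haskell, with a made-up 95% parallel fraction):

    -- Amdahl's law: speedup on n cores, given parallelizable fraction p.
    amdahl :: Double -> Int -> Double
    amdahl p n = 1 / ((1 - p) + p / fromIntegral n)

    main :: IO ()
    main = mapM_ report [2, 4, 8, 16, 64]
      where
        -- Even with 95% of the work parallelizable, 64 cores buy only ~15x.
        report n = putStrLn (show n ++ " cores: " ++ show (amdahl 0.95 n) ++ "x")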
Intel and AMD want to continue making processors that cost more than $150. If feature size continues to shrink, what else are they going to do with the extra room on the chip? (other than slash their own prices using ever-more-expensive fab equipment)
In single threaded code, the unused cores can be powered off.
Recently I heard a talk by a guy from Oracle who designs SPARC CPUs. IIRC, he said that while they could potentially put bigger caches on their CPUs, for their test cases it wasn't much of a benefit above their chosen size.
Another interesting note from that talk was that they had spent a large amount of time making an 8 socket architecture with quick-connect paths between all pairs of sockets for memory transport. Unfortunately that was not as important as believed either, since the programs that typically benefit from 8 CPUs normally have good memory properties anyway.
Not sure if either data point is accurate, but interesting nonetheless.
And with programs for which Amdahl's Law isn't a huge concern, multiprocessing isn't necessarily even the most appropriate form of parallelism. Multiple cores is great for situations where the sequential fraction of a program is relatively high, because it helps minimize the cost of synchronization.
But it can also bring some problems that might limit performance, such as bottlenecks on shared resources like memory and I/O. If the problem can be divided into fairly large and independent chunks of work, on the other hand, then distributed computing becomes a lot more attractive.
That would only be true if systems are only capable of running a single program. Luckily I think there will be utility in running 10,000 processes on a box at the same time.
> A single core of my i7 does something like 6 or 7x the work a single P4 core does, at half the mhz.
Do you have a reference for this? I'd be surprised if it's more than 2x for general-purpose code. (It might be more for specialized stuff like video decoding, which I guess has improved hardware support these days.)
A NetBurst-vintage P4 at 1.8GHz rates 217 (the original 2001 P4); 358 if we move up to the 2.8GHz model (2003). So anywhere from almost 10x the performance down to 5.5x, depending on model, clock speed, etc.
Note: Pentium is a brand, and later became the budget name for C2Ds, but this comparison is to P4s.
No idea what cpubench uses, but it's not solely a video benchmark; it's more of a mixed bag.
It depends on how the benchmark was written. If it concentrates on crunching numbers and doesn't use much memory, then the caches are being hit, and caches/memory in general are where a lot of the speed gains of the last few years have come from (along with price decreases).
But for some more general application, this might not be true. Then again, it depends. For example, a lot of the Adobe 2D filters would gain from this, if the data is accessed the right way (swizzled if possible). But such algorithms usually tend to be among the 13 dwarfs that are easily parallelized using OpenMP or by rolling your own threaded version. Or even with CUDA/OpenCL/DirectCompute - but then you might have to pay for the communication between CPU and GPU, and for loss or instability of results across machines (different floating-point accuracy tradeoffs for the sake of speed).
But then the problem comes with state machines, for example LZ compression, or anything that relies on earlier results. This is very hard to data-parallelize.
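A toy way to see the contrast (a Haskell sketch using the parallel package in place of OpenMP; not real image filtering or real LZ):

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    -- Filter-style work: each output element depends only on its own input,
    -- so the data splits into chunks that can be processed in parallel.
    brighten :: [Double] -> [Double]
    brighten pixels = map (* 1.2) pixels `using` parListChunk 4096 rdeepseq

    -- LZ-style work: every step needs the state left behind by the previous
    -- one, so there is no independent work to hand out to other cores.
    runStateMachine :: (s -> a -> (s, b)) -> s -> [a] -> [b]
    runStateMachine step = go
      where
        go _ []       = []
        go s (x : xs) = let (s', y) = step s x in y : go s' xs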
Well, it depends on what we want to measure with synthetic benchmarks, but general terms like "processing power" can fairly be interpreted as the general, common PC tasks that these benchmarks address.
But yeah, if your load is specific, like floating-point calculation or MP4 encoding, then it will vary. Still, it's a good shorthand for having these types of conversations, especially when people go off half-cocked about how CPU progress has stalled, which it clearly has not.
You can see it by comparing results on Passmark's website, for one.
It's because of a lot of things. Perhaps the i7 has more execution pipelines, but significantly, it also has a much shorter pipeline. Stalls were really, really expensive on the P4's 30-stage pipeline. Plus there's hyper-threading. And of course the i7's got a lot more going for it in the cache department.
Architecture improvements continue, but if you look at the single-core performance curve, it has been much flatter since 2004.
The "won't ever get faster" part should be qualified, though, with "at least not on silicon". There is still the possibility of a breakthrough using some other semiconductor -- silicon germanium, buckytubes, or something. Any such thing is years away, admittedly.
Unless the breakthrough involves making arbitrary-thickness chips, it's only going to be a modest improvement; the best possible density/frequency for any realistic material isn't that many half-nodes better than the best for silicon.
If you look at his source for that, it is based purely on clock speed and not (as you might assume from the context) on "the speed at which <some given software> runs".
The assertion that we hit the limit in 2004 and CPUs have not gotten any faster (in the useful sense of the word) is ridiculous, and for someone to still be perpetuating the gigahertz myth in 2012 is bewildering. My MacBook Pro has a 2.2GHz Core 2 Duo, and my netbook has a 1.6GHz Intel Atom, and I can tell you now the MBP's single-core speed is not merely 40% faster!
I call bullshit. Researchers have been trying to make auto-parallelizing compilers for decades without anything substantive making it to industry.
Even in functional languages, auto-parallelization hasn't worked well because of the coarseness issue: it's difficult for compilers to figure out what to run in multiple threads and what to run single-threaded because of tradeoffs with inter-thread communication.
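You can feel the coarseness problem even when you pick the split points by hand (Haskell sketch using the parallel package):

    import Control.Parallel (par, pseq)

    -- Naive divide-and-conquer sum: sparks a parallel task at every split,
    -- right down to single elements. The scheduling overhead dwarfs the
    -- one addition each leaf actually performs.
    psumFine :: [Int] -> Int
    psumFine []  = 0
    psumFine [x] = x
    psumFine xs  = l `par` (r `pseq` (l + r))
      where
        (ls, rs) = splitAt (length xs `div` 2) xs
        l = psumFine ls
        r = psumFine rs

    -- Coarser version: only split above a threshold, go sequential below it.
    -- Choosing that threshold is exactly the judgement call a compiler has
    -- trouble making automatically.
    psumCoarse :: Int -> [Int] -> Int
    psumCoarse threshold xs
      | length xs <= threshold = sum xs
      | otherwise              = l `par` (r `pseq` (l + r))
      where
        (ls, rs) = splitAt (length xs `div` 2) xs
        l = psumCoarse threshold ls
        r = psumCoarse threshold rs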
Have you used the Parallel Extensions? They aren't always faster, and can often over-commit resources.
It's just like the GPU, you need massive amounts of data to overcome the communications overhead.
That F# can outperform C on numeric code should tell you that the optimizations available to functional code far exceed those available to languages that poke bits.
This is just one more micro-optimization to overcome what a bad idea it is to poke beads on an abacus rather than use mathematical rigor available in calculus.
Do you have a citation for this? I love F# and I've written some fairly pointer-intensive code with it, but the CLR's code-gen is pretty bloody abysmal[1]. F# does more optimizations than C#, but most of the heavy lifting is left up to the CLR/JIT (obviously).
In fact, I find I have to experiment with F#'s inline feature for high-perf code, because the CLR optimizer works far better on big functions than on inlining/optimizing calls to smaller ones. Even basic things like eliminating the allocation of a tuple when it's obvious you're immediately deconstructing it are not done. F# doesn't even lift lambdas, as of 2.0.
F# doesn't even do fusion like Haskell, so for high-perf code, you're often giving up nice functional code and resorting to loops and mutability. Just check the F# stdlib implementation.
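(For anyone unfamiliar with what fusion buys you, this is roughly the single-pass rewrite it aims for, written out by hand; whether GHC actually fuses a given pipeline depends on the functions involved:)

    import Data.List (foldl')

    -- The "nice functional" version: read literally, it builds two
    -- intermediate lists on the way to a single Int.
    sumSquaresOfEvens :: [Int] -> Int
    sumSquaresOfEvens = sum . map (^ 2) . filter even

    -- What fusion (or a human resorting to loops) turns it into: one pass,
    -- no intermediate allocation.
    sumSquaresOfEvens' :: [Int] -> Int
    sumSquaresOfEvens' = foldl' step 0
      where
        step acc x
          | even x    = acc + x * x
          | otherwise = acc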
So while I really do love F# and enjoy writing in it, "faster than C" is not really applicable. A: You'll be writing C-in-F# (which is fine, if most of your app isn't that way), B: the codegen isn't remotely competitive with a modern C compiler.
[1] Edit: OK, that's an exaggeration, but I mean in comparison to what you'll get out of a C compiler for equivalent, low-level code.
Well, Java has also been shown to outperform C on numeric code. I don't exactly see it as a winning declaration of garbage collection and dynamic compilation. I suspect it's the same way for those numeric benchmarks you have.
Yes, I have used them heavily, and just like any other tool, you need to know when to use them. You've misunderstood my statement, and that is probably my fault for being too succinct. I wasn't saying that parallelism is the be-all and end-all solution, but rather that I wouldn't shrug the research off as a fallacy when they have a history of delivering.
Another problem, even in functional code, is that data dependencies tend to be linear. It's difficult to write code that can be automatically parallelized.
Actually the problem is a bit more complicated than that. It is often quite easy to write a program with a large degree of parallelism in its data dependencies. The problem is to actually implement an execution strategy with sufficiently low overhead that it speeds up execution. Communication overheads and poor memory locality often overwhelm any gains due to parallel execution.
I second the call of bullshit. Though I've only skimmed the paper, I don't see any concrete performance numbers. I only see this:
"The group has written sev- eral million lines of code, including: core libraries (includ- ing collections with polymorphism over element permis- sions and data-parallel operations when safe), a webserver, a high level optimizing compiler, and an MPEG decoder. These and other applications written in the source language are performance-competitive with established implementa- tions on standard benchmarks; we mention this not because our language design is focused on performance, but merely to point out that heavy use of reference immutability, includ- ing removing mutable static/global state, has not come at the cost of performance in the experience of the Microsoft team."
"we mention this not because our language design is focused on performance, but merely to point out that heavy use of reference immutability, ...has not come at the cost of performance in the experience of the Microsoft team."
Well, sort of, but then the moment you run into hazards [1], bad branch predictions [2] or any other problems, the CPU will either stall the pipeline for a few cycles or just flush the whole thing, so it's not like it's a magic solution just waiting to be adopted.
I think you're right about the industry part, but during a research internship I did two years ago I worked on SAC [1], which is a high-level, array-based functional language that does implicit parallelization. Okay, it's not really a general-purpose programming language and is more of a DSL for numerically intensive computation, but the auto-parallelization does work.
Nope. Note the use of "auto" in the OP's message. With GCD the developer has to manually indicate where parallelism happens. There are numerous other systems for doing that.
The hard problem is making the compiler automatically figure it out, and as OP says they haven't done so usefully (yet).
The author seems to think that because a language supports functional programming, it cannot support mutable state in any form. It would be worthwhile to actually look into what sort of support for various types of state a language like Clojure has (http://clojure.org/state). There are at least three ways off the top of my head that I can change the value of a variable in a Clojure process at runtime. It's obviously not the sort of control over state that you have in C++, but it's a far cry from defining a variable once and that being all you can do.
Functional programming does shun mutable state. It just doesn't make it impossible.
Functional programming is a style of programming which uses the lambda calculus as its base model of computation. Just because you can subvert that at times doesn't mean that idiomatic code doesn't disapprove of breaking referential transparency.
Functional programming doesn't shun mutable state. It shuns global state, which is not the same thing. Mutability is well-supported even in very stringently functional languages such as Haskell.
There's no reason to avoid ST, and in many cases it makes the code clearer and shorter than trying to figure out some pointfree contortionism to reach the same goal.
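The textbook example (just a sketch, using Control.Monad.ST and Data.STRef): the loop reads like imperative code, but runST keeps the mutation from leaking, so callers see an ordinary pure function.

    import Control.Monad (forM_)
    import Control.Monad.ST (runST)
    import Data.STRef (modifySTRef', newSTRef, readSTRef)

    -- Imperative-looking accumulation; runST guarantees the mutable STRef
    -- can't escape, so from the outside this is an ordinary pure function.
    sumST :: [Int] -> Int
    sumST xs = runST $ do
        acc <- newSTRef 0
        forM_ xs $ \x -> modifySTRef' acc (+ x)
        readSTRef acc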
I'll look at Coq when it has support for binding to C libraries, opening sockets, or doing anything else that a programming language needs to support.
You seem to be setting up a false dichotomy between using the ST monad and writing pointfree code ("functions as a composition of other functions, never mentioning the actual arguments they will be applied to"). But you can write non-pointfree code that doesn't use the ST monad, and you can write pointfree code that does use it. They're unrelated.
ST is just a prettified C that's safe and well-typed. If you want to write C in Haskell that's fine, but you're not doing functional programming.
This is always the user/theorist divide. I'm not saying that sometimes ST isn't necessary, but it should always be the smallest necessary imperative subset.
By the way, the authors of the ST monad make this same point (read the paper). ST is (by necessity) always single threaded. It ruins modularity by not being able to interact with any non-ST code. In short, it's not functional.
Anyway, theorists will continue to do interesting things with Coq, and whether or not you think it's sufficient as a "programming language," it's useful to us.
The research paper linked to by the author of this article also shuns mutable state, but takes a different approach to how you might do it in Pure FP languages (e.g, with monads).
They basically describe a modified C# language, with global state (static variables) eliminated - any global state is immutable only. Naturally, anything marked immutable cannot access anything marked writable, so unless you want to put writable permissions on everything (turning it back into standard OOP), you're going to need to design programs around immutable data by default. Also, anything marked writable in the new language can only be written to by one thread at a time anyway, since the writable references are unique.
The result is a significantly different style to the traditional imperative OOP people are used to, so it's not like you can lift an existing program into this paradigm without problems.
If the author thinks that FP won't become mainstream, what makes him think that this will?
Here is a good overview of the 13 computational problems ("dwarfs") when it comes to multi-threading.
It's really not about compilers or FP vs. not-FP so much as what kinds of tasks are easily multi-threaded. As I replied in another post, things like state machines are very hard to parallelize (one of the dwarfs the article below talks about), while others are very easy, like a raytracer (to a point, since a raytracer has to share data across all computing units - CPUs).
"Multithreaded code is a particular kind of parallelism"
No.
Parallelism implies that there are multiple CPUs and/or multiple cores at work (or in older use of "parallelism", at least some instructions that are parallelized).
Multithreading doesn't imply that at all: you can have a multi-threaded program (like, say, a Java program using multiple threads plus its GC thread, EDT thread, etc.) running on a single machine with one CPU that has a single core and doesn't do any kind of parallelism.
To me you're totally wrong in saying that multithreaded code is a particular kind of parallelism.
You are correct, I over-simplified. Multithreading is a way of implementing parallelism, but it will not always lead to parallelism. Multithreading always implies concurrency, but only yields parallelism if the underlying hardware allows it. (Concurrency allows for time-sharing the processor so they are not actually running simultaneously; parallelism means they are running simultaneously.)
However, I was not primarily concerned with this distinction, but with the distinction between obtaining parallelism through multithreading, versus parallelism through message passing. Multithreading implies that parallel threads will communicate implicitly through shared memory, which does not scale past a single machine.
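To make the first distinction concrete in code (GHC sketch): forkIO always gives you concurrency, but you only get parallelism out of it if the program is built with -threaded and run with +RTS -N so the threads can land on separate cores.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

    -- Two threads of control: concurrency by definition. Whether they ever
    -- run simultaneously (parallelism) is up to the runtime and hardware.
    main :: IO ()
    main = do
        done <- newEmptyMVar
        _ <- forkIO $ do
            mapM_ (putStrLn . ("worker: " ++) . show) [1 .. 3 :: Int]
            putMVar done ()
        mapM_ (putStrLn . ("main:   " ++) . show) [1 .. 3 :: Int]
        takeMVar done  -- wait for the worker before exiting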
Pig can turn a combination of relational operators into a series of Map and Reduce operations that can be done on a Hadoop cluster. This is all stuff I could code up by hand, but most of the things I might do with a shell script or SQL statements I can parallelize in a way that's scalable in both directions. Because I can write parallel code easily and quickly, I can use it to do little jobs. As for big ones, my home cluster handles terabytes, and I can rent any level of power from AWS.
The paper referenced by the article, "Uniqueness and Reference Immutability for Safe Parallelism", says:
The group has written several million lines of code, including: core libraries
(including collections with polymorphism over element permissions and
data-parallel operations when safe), a webserver, a high level optimizing
compiler, and an MPEG decoder.
Incidentally, I hear this paper is quite similar to one that the Rust[1] folk recently authored regarding their "borrowed pointer" system, due to be possibly published next year. If this sort of thing interests you, I suggest you take a look at Rust (though keep in mind that it's still very pre-alpha).
Sorry - by task-based parallelism I meant that there's no need for explicit threading, which is both good and bad.
Disclaimer: my thesis is on a very similar system for a different programming language (from before this paper was released), with some differences. For example, they assume mutable is the default, whereas I assume readonly is the default.
Automatic parallelization has been the holy grail of performance-based computing since the 1980's or earlier, and if auto-threading compilers had arrived we would all know about it.
The fact remains that for general-purpose code, automatic parallelization is an unsolved and exceedingly difficult problem. So difficult that PG claimed it as one of his highly ambitious startup ideas.
You may find this presentation[1] interesting; it was written by Tim Sweeney of Epic Games (whose engine powers a lot of games).
In the section on concurrency, he notes the hardest part is updating the game logic (tens of thousands of interacting objects) and notes that synchronizing it is hopeless. Instead, STM:
"~2-4X STM performance overhead is acceptable: if it enables our state-intensive code to scale to many threads, it’s still a win. Claim: Transactions are the only plausible solution to concurrent mutable state"
It's also a neat presentation as it's from the perspective of someone running a real large scale, commercial, time-sensitive, _real world_ software project. Yet he talks about the most common types of bugs, and how FP and other advanced concepts would help (like dependent typing).
Thanks for the link. It's interesting to see a game developer's view on FP, though the presentation is a little undermined by the conclusion that includes statements like:
> By 2009, game developers will face CPUs with 20+ cores.
I think that line should be prefaced with the assumption from the previous slide that CPU and GPU would merge.
Anyway, the fact that we only have ~16-core CPUs today just means he was off on the timeline, and we got a little more "free lunch" for a few more years: a single-threaded program gets 12% instead of 5% of a CPU's capability.
The underlying "we're screwed without better tools" still stands. Besides, the work required to take advantage of 8-way concurrency on mutable state is the same required for 32-way.
Yes and no. If you were playing the kind of games we used to have in the Core 2 Duo days, then, yes, this kind of game would be GPU-limited. (Note, I don't mean get a 10 year old game and try playing it, I mean those games updated to run on today's graphics engines).
But today's games involve a completely different paradigm. Almost all the games today have such a huge focus on open worlds and modeling the interaction between hundreds of thousands of objects in real time. Even the most GPU-intensive such games can still realize bigger FPS gains with a CPU upgrade than a GPU upgrade.
Depends on the game, really. Take a look at Minecraft and Dwarf Fortress - they are both CPU-bound due to:
1. Simulating a large world using complex entities and voxels
2. Being single-threaded with no clear way to make them multi-threaded
Minecraft in particular is a pretty interesting problem: since it only simulates the part of the world that's within a radius of a player, it would make sense to have each player's machine simulate its own part of the world and the server somehow merge those together.
> Minecraft in particular is a pretty interesting problem: since it only simulates the part of the world that's within a radius of a player, it would make sense to have each player's machine simulate its own part of the world and the server somehow merge those together.
I can't imagine people would trust each other enough to allow them to simulate themselves.
Minecraft, to me, is a visual 3d database -- digging dirt doesn't remove something, it merely changes the value for that block in the database -- from "dirt" to "air" -- which changes how the game's algorithms act. And the game still ships with "developer graphics" which would have been easy for a TNT to run.
I never got the impression that DF was single-threaded due to theoretical constraints, only that the game was originally single-threaded and that refactoring to make use of multiple threads is just too difficult for a lone developer on such a sprawling codebase.
Simulating a large world seems like a textbook example of an "embarrassingly parallel" problem, doesn't it? At least that's true for cellular automata.
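Something like Life, for instance, splits trivially because every cell reads only the previous generation (Haskell sketch using the parallel package):

    import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

    type Grid = [[Int]]  -- 0 = dead, 1 = alive

    -- One generation of a Conway-style update. Every cell reads only the
    -- previous grid, so the rows can be computed in parallel with no
    -- coordination between them.
    step :: Grid -> Grid
    step g =
        [ [ rule (g !! y !! x) (liveNeighbours x y) | x <- [0 .. w - 1] ]
        | y <- [0 .. h - 1]
        ] `using` parListChunk 16 rdeepseq
      where
        h = length g
        w = if null g then 0 else length (head g)
        liveNeighbours x y =
            sum [ g !! y' !! x'
                | dx <- [-1, 0, 1], dy <- [-1, 0, 1], (dx, dy) /= (0, 0)
                , let x' = x + dx, let y' = y + dy
                , x' >= 0, x' < w, y' >= 0, y' < h ]
        rule 1 n | n == 2 || n == 3 = 1  -- live cell survives
        rule _ 3 = 1                     -- dead cell with 3 neighbours is born
        rule _ _ = 0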
I think the problem comes from the fact that the world is a large mass of interconnected mutable state. You don't know whether updating a particular object will update another.
Think of a player crouching on a plank that another player just phasered. If you continue simulating the first player's "crawl forward" movement by itself, how do you integrate the result of the plank being disintegrated, causing the player to fall? The first player's simulation outcome depends on multiple inputs, but they aren't discoverable directly from that player's perspective. You have to first simulate the phaser beam to know the plank is gone to know the player is now falling, not crawling.
And then imagine that with far more complex rules and a few hundred thousand objects having similar interactions, each one possibly modifying any other one.
The not-always-pure FP that sometimes uses state is what real-world functional programmers actually use. We manage state. It's neither possible nor desirable to eliminate it outright.
It wasn't until recently (a couple weeks ago, when giving a presentation on FP) that I realized why stateful programming lends itself so easily to evil. If a program is a serial collection of possibly unrelated stateful actions, everyone can add new intermediate behaviors to a function to satisfy some dipshit requirement, and the API doesn't change. Writ large, this allows silent complexity creep.
I think a major reason why FP is better is that changing a purely referentially transparent function requires an API change: more parameters or a different return type. If nothing else, this tends to break tests. It's hard to change functions that already exist, and it should be hard. You should be writing new ones instead. Also, if there's only one way to combine programs (function composition) it's easier to break them up. So you don't get the 500-line monsters that plague enterprise codebases.
That said, the worst thing about OOP isn't state. It's inheritance, which is the 21st-century goto.
I've sometimes heard the complaint that FP is more complicated than imperative programming, what with having to pass more variables around instead of just incrementing counters or whatever. I am of the opinion that FP isn't more complex, but rather it makes visible the complexity that was already there. Programming with states implicitly requires ordering dependencies between those states, which means that there are hidden relationships in your code. Converting states to passed parameters isn't adding complexity, it's exposing it.
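A tiny example of the same point (a sketch with made-up names; the first version hides the ordering dependency in an IORef, the second is forced to expose it as a parameter):

    import Data.IORef (modifyIORef', newIORef, readIORef)

    -- Hidden state: each call bumps a counter you can't see in the type.
    -- The two calls in main below are order-dependent, but nothing says so.
    mkFreshIO :: IO (IO Int)
    mkFreshIO = do
        counter <- newIORef 0
        pure $ do
            modifyIORef' counter (+ 1)
            readIORef counter

    -- Explicit state: the same dependency, now visible as a parameter and
    -- a result that the caller has to thread through.
    fresh :: Int -> (Int, Int)
    fresh counter = (counter + 1, counter + 1)

    main :: IO ()
    main = do
        next <- mkFreshIO
        a <- next
        b <- next                -- depends on the call above, invisibly
        print (a, b)
        let (x, s1)  = fresh 0
            (y, _s2) = fresh s1  -- the dependency is right there in the code
        print (x, y)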
> I am of the opinion that FP isn't more complex, but rather it makes visible the complexity that was already there.
Edit: This is a very interesting statement, seriously. How easily we stumble onto a statement whose truth hinges on the purpose of programming languages. There are a lot of ways one can look at this.
For example, the statement may be true, but if the failure to hide complexity is what's making things hard, then something is still making things hard. Reframing someone's problem won't make it go away unless you also hand them a way to deal with it.
Also, isn't the whole point of a high-level language that it hides complexity appropriately, in order to allow the limited human brain to deal with the ungodly complexity of a large computer program? Having to paste ten boilerplate arguments into a group of functions (when, if one were using OO, the arguments would be subsumed in the object) might not add complexity to the resulting logical object, but without other mitigating factors it seems to me that such an approach adds to the complexity of the code-creation process, which is really the major bottleneck in computer programming, right?
> Having to paste ten boilerplate arguments into a group of functions...
is completely unnecessary. Depending on how you plan to use the boilerplate arguments, you'll typically encapsulate them with either a Reader, Writer or State monad.
Haskell, for example, has records and product types for bundling data.
It also has currying, for ad hoc accumulation of arguments, so you don't need the boilerplate of ten overloaded methods or a builder class just to accumulate arguments.
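For example, with Reader the shared arguments collapse into one environment that gets threaded for you (sketch; Config and its fields are invented, and this needs the mtl package):

    import Control.Monad.Reader (Reader, asks, runReader)

    -- Stand-in for the "ten boilerplate arguments".
    data Config = Config
        { dbUrl   :: String
        , verbose :: Bool
        }

    describeDb :: Reader Config String
    describeDb = asks (("connecting to " ++) . dbUrl)

    -- Calls describeDb without explicitly passing the Config along.
    report :: Reader Config String
    report = do
        line   <- describeDb
        chatty <- asks verbose
        pure (if chatty then line else "")

    main :: IO ()
    main = putStrLn (runReader report (Config "postgres://localhost" True))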
I found the analogy to be pretty accurate. It's very easy to shoot yourself in the foot with both goto and inheritance, it's common to see both of them abused, and in either case there are alternatives that are generally equivalent but result in cleaner code.
Of course, both of them can also come in handy if you are careful and know what you're doing.
Are you sure he knocked goto? Like inheritance, it has great potential to turn code into spaghetti. Like inheritance, it can be useful if you're careful.
"Are goto’s always bad eventually? Who cares? There’s far bigger problems with gradual change in codebases, let’s think harder about them. My peer reviewers, hold off review until you’ve used what I’ve built and found problems with it. Managers and programmers, wrestle with judgement everyday, the stuff not easily put into rules. Leave the comforting shallows and engage with the abyss."
"The power of a single CPU core has plateaued and won't ever get faster. CPUs are only getting more cores."
Pretty sure that isn't true. It may be getting more difficult to make CPUs faster, but advancements in CPU architecture continue.