Beating C with Futhark Running on GPU

raphlinus · on Oct 25, 2019

Monoid homomorphisms for the win. I discussed a very similar approach for computing the length of the longest line in a "rope science" article[1], as well as unquoting strings[2]. In this case, it was very nice to see actual code, and Futhark looks like a good language. For the string unescaping, I used Nvidia's Thrust, which is a templated C++ library; generally pretty similar to the Futhark code, and generally similar results.

[1]: https://xi-editor.io/docs/rope_science_01.html

[2]: https://raphlinus.github.io/personal/2018/04/25/gpu-unescapi...

Herrin · on Oct 25, 2019

I found this more readable and understandable than the Haskell post, although I can't quite say why. It might simply be the repetition.

I'm really interested in Futhark, though I haven't found a project where it would be make sense to use it. But I feel like it has the same potential to make GPU programming not feel overwhelming the same way Elm did with frontend work for me.

mjaniczek · on Oct 25, 2019

Since Futhark has sum types, I wonder whether we could transpile Elm syntax to Futhark. I'll have to dig into what's possible with Futhark and how well it would map...

earth_walker · on Oct 25, 2019

James Carson gave a talk at this year's Elm Conf on using Elm to talk to Futhark:

https://www.youtube.com/watch?v=FVP8zxpZKV8

Athas · on Oct 25, 2019

The biggest problem would be the absence of recursion, but a recursion-free subset of Elm (with a different standard library) would be straightforward.

e12e · on Oct 25, 2019

> Word counting is primarily IO-bound, and it is much too expensive to ferry the file contents all the way to the GPU over the (relatively) slow PCI Express bus just to do a relatively meagre amount of computation.

After seeing that it's possible to play crysis using software rendering on an AMD Rome cpu with 128 hw threads [1] - might this lead to some vindication for AMD sticking with opencl (assuming exposing such a cpu via opencl) - or is it just simpler to ignore that (in general and for futark) and just use regular threads for parallelizing aacross many cpu cores?

[1] https://news.ycombinator.com/item?id=21339652

Mathnerd314 · on Oct 25, 2019

CUDA is proprietary to NVidia, and is pretty much the standard for GPU computing. AMD's been chipping away with OpenCL, Vulkan/GLSL, https://github.com/RadeonOpenCompute/hcc/wiki, etc. but not much luck so far. I wouldn't say AMD's been "sticking with" OpenCL, if anything it seems like they will deprecate it in a few years, as the plan is to fold OpenCL into Vulkan.

I guess it is possible to use OpenCL on the CPU as well, but it seems to be intended mostly for testing purposes. The Crysis software renderer uses threads: https://github.com/google/swiftshader/blob/master/src/Common...

rrss · on Oct 25, 2019

Last I checked, AMD implemented the cuda apis as "hip."

tom_mellior · on Oct 25, 2019

I'm starting to sound like a broken record on this, but if you're going to compare to your system wc without trying to figure out if it was compiled with -O3 [EDIT: and what source code it was compiled from], you haven't shown anything in the sequential case.

What this article does show is that Futhark really does allow one to express this in a much simpler way than Haskell.

Athas · on Oct 25, 2019

That's a good point, but the -O3 doesn't actually do a whole lot here. I recompiled the Futhark-generated C code with just -O and performance was unchanged. If you look at the generated C code, there isn't really a lot to do either: https://gist.github.com/athas/7c8ffc2620a9406e4bbb0df89f2fc9...

I hope I can assume that RHEL compiles their wc with at least -O.

tom_mellior · on Oct 25, 2019

> That's a good point, but the -O3 doesn't actually do a whole lot here.

True, maybe it's not about -O3 but about some factor in the unknown source code of the system wc. I did compile one version of wc with -O3 and it beat my system wc (Ubuntu) by 2x: https://news.ycombinator.com/item?id=21271951

Athas · on Oct 25, 2019

Honestly, I would expect the main reason my wc is faster is that mmap()ing the file and then reading it in a huge chunk is about as fast as the kernel's IO can go. GNU wc cannot do this in general because it's supposed to work on pipes as well, and I doubt anyone cared enough about the tiny performance difference to exploit the case where the input file is mmap()able.

(I had actually hoped Futhark would be slower sequentially, just so this wouldn't be the focus of the discussion!)

tom_mellior · on Oct 25, 2019

> GNU wc cannot do this in general because it's supposed to work on pipes as well

I used the "reference" source code linked from the original Haskell post, a BSD version hosted by Apple: https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

It uses raw read() from a file descriptor and works with pipes as well. I think the only special handling for stdin vs. an actual file it has is calling fstat() if only the number of characters is requested, which shouldn't apply here.

So yes, this version does need to do more complicated I/O than a simple mmap(). And (broken record, but I'll stop after this) it's 2x as fast as my system's GNU wc (when compiled with -O3 vs. however the system wc was compiled).

> I had actually hoped Futhark would be slower sequentially

It might still turn out to be, if you see if you can get a faster C version of wc.

YSFEJ4SWJUVU6 · on Oct 25, 2019

>It might still turn out to be, if you see if you can get a faster C version of wc.

You definitely can, at least if you allow manually vectorized code.

On my system, with a 1.661GB file (256 times big.txt from the original Haskell post) GNU wc takes about 6.5s (real time), a stripped down version of Apple's implementation about 4.1s, and a single-threaded vectorized wc (written in C) only 0.27s. (These times are of course only with a hot cache. For reference, catting the same file to /dev/null takes about 0.18s.)

edit: corrected the time for the BSD-derived implementation

ummaycoc · on Oct 26, 2019

Would you mind sharing the 0.27s version?

YSFEJ4SWJUVU6 · on Oct 27, 2019

See https://git.io/JeEjB

megous · on Oct 25, 2019

My experience is that memory mapping the file is not faster than reading it in big chunks into an already mapped, pre-allocated buffer.

You're going to be intially page faulting every 4096 bytes if you mmap the file. The fact that you're accessing the mapped range sequentially in this case may help, I guess.

glouwbug · on Oct 25, 2019

Most of -O3's performance is lost across translations units.

Compile and link with `-flto --march=native -O3` and you're good to go.

OskarS · on Oct 25, 2019

Fair enough, but I'm honestly really enjoying this little round of programming language bragging. It's fun to see these smallish (well, Haskell is pretty big) languages all duke it out over how to solve this problem. Like: yeah, nobody's ACTUALLY going to replace their system wc with these implementations, but the articles themselves are nevertheless very fun. A very clever little case study!

yiyus · on Oct 25, 2019

Very honest discussion of the results. I liked it.

Would it be possible to use Futhark to rewrite the APL implementation instead of the Haskell one? That would make an interesting comparison.

Athas · on Oct 25, 2019

> Would it be possible to use Futhark to rewrite the APL implementation instead of the Haskell one? That would make an interesting comparison.

Sadly, from what I can see, the APL version makes use of so-called nested arrays in the 'words' function, specifically arrays of strings (this is different from multidimensional arrays). Futhark does not directly support nested arrays. A rewrite of the APL implementation would require using a quite different algorithm (or a nontrivial encoding).

But my APL is a bit rusty, so I may be wrong.

ummaycoc · on Oct 26, 2019

In my original version I didn't use a nested array of strings, but [Olzd](https://news.ycombinator.com/user?id=olzd) pointed out a way you could do that and I added it as a theoretical first attempt (theoretical since it wasn't my first attempt, but would have been had I been better at APL).

My first attempt did some stuff with subtracting items in an array from their neighbor. Now with info I got from [mlochbaum](https://news.ycombinator.com/user?id=mlochbaum) I have another version that uses windowed reductions.

So those are three versions there; after that I just split it up just to see where that leads and that actually ends up feeling a lot like the Haskell / Futhark solution to me.

ngcc_hk · on Oct 25, 2019

No time working on my I Ching course for my master degree. However I join a programming competition many years ago About an incident of a crazy assignment for a kid. My clone of the github here:

https://github.com/kwccoin/ABCDEFGHPPP

It is fun. I think I even try to use a micro version of cobol to do it. But the fastest is still c.

May be we can start one. Sadly no time to join.

(I think there is a web site that post many versions of the same program. It must have Wc. If not these should be there. )

jb1991 · on Oct 25, 2019

A lot of the Futhark demos you see are rather basic algorithms like matrix multiplication, and the documentation for Futhark does say that it is not well-suited to complex kernels, so that to me puts a big limiting factor on how useful it could be to invest in it.

I really like technologies like this and Sycl which aim to greatly simplify the process of writing GPU code. The important thing is that it can handle what you'd throw at it as if you were writing directly in Metal, Cuda, OpenCL, and I don't think that is the case (yet?) with Futhark.

Athas · on Oct 25, 2019

It depends on what you consider a "complex kernel". Futhark is only for regular non-recursive data parallelism, but I'll argue that something like a genetic algorithm that does calibration of market parameters in the Heston model[0] is pretty complex. It comprises multiple levels of parallelism and several kernels (last I checked, the core work is done in four kernels which are invoked in a loop).

But more importably, this benchmark is written as a composition of two reusable parts (a genetic algorithm that is parametric in its objective function, and a specific objective function that does option pricing) that are then put together in an efficient and automatic way by the compiler. You literally could not write it this way in OpenCL or CUDA (modulo extreme amounts of template metaprogramming in the latter). While you could certainly write a specialised GPU program that did exactly this calibration, and probably outperform Futhark, you would not be able to structure it as reusable components without significant performance loss. This, I think, is the main advantage of using a high-level language together with an optimising compiler.

[0]: https://github.com/diku-dk/futhark-benchmarks/tree/master/mi...

jb1991 · on Oct 25, 2019

Thank you for this thoughtful reply, really appreciate it.

sword_smith · on Oct 25, 2019

This sounds smart. I haven't programmed Futhark yet, though, but I really enjoy functional programming. Where is this language primarily used, at universities or also in industry?

olodus · on Oct 25, 2019

I got to use it in university in a course about parallel functional programming. I think it is still mostly a research language, but if you need to do computation on the GPU and like func prog then it probably would be an interesting alternative to try out. It really does a great job of optimizing and parallizing you program.

sword_smith · on Oct 25, 2019

Which university? :)

olodus · on Oct 25, 2019

Chalmers University of Technology, in Gothenburg. Though Futhark is developed in Copenhagen I think. We had a guest lecture with one of the developers of the language. Was a really interesting lecture together with some exercise for us to play with the Lang. Though it has gotten even better since then from what I've seen in their blog.

sword_smith · on Oct 25, 2019

I wonder if this language has spread beyond Europe yet. Do any Americans use it?

jacquesm · on Oct 25, 2019

Beating CPU with GPU would have been a better title.

Svip · on Oct 25, 2019

I think the title is a reference to the two previous articles that inspired it. Like the old 'X considered harmful' format.

jsd1982 · on Oct 25, 2019

One could easily amortize the startup cost by putting the backend logic into a background service where it only starts up once and continues running. A frontend program like `wc` would just forward requests to the backend service.

tempodox · on Oct 25, 2019

A nice, unobtrusive way of showing monads at work.

_fq4v · on Oct 25, 2019

These are monoids, not monads.

makz · on Oct 25, 2019

GPU supremacy

_wldu · on Oct 25, 2019

C is the Mike Tyson of programming languages. There will never be another like it. It's simple, dangerous and fast. You can't beat C, but everyone will keep trying. It may beat itself in the end though as it's too rough for the modern world.

pjmlp · on Oct 25, 2019

"Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization...The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue.... Seibel: Do you think C is a reasonable language if they had restricted its use to operating-system kernels? Allen: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve. By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities."

-- Fran Allen interview, Excerpted from: Peter Seibel. Coders at Work: Reflections on the Craft of Programming

jacquesm · on Oct 25, 2019

That's a nice quote but it is also a bit nonsensical in that there has been plenty of work on optimizing compilers, also for C/C++ and a whole bunch of other languages.

That quote was true for some time but running into the limits of CPU clock gains for single threaded tasks all those optimizations are more than valid once more, and as you've no doubt noticed there is a veritable run on multi-core solutions embedded deeply in the languages, of which this article illustrates one special case using a co-processor.

pjmlp · on Oct 25, 2019

The point being that generated code from C compilers being always the one to beat is an urban myth.

C compilers are only speed monsters thanks to almost 50 years of research in optimizing C and C++ compiler backends.

jacquesm · on Oct 25, 2019

> The point being that generated code from C compilers being always the one to beat is an urban myth.

Ok.

> C compilers are only speed monsters thanks to almost 50 years of research in optimizing C and C++ compiler backends.

Lots of those optimizations apply in one form or another to other languages as well.

And... so which is it? Are C compilers fast and the thing to beat or did that research go nowhere?

C is the speed benchmark because 'beating C' is what will get you a foot in the door. Being 'slower than C' is going to get your pet language booted out the door because:

- companies tend to compete on speed of execution

- the speed of the compiler itself is a major factor in turnaround time for the typical edit-compile-test cycle

- nobody cares about security until they've been bitten hard.

This is all very frustrating but it seems to - in my experience - accurately reflect priorities in lots of corporations. It is up to us to change that.

See the title: it is about the speed of execution, and it uses one technology 'GPU' to challenge another 'CPU' and yet the accent is on which language was used.

dooglius · on Oct 25, 2019

I wonder if Allen considered that there might be a reason people chose to use C, it's not as if they were forced into it. "Something where experts can really fine-tune without big bottlenecks" is not a requirement specific to operating systems. It's a bit ironic to think that the inability to run fancy optimizations is considered a problem with C when one of the most widely recognized flaws in the C ecosystem is the way mainstream compilers will exploit behavior that was labeled "undefined" just to support old obscure architectures, to run all kinds of crazy transformations.

pjmlp · on Oct 25, 2019

The same reason people got to choose JavaScript or PHP years later, the platform's adoption, in this case UNIX.

jacquesm · on Oct 25, 2019

I don't think that's true. When I came of age in programming there were lots of choices already but none that offered the balance of control and speed that C gave. It was a very easy choice; do I invest my next two decades in Assembly, Pascal (in several flavors), Modula-2, Forth, LISP (which required unobtainable computers at the time), Basic (compiled or interpreted) or C?

UNIX had very little to do with it, only very few people were lucky enough to have access to UNIX machines but untold 100's of thousands had access to PCs or 8 bit micros. The first time I saw a UNIX machine it was an Acorn 'Unicorn' and it was so far ahead of what I could afford that it might as well not exist.

pjmlp · on Oct 25, 2019

Sure it was, there were zero reasons to use C on CP/M, MS-DOS, Atari, Amiga, Mac.

It was just another programming language fighting for developer eyes.

On Windows and OS/2, although IBM and Microsoft decided to go with C for the underlying low level layers, C++ was the way to go for high level coding, C Set++, MFC. With Borland having Turbo Vision, OWL and VCL.

Macs were Object Pascal territory, and when MPW got C and C++ support, PowerPlant C++ framework was the way to go.

Epoch, BeOS and Symbian were also C++ territory.

MS-DOS games were also adopting C++ via Watcom and its DOS extender.

UNIX and the rise of FOSS, based on UNIX culture, were definitely the only reason.

flohofwoe · on Oct 25, 2019

IME C++ wasn't really a mainstream option until the second half of the 90's. BeOS choosing C++ for its operating system APIs was an extremely exotic choice at the time (same level of "exotic" as NeXT choosing Objective-C).

And IMHO, at that time, before or around C++98, C++ didn't fix a single problem of C, but instead just added a lot of new ones (one could argue that this is still the case even today).

zabzonk · on Oct 25, 2019

> IME C++ wasn't really a mainstream option until the second half of the 90's

I was getting paid for teaching C++ on commercial training courses in 1990 - the C++ courses were probably the most popular after C and UNIX

> before or around C++98, C++ didn't fix a single problem of C

Of course it did, or why would people like me have transferred wholesale from C to C++?

pjmlp · on Oct 25, 2019

It was so exotic that at my university, FCT/UNL, starting in 1992 they switched the first year students to learn Pascal followed by C++.

C was never taught as such, as any student was expected to know it from their C++ classes.

The professor was a great teacher of what all the ways that C++ fixed C's problems, by providing his own data structures for strings, arrays, vectors, linked lists and hash tables, all with bounds checking enabled by default on his implementation.

Other issues that C++ fixed over C was having implicit conversions, a proper way to allocate memory (malloc() with sizeof, really?), ability to ensure valid pointers via references.

jstimpfle · on Oct 25, 2019

C++ fixes many of C's problems and introduces so many more that are not fixable. ("But why don't you write const-correct code? Why don't you use move semantics? Use the rule of three!")

> malloc() with sizeof, really?

One tiny macro to solve 20% of all the problems people are whining about.

    void _alloc_memory(void **ptr, size_t numElems, size_t elemSize)
    {
        size_t numBytes = safe_multiply(numElems, elemSize);
        void *p = malloc(numBytes);
        if (p == NULL)
            fatal("OOM!\n");
        *ptr = p;
    }

    #define ALLOC_MEMORY(ptr, numElems) _alloc_memory((ptr), (numElems), sizeof **(ptr))

    int *myArray;
    ALLOC_MEMORY(&myArray, 25);

I've been using this for years without a problem.

pjmlp · on Oct 26, 2019

That is the thing C++ doesn't require developer boilerplate for something that even Algol supports properly.

Plus all C workarounds for "safe" code tend to fall apart when teams scale above 1 team member, as it keeps being proven by endless industry and academic reports.

Now Android NDK is Fortify enabled, with hardware tagging planned for all new ARM based models.

jstimpfle · on Oct 26, 2019

> boilerplate

There is a difference between "explicit code" (which is mostly a good thing) and "boilerplate". It's saying what you mean vs saying what the platform requires you to say (or repeat, involuntarily). C definitely leads to the former, but not to the latter.

That said. It's a 1 line macro! You are being ridiculous. The amount of insanity we have to go through in so many other languages constantly, not just for setting a good base, is something completely different. It's not measured in handfuls of lines, but in number of hairs pulled out.

Compare:

    #define ALLOC_MEMORY(ptr, numElems) _alloc_memory((ptr), (numElems), sizeof **(ptr))

    https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_vector.h

(Not a fair comparison, but I think it makes a very good point)

> for something that even Algol supports properly.

Probably with an allocation builtin, not allowing for custom allocators?

> supports properly.

There is nothing in this simple macro that isn't "proper". There are no ways it can break (although I still feel it would be nice to have expressions macros, not only token macros). The only requirement is you write it yourself. C in general doesn't like to give you canned things. It has made such mistakes in the past (see large parts of libc) and has actually learned from it.

> Plus all C workarounds for "safe" code tend to fall apart when teams scale above 1 team member, as it keeps being proven by endless industry and academic reports.

You could absolutely implement C with managed memory. It just wouldn't be a good idea. Use other languages if you want these tradeoffs.

Some of the most massive codebases in the world are C. (Often disguised as "C++ by experienced devs"). They are maintainable, protected investments, many decades old, and still in working order, precisely because a minimalistic language approach leads to modular APIs. It scales very well, and the "problem" might be mostly that the defect rate per line doesn't go down as the lines go up.

But the best feature of these codebases is that they exist, because C enables independent development of subsystems much better than the intertwingled messes and dead ends that most "statically-systematic" approaches lead to on non-trivial scales.

It's a huge boon that I don't have to think about rewriting interfaces using multiple inheritance or virtual inheritance or template insanity or SFINAE or unique_ptr or move semantics or rvalue references or static assertions or compile time evaluation, or the next fad around the corner, every 5 years.

jacquesm · on Oct 26, 2019

For a slightly improved version, place a sentinel (say a 4 byte magic number) just before and after the allocated segment. Check on FREE_MEMORY if they're still there. Not perfect but pretty good as an early warning system.

jstimpfle · on Oct 26, 2019

Yes. Another possibility, add something with __FILE__ and __LINE__ if there are memory leaks to debug. (Never did that but it should help).

flohofwoe · on Oct 25, 2019

C was (and still is) a fairly obvious choice for programming when you need full control over the memory layout of an application (which is becoming all the more important with the growing CPU/memory gap). I choose C 15 years before I got into contact with UNIX (first on the Amiga, after that on Windows, and only fairly recently macOS and Linux).

pjmlp · on Oct 25, 2019

Nothing special about C, plenty of other languages offer similar features.

ISO C doesn't offer full control over anything beyond the abstract memory model of the standard.

jstimpfle · on Oct 25, 2019

So what? Chill out.

We all know that C is not without flaws. I would use other languages if there were good alternatives. For example, I liked some parts of Delphi, but it has too many show stoppers. For example, all local variables still have to be declared in the variables section before the function body, right? And are we still required to make type aliases to use pointer types in important places, such as function signatures?

C is special in that it is the only language that I've known that has a minimalistic attitude, resulting in a shitty language (every language is shit!) whose problems we can actually work around in practice.

pjmlp · on Oct 26, 2019

Just like C used to declare variables until C99.

Delphi is not Go, you can declare variables at the point of use nowadays.

Just like you can declare pointer types in function signatures, which would fail my code review, as those things tend to get out of hand.

Minimalist attitude leads to write only code bases, where it is impossible to maintain on long term projects with regular rotating team members, as it usually happens in most multinationals.

A workaround done today is a security exploit waiting to happen tomorrow.

jstimpfle · on Oct 26, 2019

> Delphi is not Go, you can declare variables at the point of use nowadays.

Oh, it seems they introduced it in 2018, some time after I quit my 6-months Delphi stint. http://blog.marcocantu.com/blog/2018-october-inline-variable...

So, given the age of Delphi (and Pascal!), Go still has plenty of time to be quicker. Not sure what's missing from it, though.

Of course, I'm sure you knew all of that, and could have just mentioned it. But maybe you just want to convince people of unrealistic propositions, and claim that some obscure technologies were more practical than they really are.

> Just like C used to declare variables until C99.

1. C99 was 20 years ago, 19 years before Delphi.

2. What you say is wrong. You could declare variables at the start of any block since forever (I think it's standardized in C89).

3. What matters is compilers in practice, and I'm pretty sure they allowed you to declare variables anywhere, and also "for (int i..." (which is C99) since forever (as an extension).

> write only code bases

such as C++ code bases?

> regular rotating team members

I've just never seen that not becoming a mess

> A workaround done today is a security exploit waiting to happen tomorrow.

I'm still waiting for my code review. https://news.ycombinator.com/item?id=21290314 . For a start, where are my terrible workarounds?

jstimpfle · on Oct 25, 2019

It's funny, when I'm not on Unix, C is still my go to language which lets me get shit done. I won't have to deal with performance problems or painful FFIs. Yesterday I was dealing with Webassembly, and having it interoperate with WebGL made me pull out the last few hairs I still had on my head. Go figure.

namirez · on Oct 25, 2019

C is not a fad; it's an outlier among languages in the sense that basically it's a portable assembler very close to the metal. If you change the basic architecture of the machine, you can create a better language. For now however it's unlikely to beat C.

gmueckl · on Oct 25, 2019

I'm not sure that this high level assembler assumption still holds for SIMD-capable CPUs. C compilers are asked to do quite drastic code transformations like autovectorization on these architectures. With these, the tight relationship between the high level code C code and the generated machine code is removed.

megous · on Oct 25, 2019

You can still treat it that way even with SIMD. I quite enjoy using NEON (ARM SIMD) intrinsics in C, and the like.

gmueckl · on Oct 25, 2019

When you do that, you write pretty much the equivalent of platform specific assembly code (not exactly, but the differences don't matter for what I want to say). What I am saying is that modern compilers also take your "dumb" code that is not SIMD, but just a pedestrian implementation of something and they still turn it into SIMD or do other very drastic rewrites to it that are hard to reason about. And these optimizations and transformations tend to stack. Something that had inlining, tail recursion optimizations and autovectorization applies to it may end up retaining absolutely no resemblance to what was actually written as C code. Most of the nice properties of C as a low level language come from the fact that you can map the code to assembler in your head - as long as the compiler is not trying to get too clever. Then the intuition becomes merely an illusion and the whole thing becomes harder to use. For example, strict requirements like strict ordering requirements for accesses to hardware registers in a device driver. C has come to a point where you have to pull stunts to prevent the compiler from reordering your memory accesses.

jstimpfle · on Oct 25, 2019

I don't have experience with SIMD, and for the things I do I couldn't care less. I like C as a super-productive language that doesn't get in my way. The output from unoptimized code is magnitudes faster than what I get from higher level scripting languages. And much more efficient than with GC languages for any non-trivial stuff. And I can write that code almost as quick as Python or Java code, and with very little debugging time (after some years of experience).

I might be in the minority, but to me, C is as high-level as we should go, for many many problems. If you really care about registers and SIMD stuff, then your concerns are architecture specific, and that's not really what C does well. What C does is mostly abstracting registers. Why blame it for that? The few places where you need SIMD, well, just insert architecture specific code there.

Is there a way to write portable code that can be better optimized?

namirez · on Oct 25, 2019

> they still turn it into SIMD or do other very drastic rewrites to it that are hard to reason about

Maybe it's just me, but I haven't seen this being a major problem in C. Most optimizations are local and fairly easy to reason about. C++ is a whole different story.

pjmlp · on Oct 25, 2019

Any language can have intrinsics, in fact the first systems language with intrinsics support appeared 10 years before C was created.

nickpsecurity · on Oct 25, 2019

They did change the architecture of the machine. C didn't have concurrency and SIMD built-in.

pjmlp · on Oct 25, 2019

Portable assembler for an abstract machine modelled on what PDP-11 processors used to be.

OskarS · on Oct 25, 2019

I mean, Futhark certainly can. The whole point of Futhark is that it's a functional language that can run on GPUs. Futhark will beat the pants off of C for most any problem that is suitable for GPU computation, even if written in an entirely functional style.

For CPUs, I'd like to introduce you to my good friend Fortran.

Crinus · on Oct 25, 2019

Also for CPUs there is ispc (https://ispc.github.io/) which provides a language that takes advantage of modern CPU's parallel friendly features with a C-like syntax. From the site:

> ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs and the Intel Xeon Phi™ architecture; it frequently provides a 3x or more speedup on CPUs with 4-wide vector SSE units and 5x-6x on CPUs with 8-wide AVX vector units, without any of the difficulty of writing intrinsics code. Parallelization across multiple cores is also supported by ispc, making it possible to write programs that achieve performance improvement that scales by both number of cores and vector unit size.

There is also an interesting story of its development (full with office politics drama :-P) written by its -then- main developer: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html

yiyus · on Oct 25, 2019

> For CPUs, I'd like to introduce you to my good friend Fortran.

Fortran scares people because it is often identified with FORTRAN77. I find very unfortunate that the concept of F (a modern simplified Fortran without the legacy features [1]) never took off, although I understand the reasons.

[1] https://www.fortran.com/F/index.html

gridlockd · on Oct 25, 2019

That's an apples to oranges comparison, because Futhark needs to compete with C on the GPU (CUDA/OpenCL), not C on the CPU.

That's what I'm missing from these benchmarks - how does it fare against a handwritten, competent implementation in those languages?

Athas · on Oct 25, 2019

You are right that it's interesting to compare Futhark-on-GPU with X-on-GPU for various values of X. In most of our academic work, that is what we do.

However, in practice, X-on-GPU where X is not Futhark is rare, because GPU programming is notoriously difficult and time-consuming. Futhark's purpose is to make high-performance data-parallel programming more accessible, even if you could potentially write a faster program yourself.

There are no empirical measurements that I know of, but I would not be surprised if it is a hundred times faster to write a Futhark program than the corresponding OpenCL program. CUDA fares a little better, but not by much. So even if your hand-written program might be twice as fast as Futhark, do you really have the time to write it in the first place? And if you later want to make a small change to its logic (say, adding another parallel loop on top), you may need to rework all of your optimisations from scratch.

OskarS · on Oct 25, 2019

In the linked post, the author tests Futhark running on the CPU without any parallelism or low-level optimizations, and it still beat GNU wc.

Fair enough about apples to oranges, but it was really distasteful to me that the top comment on this post was about how C is “unbeatable”, when the article clearly showed that Futhark was faster. And since we’re at the dawn of a new age of parallel computing, statements like that are absurd in general.

mlyle · on Oct 25, 2019

> In the linked post, the author tests Futhark running on the CPU without any parallelism or low-level optimizations, and it still beat GNU wc.

Barely...

It also mmaps in a whole 100 meg file. :P wc is not optimized to count as fast as possible, resource use be damned.

penagwin · on Oct 25, 2019

Yeah this is what I'd like to see, Futhark vs Cuda.

If your application is better suited for GPUs of course it'll be faster than C on the cpu. GPU mining and AI training are done on the GPU for a reason, and they of course beat "C" on the cpu.

gilmi · on Oct 25, 2019

https://futhark-lang.org/performance.html

gridlockd · on Oct 25, 2019

This is comparing against a high-level library (Thrust) that offers a comparable level of convenience.

That's fair, but it tells you nothing about the performance gap introduced by these high-level abstractions.

Athas · on Oct 25, 2019

There is also a comparison against HotSpot, which is a hand-written GPU program (albeit imperfectly).

Futhark does not outperform expertly hand-optimised GPU code, but most of the GPU code found in the wild is hardly expertly hand-optimised. Futhark comes out on top surprisingly often, but can be solidly beaten for complex algorithms or clever implementations. See figure 8 in this paper for examples: https://futhark-lang.org/publications/ppopp19.pdf

shrubble · on Oct 25, 2019

Both lovers of C and its detractors should read the Turing award speech of Tony Hoare http://www.cs.fsu.edu/~engelen/courses/COP4610/hoare.pdf

com · on Oct 25, 2019

I found this an interesting and illuminating read, partially for how some of the basic concerns that we have for language tools like compilers were imagined into being and later internalised by us all, but also for the “fear and horror” around bounds checks.

reikonomusha · on Oct 25, 2019

I’d call a language like Scheme simple. C—especially post ANSI—is veritably not simple.

laughingman2 · on Oct 25, 2019

Rust.

popee · on Oct 25, 2019

C is Sparta! :-)