Nodejs+Fabric as fast as multi-threaded C++ (fabric-engine.com)
102 points by FabricPaul on Nov 12, 2011 | 66 comments



Not that it diminishes the interestingness of this post (quite the opposite), but it's worth noting that this is not achieved using plain-vanilla JavaScript. From the product page:

"The high-performance parts of the application are written using a performance-specific extension to JavaScript, called KL (Kernel Language). This language is similar in scope and syntax to JavaScript, but has some key differences that optimize it for writing high-performance code.

Fabric applications are described as a dependency graph (think of a flow chart). The dependency graph describes data, and the transformations that must happen to that data. Fabric analyses the graph to discover where it is possible to perform tasks in parallel (task-based decomposition). Fabric analyses the graph to discover where it is possible to perform the same instruction on lots of data at the same time.

All of this is possible because Fabric has the LLVM compiler embedded within it. This means that applications are dynamically compiled on target. The Fabric plug-in has to be written for each platform, but Fabric applications only have to be written once.

Fabric handles CPU multi-threading automatically, but the developer must explicitly write code for the GPU using OpenCL."


In other words, a better headline would be "Fabric Engine fast as multithreaded C++ (and can be glued to JS)"


And, apart from a stray use of new and delete, this looks almost like valid C99 code to me. Not that that diminishes the achievement any, of course. I'd quibble mildly about his choice of compile flags and make some minor stylistic changes to the code (like the tacky transpose :), but I don't think it would affect the benchmark results any.

Given the code used, it looks like it might be straightforward to use OpenMP rather than pthreads. I'd be interested in seeing how that works out.


I would agree. To me it looks like they have made an easier way to create programs that process data in parallel efficiently, and made it easy to use them from dynamic languages.

It's not so much "JavaScript is as fast as C++" as "our system is as fast as C++ and you can use it from JavaScript". They are using "Node.js" and "JavaScript" in the title for the marketing angle if anything, because people don't use Node.js and JavaScript if they are trying to write parallel data-processing programs.


This is accurate - our intent is to do the same thing with Ruby and Django. We are doing additional work on the server side so that it makes more sense to work with node.

We picked node as a first target because we were already using it for our client-side work - it was a very fast integration.


It is something that JavaScript can never solve: objects are objects; there have to be some constructs to make them work, so it will never be as fast as C code.

The other blog post that appeared on HN this week sums it up [1].

For those who didn't read that post: basically, if you want the performance of C, you have to make the data static and inflexible, i.e. use types and avoid indirection. Looking at the KL used in the benchmark [2], you may find that it resembles C code. That's exactly what they did to make it fast: types, and freedom from indirection.

[1]: http://blog.mrale.ph/post/12396216081/the-trap-of-the-perfor...

[2]: https://github.com/fabric-engine/Benchmarks/blob/master/Serv...


Lua is similar to JavaScript in that respect, and LuaJIT2 has effectively solved this problem for Lua. Code that's perfectly dynamic is almost on par with C, with variables switching type from int to double to string as needed - with essentially no restrictions on what you use.

JavaScript is more braindamaged in its dynamicness, so it's harder to write something like LuaJIT2 for JavaScript -- but it's not impossible.


That's incorrect. Objects aren't "objects": in imperative languages they're largely syntactic sugar for passing a struct as a this pointer and providing a vtable. They don't actually implement many of the ideas from OO theory, such as sending messages to objects.

Look at a language like OCaml, which can sometimes beat C or C++.

Most of the reason why C is fast is that it mirrors the hardware so closely, allowing people to write reasonably optimized portable assembler. C falls down on macro optimizations, such as inlining function pointers, that are available to higher-level languages like OCaml. Sometimes the macro and runtime optimizations are far more important than the micro optimizations.

You don't need static typing or lack of object support to write fast code. OCaml beats C providing both inferred typing and object support.


Have a link to benchmarks where Ocaml beats C? In my experience Ocaml is usually only about 2x slower than C, but never beating it. Also, Ocaml is statically typed. Type inference is just inferring the static types.


http://shootout.alioth.debian.org/u32/benchmark.php?test=all...

http://flyingfrogblog.blogspot.com/2009/07/ocaml-vs-f-burrow...

You're right about OCaml generally being 1.5X slower than C but it does beat it for some problems which is impressive for all the additional features it provides. I must have been looking at the numbers wrong last time I checked.

Type inference is much different than static typing because it allows functions to be specialized at run/compile time which can result in having to write much less code.

For example, in C a map function would have to be written for every datatype (or lose the benefits of static typing by using a void*) where as in OCaml/F# you get type safety and specialized / inlined code for free.


> Type inference is much different than static typing because it allows functions to be specialized at run/compile time which can result in having to write much less code.

You are confused: type inference is purely about determining what type something is. It can determine a function is polymorphic, but this has nothing to do with how the polymorphism is implemented. AFAIK Ocaml (by that I mean the INRIA Ocaml implementation) doesn't specialize polymorphic functions at all. Types are boxed and the boxes are all the same size, so a single function definition is all that is needed for a polymorphic function or type. .NET does do these things, but again, that has nothing to do with type inference. There is no difference between me specifying a function has type 'a -> 'b and the compiler inferring that.


This is correct. The most common algorithm used for type inference is Hindley-Milner (http://en.wikipedia.org/wiki/Hindley%E2%80%93Milner).


I've never seen Ocaml come even close to C in micro-benchmarks, but in large applications it can win on macro-optimisations - e.g. Deens is significantly faster than BIND with much less code (www.utdallas.edu/~hamlen/Slides/Melange.pdf)


Hi there - yes, it's partly about marketing message. When we did early validation, people were concerned about a 'custom language' - the reality is that the majority of the application is written in JS - the performance parts are in KL. KL itself is almost identical to JS, but right now you have to declare types. In the future we may switch to inferred types, at which point code will be indistinguishable from JS. Just don't use closures :)


> Just don't use closures :)

So very much not Javascript then


Closures break the high-performance part of things. If we could have just used JavaScript, we would have done so - KL is as close to JavaScript as we can get, and still offer native, multi-threaded performance. If we introduce inferred types, then the only difference will be 'no closures'.


That makes sense, I'm sure it is hard to optimize closures and untyped code. However, if you remove those you no longer have JavaScript, so it is incorrect to say that you have made JavaScript fast; you've just created a different language and made that fast. Of course if it plays nicely with JavaScript that's cool, but please say that!


I agree. I have made a separate post on this thread that calls this out.

Thanks -Paul


This is all new to me so perhaps I'm misinterpreting it completely...but if Fabric works by analyzing program/data flow to automatically add parallelism, what's stopping it from being applied to a single-threaded C or C++ implementation and raising its performance to the level of the multithreaded-by-hand C++ implementation?

With all respect for the developers' showcase involving a bridge to Javascript, my gut sense is that the world would pay (more) money for a tool that takes single-threaded C code and makes it run like multi-threaded C code.

Also wondering how much of the underlying technology is patented, or if this is an implementation of publicly-accessible (and cheap/free licensable) academic research.

Finally, thanks to FabricPaul for very informative comments in this thread.


Hi - the parallelism is derived from the dependency graph, which provides task and data based (SIMD) parallelism. So, if you describe a bad graph, then you'll get limited concurrency. The performance you get is directly related to how well you can describe concurrency in your graph.

Dependency graphs aren't new - Intel TBB recently introduced a dependency graph model, and it's a well known approach. So if you're looking for a tool for making it easier to run concurrent C++, TBB is a great option.

Companies have tried the 'analyze code and work out how to multi-thread it' approach with limited success - it's pretty hard to do that in a dependable manner. Often you see horrible code snippets being inserted as flags etc - which very quickly gets messy and intrusive. It's a lot easier to describe the parallelism at a high level and let the engine handle it from there.

Our view is that bringing this kind of performance to dynamic languages is a big opportunity. Our reasoning:

- Modern hardware requires developers to write concurrent applications - the days of one big core are long gone. This is hard.

- Dynamic languages are flexible, easy to work with and fast to iterate. They're also very slow compared to native code, let alone multi-threaded code.

- There are a lot of people that know dynamic languages that can't, won't or don't want to work with compiled languages.

That said - Fabric will be free for non-commercial use (students, researchers etc). We open-source everything we build on top of the core engine - all of the extensions, 3D scene graph, rendering etc.

As for patenting - we've filed around some of the client-side stuff (I can't disclose details just yet - sorry). Most of the ideas that went into Fabric are not original - it's really how we've combined them to offer something that's (hopefully) compelling. We were very lucky to come at this problem when there was a perfect convergence of technologies - particularly JavaScript, HTML5 and LLVM.

Apologies for the ramble - been a long (but gratifying) day :)


Right... this would be a lot more interesting if they picked existing open source implementations and ran them unmodified. It's not like people can drop their unmodified software into this and make it scream like what V8 did to plain JavaScript performance.


There's a good reason for that - JavaScript is not a high-performance language. We wouldn't have had to author KL otherwise :) The high-performance parts of the application use a strongly-typed, procedural variant of JS (KL). There's no other way to deal with it.

It came up a lot in validation, so we wrote this: http://fabric-engine.com/2011/10/couldnt-you-just-use-javasc...

I hope that's helpful


Is KL open source? In other words, if I modify my JS apps to include KL, am I locked in to you as a vendor? Thanks, interesting post.


Hi there - the engine itself is closed source, everything we build on top of it is open-sourced. So if someone built a competing engine that used KL (not sure why they would!), then you would be free to take your code wherever you liked.

If we move to type inference in the future, then the KL code will be JavaScript. Right now we declare types, but in the future we'd like to just throw a compiler error if we can't infer. That way people will have completely portable code.


> That way people will have completely portable code.

Makes sense about portability, but what is your security model? Are you using static analysis, sandboxing like NaCl, or something else?


It's actually a reason why we went down the 'create your own language for operators' path - we originally wrote our operators in C++, but when we shifted to the browser it was clear that wasn't going to go well :) We looked at sand-boxing, but given how long the NaCl guys took to get it working, we really didn't want to take that on as a start-up. We also wanted something that a web developer would be comfortable working with - so we looked at JS and decided we could create a variant for performance.

We don't give pointers, and we do things client-side like guarded arrays (not necessary on the server). Memory management is handled by the core. I will get our guys to put a post up that covers KL in more detail - security was a major consideration for us.


Interesting, thanks. Looking forward to that post.


I linked this elsewhere on the thread, but you might find it interesting in the meantime: http://fabric-engine.com/2011/10/couldnt-you-just-use-javasc...

There are more details on KL, and the reasoning why we didn't use JavaScript


Thanks, that was interesting.

While I have you here :) , can you please elaborate on your Bullet/Fabric demo? Specifically, how did you get the Bullet C++ code to run on the Fabric Engine? (Since that doesn't use C++?)


Sure - we designed Fabric to be extendable, so it can include existing c++ libraries. So we just integrated the Bullet SDK. http://fabric-engine.com/2011/11/building-extensions-on-wind... covers the extension model in detail.

Does that answer your question properly? I can get more info if needed.


Thanks, that does answer my question.

I asked because I work on compiling C++ into JavaScript, and I was curious if your technology included something to compile C++ into KL.

I assume integrating existing C++ libraries to Fabric Engine has no security model, then?


Right - they have to be 'trusted', much like a plugin. People will have to explicitly choose to install them. Right now we bundle them all in the plugin, but that's just until we write the manager.


Makes sense, thanks again for all the replies.


Any ideas on how this model compares to HipHop (Facebook's PHP-to-C system)?


This is our first 'proper' benchmark - we are going to work on some other problems for comparison. We're getting impressive results with semantic analysis, so we'll publish some results there soon.

Fabric is most impressive when problems lend themselves well to parallelism - this first benchmark test is ridiculously good for us. Other tests are likely to vary significantly - however, even single-threaded we are running fast.

Another note - right now the only parallelism model we offer is the dependency graph, which is great for some problems (particular in 3D). As we move to other models for the server (like MapReduce), we may see better performance for certain classes of problem.


The C++ versions were compiled using gcc version 4.4.5 using the compiler flags “-O6 – lpthread”.

Isn't -O6 the same as -O3 ?


Fortran compilers (gcc has a Fortran 95 frontend) have long offered optimization levels beyond -O3; i.e., -O6 is a decent upper bound among popular free and commercial compilers, not just gcc/gfortran. The author probably came from a HPC background where -O6 is commonly used for compatibility among such a variety of compilers, even when writing C and C++ programs. Technically not the most portable way to build, but it works.


I asked the engineer for a response: "For reference, I used -O6 because it's a historical convention (more of a joke, really) for "optimize the crap out of it". UNIX geeks have been using -O6 in this way for about 30 years."


Yes it was a very WTF moment for me as well. The "C++" code is pretty much straight C, with references being the only C++ feature being used.


It might look like C, but the last time I looked at the C++ standard, that was also valid C++ code.


Yes, -O3 is the maximum. Anything higher (even -O1000) just runs -O3.


Defeating poorly optimized C++ code is far from impressive. I would also recommend NOT running any performance tests on EC2, since they throttle CPU unpredictably.


I wouldn't call that code C++ at all, it's C with a few C++ keywords/niceties. [1],[2]

[1] https://github.com/fabric-engine/Benchmarks/blob/f11cf6cc8cf... [2] https://github.com/fabric-engine/Benchmarks/blob/f11cf6cc8cf...


Yep, and here is the C version: http://pastebin.com/z7apLBHb and its diff: http://pastebin.com/YHjW6AHp

On my machine the C and C++ versions essentially have the same speed.


I've had a few email/tweet exchanges regarding KL, so I thought it would be useful to clarify a few things:

- KL is a language with a syntax that is very close to Javascript. It borrows syntax from JavaScript, but not the rules of the language itself. Just like JavaScript borrows syntax from C, and OpenCL borrows from C.

- There are many things in Javascript we don't support.  Some things we don't currently support (eg. in-line initialization of arrays) but will probably support in the future; other things we probably won't support (eg. regular expressions as language objects) and finally there are things we will never support (ie. closures).

- There are nice features of KL that Javascript does not have, for instance arithmetic operator overloading.  These features are included because they are particularly useful for computational problems.

- KL is not JavaScript++ - It's designed for writing high-performance operator code, not to handle everything that JavaScript can do. We don't want to reinvent the wheel :)

We will work on a post to cover KL in more detail, including roadmap. You can email info at fabric-engine dot com if you have any questions you want to take offline.

Thanks, Paul


Extremely interesting. Since we're benchmarking it against C++, is it safe to assume it performs much better than the JVM?


Yes, I ported it and though the JVM performed well, it is beaten by the optimizing C++ compiler.

https://github.com/spullara/Benchmarks/blob/master/Server/Va...


While this is interesting, not all CPUs are created equal. I can run the var-mt.cpp example on POWER hardware in:

  real    0m25.680s
My point being that node.js doesn't work on this box, nor does Fabric. So when developing specialized algorithms, you sometimes might want to run them in specialized environments.


Right - we would have to build a version of the engine to run on that hardware. Once that's done though, the applications would run. E.g. we prototyped on ARM earlier this year, and our unit tests worked.

That said - if you're building for specialized environments, you're probably going to want to hand-optimize code rather than rely on LLVM to do it. LLVM does a pretty good job though :)


So this software involves installing a plugin in the browser? In an age where users are being warned by the browser makers themselves to be careful about that sort of thing? I can't see this ending well.


Hi there - it depends on who you're targeting. Our client-side focus is on native developers looking for high-performance in web applications e.g. medical visualization http://vimeo.com/31970502 - the benefits of web applications are great, but performance is the current limiting factor.

We are likely to build support for NaCl in the future.

On the server-side, there is no requirement for a client to install a plug-in - it's for high performance on the server, e.g. semantic analysis, compute-bound problems, etc.

Last point - we also think a lot about hybrid models where performance can be accessed on both client and server. This is much longer term, but the design of Fabric allows for it.


p.s. you can see our sample client-side apps here: http://vimeo.com/groups/fabric/videos and play with them at http://demos.fabric-engine.com

You can see how we've extended Fabric to include existing C++ libraries - great for custom data types, streaming data, etc.

It is not really a consumer web technology - we're acutely aware of the plug-in friction/antipathy. However - if you want HPC in the browser, this is one of the only ways to do it.


In this instance - since this is a node app - they are running the engine server side.

But yes, they do have a plugin version of Fabric. I have similar misgivings about that.


Porting the C++ code directly to Java:

  macpro:ValueAtRisk sam$ time java -cp . VarMT
  VaR = -43.7173372179254300

  real    0m36.863s
  user    4m48.944s
  sys     0m0.978s

  macpro:ValueAtRisk sam$ time ./var-mt
  VaR = -43.7173372179254329

  real    0m27.048s
  user    3m33.971s
  sys     0m0.145s

Pretty good showing for such a low level benchmark.

https://github.com/spullara/Benchmarks/blob/master/Server/Va...

It would be moderately interesting to see how well this benchmark would do in OpenCL or the like.

(Benchmarks ran on a 3.33 GHz, 6-core MacPro, JDK 7 Developer Preview)


cool :) Thanks for doing that - we can merge it in at a later date and include Java results.

We expose OpenCL in Fabric as an extension, as we wanted to be able to target heterogeneous hardware architectures. We didn't use this for benchmarking as we wanted to show CPU performance first. For clarity - KL does not compile down to OpenCL, you have to write for the GPU explicitly.


Could probably squeeze a bit more out of the C++ version by targeting the specific architecture of the CPU to make use of SSE.

Also, what floating point type is KL using? float or double? - and is double necessary? - converting the C++ code to use floats would probably provide a fair speedup on the divides and due to squeezing more data into cache lines...


> Could probably squeeze a bit more out of the C++ version by targeting the specific architecture of the CPU to make use of SSE.

Not only that, from browsing the code, the critical loop is likely matrix multiplication. If that's the case, any kind of engine who is smart about SSE, cache lines, etc is going to be able to outperform simple C/C++ code.

Of course there's excellent matrix maths libraries for C/C++ that could be used instead.


More than half of the running time seems to be taken up by the generation of normally-distributed random numbers. Sort of makes sense, I suppose, since that bit has a loop and a `sqrt' and a `log' in it.

The repeated calls to `exp' seem to take up some time too.

As for the matrix multiplication, that only happens on startup, so it's surely irrelevant. The bit that runs a lot just does matrix*vector. It is rather hard to make that cache-incoherent, as it just walks forwards through all inputs and outputs. In any event I would think that the program's entire working set will fit in L1.

I was merely fiddling with this out of interest, so I didn't spend ages SSE2ifying it. The VC++ x64 compiler doesn't do inline assembly language anyway. But if you halve the number of multiply-adds `multMatVec' does, under the assumption that this would make it twice as fast, and that twice as fast would be what an SSE2 implementation would be like, it makes no noticeable difference.

(I was fiddling with the single-threaded version, using Visual Studio 2010, compiling for x64.)


Actually seems like it doesn't...:

Using: -O3 -march=corei7 -msse -msse2 -msse3 -msse4 -fipa-matrix-reorg -fwhole-program

made negligible (if any) difference running var-mt:

Original: 30.464 30.211 30.646

with g++ options: 30.381 30.487 30.277

However, using ICC 12.0.4 (-O3 -xSSSE3): 25.418 25.613 25.392

On the same machine (SB 2.2Ghz).


Is Fabric Engine commercial software? If so, the authors of this post ought to disclose their interests as the owners.


Hi there - apologies. Fabric is a commercial company. Fabric Engine will be free for non-commercial use - we're currently in beta so pricing is not yet finalized. I assumed the username 'FabricPaul' was a good indication, but thanks for calling it out.


"free for non-commercial use" == commercial.


In a quick test I did, using a better compiler (gcc 4.6.2 or icc) makes it 15% faster.


Cool - can you contribute the code to the repository so we can test and merge it in? Thanks


Fabric Server isn't available to play with?


Hi there - not yet. We are going to start the alpha release very soon, though. You can play with the client-side stuff in the meantime - it's essentially the same system. If you sign up to the newsletter, you will get a notification when we drop the FE Server stuff.



