Could probably squeeze a bit more out of the C++ version by targeting the specific architecture of the CPU to make use of SSE.
Not only that: from browsing the code, the critical loop is likely matrix multiplication. If that's the case, any engine that is smart about SSE, cache lines, and so on is going to outperform simple C/C++ code.
Of course, there are excellent matrix maths libraries for C/C++ that could be used instead.
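For context, a minimal sketch of the sort of SSE-accelerated inner loop being suggested, using SSE intrinsics for one row-times-vector dot product (single-precision floats assumed; the function name and signature are mine, not the program's):

    #include <xmmintrin.h>  // SSE intrinsics

    // Multiply one row of the matrix (length n, n a multiple of 4) by the
    // input vector, accumulating four partial sums in one SSE register.
    float dotRowSSE(const float* row, const float* x, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(row + i),
                                             _mm_loadu_ps(x + i)));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);   // horizontal sum of the four lanes
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }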
More than half of the running time seems to be taken up by the
generation of normally-distributed random numbers. Sort of makes
sense, I suppose, since that bit has a loop and a `sqrt' and a `log'
in it.
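That loop-plus-sqrt-plus-log pattern looks like the Marsaglia polar
method. A minimal sketch of that method (my reconstruction, not the
program's actual code) shows where the cost goes:

    #include <cstdlib>
    #include <cmath>

    // Marsaglia polar method: the rejection loop plus the sqrt/log at
    // the end are exactly the bits that show up in the profile.
    double gaussRand()
    {
        double u, v, s;
        do {
            u = 2.0 * std::rand() / RAND_MAX - 1.0;
            v = 2.0 * std::rand() / RAND_MAX - 1.0;
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        return u * std::sqrt(-2.0 * std::log(s) / s);
    }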
The repeated calls to `exp' seem to take up some time too.
As for the matrix multiplication, that only happens on startup, so
it's surely irrelevant. The bit that runs a lot just does
matrix*vector. It is rather hard to make that cache-unfriendly, as it
just walks forwards through all inputs and outputs. In any event I
would think that the program's entire working set will fit in L1.
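For reference, the matrix*vector loop in question is essentially the
textbook one, something like this (the signature is illustrative, not
copied from the program):

    // Plain matrix*vector: walks the matrix row by row and the input
    // vector front to back, so every access is a sequential, forward read.
    void multMatVec(const float* mat, const float* in, float* out,
                    int rows, int cols)
    {
        for (int r = 0; r < rows; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < cols; ++c)
                sum += mat[r * cols + c] * in[c];
            out[r] = sum;
        }
    }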
I was merely fiddling with this out of interest, so I didn't spend
ages SSE2ifying it. The VC++ x64 compiler doesn't do inline assembly
language anyway. But if you halve the number of multiply-adds that
`multMatVec' does (on the assumption that this roughly models an SSE2
implementation running twice as fast), it makes no noticeable
difference to the overall running time.
(I was fiddling with the single-threaded version, using Visual Studio
2010, compiling for x64.)
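The halving experiment amounts to nothing more than this change to the
inner loop of the sketch above; the result is wrong, of course, but it
bounds the best case an SSE2 rewrite could achieve:

    // Skip every other column: half the multiply-adds, as a crude
    // stand-in for an inner loop that runs twice as fast.
    for (int c = 0; c < cols; c += 2)
        sum += mat[r * cols + c] * in[c];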