Could probably squeeze a bit more out of the C++ version by targeting the specific architecture of the CPU to make use of SSE.
Not only that: from browsing the code, the critical loop is likely matrix multiplication. If that's the case, any engine that is smart about SSE, cache lines, and so on is going to outperform simple C/C++ code.
Of course, there are excellent matrix maths libraries for C/C++ that could be used instead.
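For context, a minimal sketch of the sort of SSE-accelerated inner loop being suggested, using SSE intrinsics for one row-times-vector dot product (single-precision floats assumed; the function name and signature are mine, not the program's):

    #include <xmmintrin.h>  // SSE intrinsics

    // Multiply one row of the matrix (length n, n a multiple of 4) by the
    // input vector, accumulating four partial sums in one SSE register.
    float dotRowSSE(const float* row, const float* x, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(row + i),
                                             _mm_loadu_ps(x + i)));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);   // horizontal sum of the four lanes
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }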
More than half of the running time seems to be taken up by the
generation of normally-distributed random numbers. Sort of makes
sense, I suppose, since that bit has a loop and a `sqrt' and a `log'
in it.
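That loop-plus-sqrt-plus-log pattern looks like the Marsaglia polar
method. A minimal sketch of that method (my reconstruction, not the
program's actual code) shows where the cost goes:

    #include <cstdlib>
    #include <cmath>

    // Marsaglia polar method: the rejection loop plus the sqrt/log at
    // the end are exactly the bits that show up in the profile.
    double gaussRand()
    {
        double u, v, s;
        do {
            u = 2.0 * std::rand() / RAND_MAX - 1.0;
            v = 2.0 * std::rand() / RAND_MAX - 1.0;
            s = u * u + v * v;
        } while (s >= 1.0 || s == 0.0);
        return u * std::sqrt(-2.0 * std::log(s) / s);
    }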
The repeated calls to `exp' seem to take up some time too.
As for the matrix multiplication, that only happens on startup, so
it's surely irrelevant. The bit that runs a lot just does
matrix*vector. It is rather hard to make that cache-unfriendly, as it
just walks forwards through all inputs and outputs. In any event I
would think that the program's entire working set will fit in L1.
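For reference, the matrix*vector loop in question is essentially the
textbook one, something like this (the signature is illustrative, not
copied from the program):

    // Plain matrix*vector: walks the matrix row by row and the input
    // vector front to back, so every access is a sequential, forward read.
    void multMatVec(const float* mat, const float* in, float* out,
                    int rows, int cols)
    {
        for (int r = 0; r < rows; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < cols; ++c)
                sum += mat[r * cols + c] * in[c];
            out[r] = sum;
        }
    }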
I was merely fiddling with this out of interest, so I didn't spend
ages SSE2ifying it. The VC++ x64 compiler doesn't do inline assembly
language anyway. But if you halve the number of multiply-adds that
`multMatVec' does (on the assumption that this roughly models an SSE2
implementation running twice as fast), it makes no noticeable
difference to the overall running time.
(I was fiddling with the single-threaded version, using Visual Studio
2010, compiling for x64.)
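The halving experiment amounts to nothing more than this change to the
inner loop of the sketch above; the result is wrong, of course, but it
bounds the best case an SSE2 rewrite could achieve:

    // Skip every other column: half the multiply-adds, as a crude
    // stand-in for an inner loop that runs twice as fast.
    for (int c = 0; c < cols; c += 2)
        sum += mat[r * cols + c] * in[c];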