Here are my early experiments at making a pure-julia multi-threaded BLAS using LoopVectorization.jl https://github.com/MasonProtter/Gaius.jl. It absolutely blows a naive triple for loop out of the water and is quite competitive against OpenBLAS until you get to very big sizes.
For small, statically sized arrays, LoopVectorization + triple loops is also much faster than MArrays. LoopVectorization doesn't support SArrays yet, because you can't get pointers to them.
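For context, a minimal sketch of what that looks like — a plain triple-loop matmul annotated with LoopVectorization's `@avx` macro, operating on MArrays (the exact function and matrix sizes here are illustrative, not from Gaius.jl itself):

```julia
using LoopVectorization, StaticArrays

# Naive triple loop; @avx vectorizes and unrolls it.
function mul_avx!(C, A, B)
    @avx for i in axes(A, 1), j in axes(B, 2)
        Cij = zero(eltype(C))
        for k in axes(A, 2)
            Cij += A[i, k] * B[k, j]
        end
        C[i, j] = Cij
    end
    return C
end

A = @MMatrix rand(8, 8)
B = @MMatrix rand(8, 8)
C = MMatrix{8,8,Float64}(undef)
mul_avx!(C, A, B)   # works: MArrays are mutable, so pointers exist
# mul_avx!(C, SMatrix(A), SMatrix(B)) would fail: no pointer to an SArray
```

The MArray version works because `@avx` lowers array accesses to pointer loads/stores, which is exactly why immutable SArrays (which have no memory address you can take a pointer to) aren't supported.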
MArrays will be stack allocated if they don't escape.
One of my in-development packages also uses its own "stack" (an mmapped chunk of memory), so that it can have pointers to fast "stack-allocated" arrays.
I played around with LLVM's alloca a bit, but it seems like I could only ever use a single alloca at a time; if I ever used more than one, LLVM would just return the same pointer each time instead of incrementing it.
If I have to manage incrementing the pointers myself anyway, I may as well use my own stack, too.
For the problems I have tested and tuned it on, LoopVectorization produces faster code than C/Fortran, e.g.:
https://chriselrod.github.io/LoopVectorization.jl/latest/exa...
But it may be fairer to compare it with plutocc. In my early tests (which involved much larger problem sizes), plutocc does a lot better, because (unlike LoopVectorization) it seems to consider memory/caches rather than just register allocation and instruction costs.