Nice that this will be open sourced. Maybe then the true performance characteristics can be better understood.
The press release seems exaggerated or written by marketing, e.g. claiming "accelerating optimization exponentially". Shouldn't the speedup be on the order of the number of compute cores, i.e. roughly linear rather than exponential?
You could also check WorldCat to see if a library near you offers the ebook for lending. Universities typically allow the general public to walk in and look at books without registering.
I have, yes. I can't speak for OpenBLAS or MKL, but I'm familiar with Eigen's and nalgebra's implementations to some extent.
nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. This is great for small matrices, but it scales poorly to larger ones, where the lack of cache reuse starts to hurt.
Eigen uses blocking for most decompositions, other than the eigendecomposition, but it doesn't have a proper threading framework. The only operation that is properly multithreaded is matrix multiplication, via OpenMP (plus the unstable Tensor module, which uses a custom thread pool).
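To make the blocking point concrete, here's a minimal sketch in plain Rust (mine, not nalgebra's or Eigen's actual code): an unblocked, column-at-a-time Cholesky kernel, plus a blocked driver that reuses it on small diagonal blocks and pushes the bulk of the O(n^3) work into a GEMM-like trailing update, which is where the cache reuse comes from on large matrices. The block size of 4 is arbitrary for the demo; real libraries tune it to the cache.

    // Unblocked Cholesky of the b x b diagonal block starting at (j0, j0),
    // in place on the lower triangle of a row-major n x n matrix. With
    // j0 = 0 and b = n this is exactly the column-at-a-time strategy.
    fn chol_block(a: &mut [f64], n: usize, j0: usize, b: usize) {
        for j in j0..j0 + b {
            let mut d = a[j * n + j];
            for k in j0..j {
                d -= a[j * n + k] * a[j * n + k];
            }
            let d = d.sqrt();
            a[j * n + j] = d;
            for i in j + 1..j0 + b {
                let mut s = a[i * n + j];
                for k in j0..j {
                    s -= a[i * n + k] * a[j * n + k];
                }
                a[i * n + j] = s / d;
            }
        }
    }

    // Blocked (right-looking) variant: factor a small diagonal block,
    // triangular-solve the panel below it, then fold everything into a
    // GEMM-like trailing update, the step that gets the cache reuse.
    fn cholesky_blocked(a: &mut [f64], n: usize, block: usize) {
        let mut j0 = 0;
        while j0 < n {
            let b = block.min(n - j0);
            chol_block(a, n, j0, b); // 1. factor L11
            // 2. panel: L21 = A21 * L11^-T (forward substitution per row)
            for i in j0 + b..n {
                for j in j0..j0 + b {
                    let mut s = a[i * n + j];
                    for k in j0..j {
                        s -= a[i * n + k] * a[j * n + k];
                    }
                    a[i * n + j] = s / a[j * n + j];
                }
            }
            // 3. trailing update: A22 -= L21 * L21^T (lower triangle only)
            for i in j0 + b..n {
                for jj in j0 + b..=i {
                    let mut s = 0.0;
                    for k in j0..j0 + b {
                        s += a[i * n + k] * a[jj * n + k];
                    }
                    a[i * n + jj] -= s;
                }
            }
            j0 += b;
        }
    }

    fn main() {
        let n = 8;
        // symmetric positive definite test matrix: A = B B^T + n*I
        let mut a = vec![0.0f64; n * n];
        for i in 0..n {
            for j in 0..n {
                let mut s = if i == j { n as f64 } else { 0.0 };
                for k in 0..n {
                    s += ((i * k + 1) as f64).sin() * ((j * k + 1) as f64).sin();
                }
                a[i * n + j] = s;
            }
        }
        let mut a2 = a.clone();
        chol_block(&mut a, n, 0, n);     // unblocked
        cholesky_blocked(&mut a2, n, 4); // blocked, block size 4
        for i in 0..n {
            for j in 0..=i {
                assert!((a[i * n + j] - a2[i * n + j]).abs() < 1e-9);
            }
        }
        println!("unblocked and blocked factors agree");
    }

Both paths compute the same factor; the difference is purely in how the work is scheduled against the memory hierarchy.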
I skimmed your post, and I wonder whether Mojo is focusing on such small 512x512 matrices. What is your thinking on generalizing your results to larger matrices?
I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyway. Small matrix multiplies emphasize compiler/code-generation quality. Even vanilla Python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.
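A toy sketch of that structure in Rust (my own, not Mojo's code; the 64x64 tile size is an arbitrary assumption, real kernels tune it to registers and cache): a big row-major GEMM written as glue around a small fixed-size tile multiply. If the compiler generates good code for the inner tile, the big multiply inherits most of that quality.

    const T: usize = 64; // tile size; assumed here, real kernels tune this

    // small "building block" multiply: C tile += A tile * B tile
    // (row-major n x n matrices, tiles addressed by their top-left offsets)
    fn matmul_tile(c: &mut [f64], a: &[f64], b: &[f64], n: usize,
                   i0: usize, j0: usize, k0: usize) {
        for i in i0..(i0 + T).min(n) {
            for k in k0..(k0 + T).min(n) {
                let aik = a[i * n + k];
                // stride-1 inner loop over the C/B tile row
                for j in j0..(j0 + T).min(n) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }

    // big multiply = glue code looping over small multiplies
    fn matmul(c: &mut [f64], a: &[f64], b: &[f64], n: usize) {
        for i0 in (0..n).step_by(T) {
            for k0 in (0..n).step_by(T) {
                for j0 in (0..n).step_by(T) {
                    matmul_tile(c, a, b, n, i0, j0, k0);
                }
            }
        }
    }

    fn main() {
        let n = 512;
        let a: Vec<f64> = (0..n * n).map(|x| (x % 7) as f64).collect();
        let b: Vec<f64> = (0..n * n).map(|x| (x % 5) as f64).collect();
        let mut c = vec![0.0; n * n];
        matmul(&mut c, &a, &b, n);
        println!("c[0] = {}", c[0]);
    }

The outer three loops here are pure glue, so their overhead is amortized over T^3 useful flops per tile call.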
According to this Crunchbase data [1], it has a lot per capita.
[1] https://news.crunchbase.com/startups/countries-most-startup-...