Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Is this like the Karatsuba algorithm, where it's theoretically faster but not actually faster when run on real hardware?

Btw, it's worth noting that if you know that the result will be symmetric (such as is the case for X * X^T), you can make things faster. For example in cuBLAS, cublas*syrk (the variant optimized for when the result is symmetric) IME isn't faster than gemm, so what you can do instead is just do smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle to the other one.



Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.


They tested it against BLAS primitives and found theirs to be faster, in hardware.


The mention 5% improvements and small matrices in the abstract, so my gut says (I haven’t read the actual paper yet) that is probably is a practical-type algorithm.


Since this is 2x2 there's simd instructions that can do this (or with two simd dot products) both on CPU and inside each GPU core. So with current hardware you won't beat writing this out manually.


It’s faster for matrices of approximately at least ~256x256, though it depends on hardware


>where it's theoretically faster but not actually faster when run on real hardware

Aka a galactic algorithm:

https://en.wikipedia.org/wiki/Galactic_algorithm


I feel like a galactic algorithm is slow even theoretically, unless you work with out-of-this-world data.

Whereas many algorithms are theoretically fine for real data but don't make good use of cache, or instruction sets...


[flagged]


I figured it might, but I think that this is a top of mind question for people and would be nice to make clear in the comments of the post too. So often there’s some theoretical improvement on multiplication that isn’t actually practical. Regardless, they don’t seem to have posted results for CUDA, which is arguably more important than CPU multiplication which is what they tried


Probably why they tried to address it in the abstract already


Let's have more comments from domain experts comparing algorithms in the abstract to cuBLAS, and less like these. Thanks!

If they're wrong to speculate, well, there's a whole paper you can just go skim to find the bit that rebuts them.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: