Eridrus's favorites | Hacker News

Sparse stuff is not efficient to compute unless it's really, really sparse, or at least partially dense in chunks large enough that your hardware can do its job efficiently. If an L1 hit is like counting to three or four, depending on the arch, a full cache miss is like counting to 200+. If you miss your cache all the time (which with sparse stuff you will), things get really slow. And that's _before_ you consider that threads in a GPU warp can't really take different branches without being serialized, that non-coalesced memory access absolutely crushes GPU memory throughput, and that CPUs have to flush their pipeline on a branch misprediction, so you want very predictable branches. It all looks good on paper, but most researchers do not have the engineering chops to properly validate these ideas in practice.
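
To put rough numbers on "really, really sparse", here is a back-of-envelope sketch using the latencies quoted above. It is a deliberately crude model, not a benchmark: it assumes every dense access hits L1, every sparse access is a full cache miss, and it ignores prefetching, bandwidth limits, index overhead, and branch costs. The function names and the 4-cycle and 200-cycle constants are illustrative assumptions, not measurements of any particular chip.

```python
# Back-of-envelope cost model. The latencies are the rough figures from the
# comment above, not measurements of real hardware.
L1_HIT_CYCLES = 4        # "counting to three or four"
CACHE_MISS_CYCLES = 200  # "counting to 200+"


def dense_cost(n_elements: int) -> int:
    """Cycles to touch every element, assuming every access hits L1."""
    return n_elements * L1_HIT_CYCLES


def sparse_cost(n_elements: int, density: float) -> int:
    """Cycles to touch only the nonzeros, assuming every access misses."""
    nonzeros = round(n_elements * density)
    return nonzeros * CACHE_MISS_CYCLES


def break_even_density() -> float:
    """Fraction of nonzeros at which both approaches cost the same."""
    return L1_HIT_CYCLES / CACHE_MISS_CYCLES


if __name__ == "__main__":
    n = 1_000_000
    for density in (0.5, 0.1, 0.02, 0.01, 0.001):
        ratio = sparse_cost(n, density) / dense_cost(n)
        print(f"density {density:>6.1%}: sparse/dense cost = {ratio:.2f}x")
```

Under these assumptions the sparse version only breaks even once about 2% or fewer of the elements are nonzero, and any realistic overhead pushes that threshold lower. Block-sparse layouts help because accesses within a dense chunk behave like the cheap case rather than the expensive one.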