That 285 is listed as (2:1 sparse), which means it's only valid for matrices where 2 out of every 4 numbers are zero. For dense matrices it's half that.
Are 2:1 sparse matrices a common thing? It seems weird, like clearly that’s not sparse enough to want to use, like, sparse matrix “CSR” style storage or something, haha. I would just treat it as dense I guess.
They aren't. As far as I can tell, Nvidia does this to be able to double the number of TFlops they put on their website. (This might be a little unfair; the real reason is that in ML it might be possible to train a NN such that your matrices have this structure, but I haven't seen anyone other than Nvidia use it.)
What you might do is train using dense matrices, then sparsify those (pick the 2 out of each set of 4 weights that are closest to zero, mask them out), then do a few more training iterations with the mask in place.
It turns out that even without the extra training iterations you often lose surprisingly little in terms of quality of output. In reality you can sparsify a lot more, but 2 out of 4 is so simple and easy to implement in hardware that more complex schemes are much harder to justify.
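To make that concrete, here's a rough NumPy sketch of that sparsification step (my own illustration, not Nvidia's actual tooling; I'm assuming you prune in groups of 4 along the last dimension):

```python
import numpy as np

def prune_2_of_4(w):
    """Zero out the 2 smallest-magnitude weights in each group of 4.

    Hypothetical illustration of 2:4 structured sparsity; assumes the
    last dimension of w is a multiple of 4.
    """
    groups = w.reshape(-1, 4)
    # indices of the 2 largest-magnitude entries in each group of 4
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(8, 8).astype(np.float32)
w_sparse = prune_2_of_4(w)  # exactly 2 of every 4 consecutive values are now zero
```

In a real setup you'd keep a mask like that around and reapply it after each of the extra training iterations, so the pruned weights stay zero.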
However, small matmuls (say, <2048 bytes in the K dimension) won't get anywhere near 2x performance.
I’m trying to think of cases where it might accidentally come up, and all I can think of is something like “oops, I used complex but my values are actually real” (in which case every imaginary part, i.e. half the stored values, would be zero).
The 1.5 here is for a single core, though. So if we assume that a performance core on an M1 is around 7.5 watts (I’m not actually sure, but that seems like a reasonable upper bound if a whole M1 mini is around 39 watts), we’d be looking at around 750 watts to match the 3090. Which seems like a surprisingly non-crazy amount of power, given these are 32-bit flops, unlike the 16-bit ones in the RTX 3090, and they come from a CPU.
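Spelling out that back-of-the-envelope (my reading is that "to match" means the 3090's dense fp16 figure, i.e. half of that 285):

```python
# Back-of-the-envelope, using the numbers from this thread
target_tflops   = 285 / 2    # ~142 TFlops, the 3090's dense fp16 tensor rate
tflops_per_core = 1.5        # fp32 AMX per M1 performance core
watts_per_core  = 7.5        # the (admittedly rough) per-core guess above

cores_needed = target_tflops / tflops_per_core   # ~95 cores
watts_needed = cores_needed * watts_per_core     # ~710 W, i.e. "around 750 watts"
print(round(cores_needed), round(watts_needed))
```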
I tried gemm-benchmark on my M1 Max, and it took 22 W to hit 2.2 TFlops with AMX (Accelerate) or 36 W to hit 270 GFlops with NEON (OpenBLAS).
So that's actually just about as power-efficient for fp32 as a 3090, which according to Wikipedia is 35 TFlops in 350 W. Supposedly AMX can do 2x its fp32 rate for fp16, as opposed to the 3090's 4x, so maybe 2x less efficient than a 3090 for fp16.
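For what it's worth, here's the rough perf-per-watt math from those figures (the fp16 rates are just the multipliers claimed above, not anything I measured):

```python
# fp32 efficiency from the figures above, in TFlops per watt
m1_amx   = 2.2 / 22     # ~0.10 TFlops/W (AMX via Accelerate)
rtx_3090 = 35  / 350    # ~0.10 TFlops/W (Wikipedia's fp32 number)

# fp16, assuming AMX doubles its fp32 rate and the 3090 quadruples it,
# which is where the "maybe 2x less efficient" guess comes from
m1_amx_fp16   = 2 * m1_amx      # ~0.20 TFlops/W
rtx_3090_fp16 = 4 * rtx_3090    # ~0.40 TFlops/W
```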