
Not up to date on a lot of "AI"/"ML" things; why isn't this significant for medium/large neural networks as well?



RTX 3090 theoretical matmul is 142 TFlops, i.e. about 100x this.


The RTX 3090 has 35.58 TFlops FP32 performance, or 285.48 TFlops FP16, according to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

EDIT: I fell for NVIDIA's marketing. The dense FP16 performance is only half of 285.48, which is about 142. Thanks to adgjlsfhk1 for the correction.


That 285 is listed as (2:1 sparse), which means it's only valid for matrices where 2 out of every 4 numbers are zero. For dense matrices it's half that.


Are 2:1 sparse matrices a common thing? It seems weird, like clearly that’s not sparse enough to want to use, like, sparse matrix “CSR” style storage or something, haha. I would just treat it as dense I guess.


They aren't. As far as I can tell, Nvidia does this to be able to double the number of TFlops they put on their website. (This might be a little unfair; the real reason is that in ML it might be possible to train a NN such that your matrices have this structure, but I haven't seen anyone other than Nvidia use it.)


What you might do is train using dense matrices, then sparsify those (pick the 2 out of each set of 4 weights that are closest to zero, mask them out), then do a few more training iterations with the mask in place.

It turns out that even without the extra training iterations you often lose surprisingly little in terms of quality of output. In reality you can sparsify a lot more, but 2 out of 4 is so simple and easy to implement in hardware that more complex schemes are much harder to justify.

However, small matmuls (say, <2048 bytes in the K dimension) won't get anywhere near 2x performance.
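
For concreteness, here's a minimal PyTorch sketch of that 2-out-of-4 masking step (the function name is mine, and it just groups the last dimension in fours for simplicity; a real implementation would typically group along the matmul's reduction dimension):

    import torch

    def sparsify_2_of_4(weight):
        # For each group of 4 weights, keep the 2 with the largest magnitude
        # and zero out the other 2 (2:4 structured sparsity).
        w = weight.reshape(-1, 4)
        keep = w.abs().topk(2, dim=1).indices      # 2 largest |w| per group
        mask = torch.zeros_like(w)
        mask.scatter_(1, keep, 1.0)                # 1 where we keep, 0 where we prune
        return (w * mask).reshape(weight.shape)

    W = torch.randn(8, 16)                         # inner dimension must be a multiple of 4
    W_pruned = sparsify_2_of_4(W)

After that you'd run the extra fine-tuning iterations with the mask held fixed, as described above.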


I’m trying to think of cases where it might accidentally come up, and all I can think of is something like “oops I used complex but my values are actually real.”


There has been some work in that direction but it hasn't really caught on as fast as NVIDIA may have expected it to.


Yeah, still waiting for this feature to be available in PyTorch natively.


The 1.5 TFlops here is for a single core, though. So if we assume that a performance core on an M1 is around 7.5 watts (I'm not actually sure; it seems like a reasonable upper bound, though, if a whole M1 mini is around 39 watts), we'd be looking at around 750 watts to match. Which seems like a surprisingly non-crazy amount of power given these are 32-bit flops, unlike the 16-bit ones in the RTX 3090, and they come from a CPU.
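
Spelling out that back-of-the-envelope math (the 142 TFlops and 7.5 W figures are the ones from this thread; the 1.5 TFlops per core is the number being discussed):

    # rough scaling, not a real benchmark
    rtx_dense_fp16_tflops = 142.0   # RTX 3090 dense fp16, from above
    amx_core_tflops = 1.5           # per-core figure being discussed
    core_watts = 7.5                # assumed upper bound per M1 performance core

    cores_needed = rtx_dense_fp16_tflops / amx_core_tflops   # ~95 cores
    total_watts = cores_needed * core_watts                  # ~710 W, i.e. roughly 750 W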


This code runs on the AMX co-processor. From the article:

> An important distinction is that the AMX:CPU ratio is not 1:1; not every core has its own AMX co-processor.

My understanding is there's only 1 of those per regular M1 CPU, maybe 4 on the largest one (Ultra).


I tried gemm-benchmark on my M1 Max, and it took 22W to hit 2.2 TFlops with AMX (Accelerate), or 36W to hit 270 GFlops with NEON (OpenBLAS).

So that's actually just about as power-efficient for fp32 as a 3090, which according to Wikipedia is 35 TFlops at 350W. Supposedly AMX can do fp16 at 2x its fp32 rate, as opposed to the 3090's 4x, so it's maybe 2x less efficient than a 3090 for fp16.

Interestingly, fp64 hits 370 GFlops at 15W...
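
Doing the division on those numbers (all figures quoted from this comment and the Wikipedia value above):

    # GFlops per watt
    m1_amx_fp32  = 2200 / 22      # ~100  (M1 Max, Accelerate/AMX)
    m1_neon_fp32 = 270 / 36       # ~7.5  (M1 Max, OpenBLAS/NEON)
    rtx_fp32     = 35000 / 350    # ~100  (RTX 3090, fp32)
    m1_amx_fp64  = 370 / 15       # ~25   (M1 Max AMX, fp64)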



