On the Nvidia A100, standard FP32 performance is about 19.5 TFLOPS, but if you use the tensor cores and the ML-oriented number formats it peaks out at 300+ TFLOPS. Not exactly your question, but a useful reference point.
Now, the accelerator in the M1 (the Neural Engine) is only rated at 11 TOPS. So it's definitely not trying to compete as an accelerator for training.
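To put those figures side by side, here's a back-of-envelope sketch. Note the units aren't strictly comparable (FP32 TFLOPS vs. tensor-core FP16 TFLOPS vs. the Neural Engine's TOPS), so treat the ratios as order-of-magnitude indicators only:

```python
# Rough peak-throughput comparison of the figures quoted above.
a100_fp32_tflops = 19.5      # A100 standard FP32 peak
a100_tensor_tflops = 312.0   # A100 FP16 tensor-core peak (dense)
m1_ane_tops = 11.0           # Apple M1 Neural Engine, trillion ops/s

print(f"Tensor cores vs. plain FP32: {a100_tensor_tflops / a100_fp32_tflops:.0f}x")
print(f"A100 tensor peak vs. M1 Neural Engine: {a100_tensor_tflops / m1_ane_tops:.0f}x")
```

Even ignoring the unit mismatch, the gap is well over an order of magnitude, which is the point being made.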