
I am not very good with Python, so there is some chance I am doing something wrong. But my explanation of the results I am getting is that PyTorch currently does not fully utilize FP16 + Metal or the AMX when running on Apple Silicon. In contrast, my implementation stores the weights in 16-bit floating point (FP16) and also utilizes the AMX coprocessor through the Accelerate framework. As I mentioned in OP, the latter is very efficient at matrix multiplication; according to my experiments, it is comparable in performance to running the same multiplications on the Apple GPU via Metal.
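
A minimal sketch of the idea in C (the helper name and matrix shapes are illustrative, not my actual code; the cblas_sgemm call is the real Accelerate API):

    #include <Accelerate/Accelerate.h> // Accelerate's BLAS runs on the AMX
    #include <stdlib.h>

    // Hypothetical helper: Y = W * X, where W is an m x k FP16 weight
    // matrix (row-major) and X is a k x n FP32 activation matrix.
    // Accelerate's public CBLAS interface is FP32/FP64, so one common
    // pattern is to store the weights in FP16 (halving memory traffic)
    // and widen them to FP32 right before the GEMM.
    void fp16_matmul(const __fp16 *W, const float *X, float *Y,
                     int m, int k, int n) {
        float *W32 = malloc((size_t)m * k * sizeof(float));
        for (size_t i = 0; i < (size_t)m * k; i++) {
            W32[i] = (float)W[i]; // widen FP16 -> FP32
        }
        // Y = 1.0 * W32 * X + 0.0 * Y, dispatched by Accelerate to the AMX
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, W32, k, X, n,
                    0.0f, Y, n);
        free(W32);
    }

Compiles on macOS with: clang -O2 matmul.c -framework Accelerate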

