
> about 2-3 times faster compared to the current PyTorch implementation

This is surprising to me. Is this comparison CPU-only? Then it would make sense.

Also, is there a particular reason why the whole code is basically 2 massive files (3k and 8k lines respectively)?



I am not very good with Python, so there is some chance I am doing something wrong. But my explanation for the results I am getting is that PyTorch currently does not fully utilize FP16 + Metal or AMX when running on Apple Silicon. In contrast, my implementation stores the weights in 16-bit floating point (FP16) and also utilizes the AMX coprocessor through the Accelerate framework. As I mentioned in the OP, the latter is very efficient for matrix multiplications. According to my experiments, it is comparable in performance to running them on the Apple GPU via Metal.
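To make the Accelerate path concrete, here is a minimal C sketch of the general idea, not the actual code from the repo: FP16 weights are widened to FP32 and handed to cblas_sgemm, which Accelerate dispatches to the AMX coprocessor on Apple Silicon for large matrices. The helper name and the one-shot conversion are illustrative assumptions.

    // Minimal sketch (illustrative, not the project's real implementation):
    // multiply FP16 weights by an FP32 activation matrix via Accelerate BLAS.
    // Build with: clang demo.c -O2 -framework Accelerate
    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    // Hypothetical helper: C = A * B, where A holds weights stored as FP16
    // and B, C are FP32. The FP16 weights are widened once, then sgemm is called.
    void fp16_weights_matmul(const __fp16 *A_f16, const float *B, float *C,
                             int M, int N, int K) {
        float *A_f32 = malloc((size_t)M * K * sizeof(float));
        for (size_t i = 0; i < (size_t)M * K; i++) {
            A_f32[i] = (float)A_f16[i];  // widen FP16 weight to FP32 for BLAS
        }

        // Row-major GEMM: C[MxN] = 1.0 * A[MxK] * B[KxN] + 0.0 * C
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A_f32, K,
                          B,     N,
                    0.0f, C,     N);

        free(A_f32);
    }

Storing the weights in FP16 halves the memory footprint and bandwidth, while the GEMM itself still runs in FP32 through Accelerate, which is where the AMX coprocessor does the heavy lifting.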



