Important precision: the async part is absolutely not python specific, but comes...

godelski · 2024-10-17T01:13:39 1729127619

Yes, it isn't Python, it is... hardware. It's not even CUDA specific. It is about memory moving around and optimization (remember, even CPUs do speculative execution). I say a little more in the larger comment.

I'm less concerned about the CPU baseline and more concerned about the NPU timing, especially given the other issues.
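To make the timing worry concrete, here is a rough sketch of how asynchronous dispatch can skew a measurement. It is only an analogy: a thread pool and `time.sleep` stand in for an accelerator, and every name in it (`fake_device_work`, `time_dispatch_only`, `time_with_sync`) is hypothetical, not taken from the benchmark under discussion.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_device_work(seconds):
    # Hypothetical stand-in for a kernel running on an accelerator.
    time.sleep(seconds)
    return seconds


def time_dispatch_only(pool, seconds):
    # Stops the clock as soon as the work is queued, not when it finishes.
    start = time.perf_counter()
    future = pool.submit(fake_device_work, seconds)
    elapsed = time.perf_counter() - start
    future.result()  # drain the queue so it does not pollute later timings
    return elapsed


def time_with_sync(pool, seconds):
    # Waits for the result before stopping the clock.
    start = time.perf_counter()
    pool.submit(fake_device_work, seconds).result()
    return time.perf_counter() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print("dispatch only:", time_dispatch_only(pool, 0.1))
        print("with sync:    ", time_with_sync(pool, 0.1))
```

The first number only measures how long it takes to hand the work off, so it looks far faster than the work really is. Whether something like this affects the NPU numbers depends on how that benchmark synchronizes, which I can't tell from here.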