It's definitely not a CUDA advantage. If PyTorch, FlashAttention, and Triton were well supported on other hardware, a huge chunk of customers wouldn't care, as long as it meant cost savings. Case in point: Google's TPUs saw extensive use outside Google when they were cheaper for the same performance. Now that's no longer the case.
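To make the point concrete, here's a minimal sketch (assuming a standard PyTorch install) of why the lock-in sits in the kernel/back-end layer rather than in user code: well-written PyTorch is device-agnostic, so swapping hardware is mostly a device-string change.

```python
import torch
import torch.nn.functional as F

# Pick whatever accelerator is available; "cuda" covers NVIDIA,
# "cpu" is the fallback, and TPUs would be reached via torch_xla.
device = "cuda" if torch.cuda.is_available() else "cpu"

# (batch, heads, sequence, head_dim) attention inputs
q = torch.randn(1, 8, 128, 64, device=device)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

# On CUDA this call can dispatch to a FlashAttention kernel; on other
# back-ends it falls back to a reference implementation -- the user
# code is identical either way.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)
```

If every back-end vendor ships solid kernels behind this dispatch layer, the end user never touches CUDA directly, which is the crux of the argument above.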