I know a fair amount about this problem: my last startup built a working prototype of a performance-portable deep learning framework that got good performance out of AMD cards. The compiler stack is way harder than most people appreciate, because scheduling operations for GPUs is very specific to the workload, hardware, and app constraints. The two strongest companies I'm aware of working in this area now are Modular.AI and OctoML. On the new-chip side, Cerebras and Tenstorrent both look quite interesting. It's pretty hard to really beat NVIDIA on developer support, though; they've invested a lot of work into the CUDA ecosystem over the years and it shows.
This. Modular and OctoML are building on top of MLIR and TVM respectively.
> It's pretty hard to really beat NVIDIA for developer support though, they've invested a lot of work into the CUDA ecosystem over the years and it shows.
Yup, strong CUDA community and dev support. That said, more ergonomic domain-specific languages like Mojo might finally give CUDA some competition - it's still a very high bar for sure.
Sure, but the point is that Triton is not dependent on the CUDA language or frontend. Triton also outputs PTX using LLVM's NVPTX backend. The devil is in the details, but at a very high level Triton could be ported to AMD by doing s/NVPTX/AMDGPU/. Given this, people should think again when they say NVIDIA has a CUDA moat.
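For anyone who hasn't looked at it, here's roughly what a minimal Triton kernel looks like (the standard vector-add from the tutorials; the block size is arbitrary). It's plain Python that Triton lowers through its own compiler and LLVM - there's no CUDA C++ anywhere in the source:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                            # which block this program instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices for this block
        mask = offsets < n_elements                            # guard the tail of the array
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        # x and y need to live on the GPU for the kernel launch
        out = torch.empty_like(x)
        n = x.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

The CUDA dependency lives in the codegen backend, not in what users write, which is why the NVPTX -> AMDGPU swap is at least conceptually clean.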
It also looks like they added an MLIR backend to Triton, though I wonder if Mojo has advantages since it was designed with MLIR in mind? https://github.com/openai/triton/pull/1004
I hadn't looked at Triton before, so I took a quick look at it and how it's getting used in PyTorch 2. My read is that it really lowers the barrier to doing new hardware ports; I think a team of around five people within a chip vendor could maintain a high-quality port of PyTorch for a non-NVIDIA platform. That's less than it used to be, very cool. The approach would not be to use any of the PTX stuff, but to bolt on support for, say, the vendor's supported flavor of Vulkan.
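To illustrate the hook I mean (this is a toy, not a real port): torch.compile hands the captured FX graph to whatever backend callable you give it, and that's roughly where a vendor's compiler would slot in. The backend name and ops here are just for illustration:

    import torch

    # Toy backend: a real port would lower the FX graph through the vendor's
    # compiler (Triton, Vulkan/SPIR-V, whatever); here we just print the graph
    # and fall back to eager execution.
    def toy_vendor_backend(gm: torch.fx.GraphModule, example_inputs):
        print(gm.graph)      # the captured ops a port would need to cover
        return gm.forward    # unoptimized fallback

    @torch.compile(backend=toy_vendor_backend)
    def f(x):
        return torch.sin(x) + torch.cos(x)

    f(torch.randn(8))

The hard part obviously isn't this plumbing, it's generating good kernels on the other side of it, but the integration surface is much smaller than it used to be.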
This seems pretty reasonable and matches my suspicions. It is not hard for me to believe that CUDA has a lot of momentum behind it, not just in users, but in optimization and development. And thanks, I'll look more at Octo. As for Modular, aren't they CPU-only right now? I'm not impressed by their results, as their edge over PyTorch isn't strong, especially at scale. A big reason this is surprising to me is simply how much faster numpy functions are than torch. Like just speed test np.sqrt(np.random.random((256, 1024))) vs torch.sqrt(torch.rand(256, 1024)). Hell, for a scalar, np.sqrt(x) is also a lot slower than math.sqrt(x). It just seems like there's a lot of room for optimization, but I'm sure there are costs.
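If anyone wants to poke at that comparison themselves, here's a rough sketch of the speed test I mean (CPU only; the numbers depend heavily on build, threading, and array size, so treat it as a starting point rather than a result):

    import timeit
    import math
    import numpy as np
    import torch

    x_np = np.random.random((256, 1024))
    x_t = torch.rand(256, 1024)

    # elementwise sqrt over the same-sized array, 1000 repetitions each
    print("numpy :", timeit.timeit(lambda: np.sqrt(x_np), number=1000))
    print("torch :", timeit.timeit(lambda: torch.sqrt(x_t), number=1000))

    # scalar case: math.sqrt skips numpy's ufunc dispatch overhead entirely
    print("np scalar  :", timeit.timeit(lambda: np.sqrt(2.0), number=100_000))
    print("math scalar:", timeit.timeit(lambda: math.sqrt(2.0), number=100_000))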
When we're presented with problems where the two potential answers are "it's a lot harder than it looks" and "the people working on it are idiots", I tend to lean towards the former. But hey, when it is the latter there's usually a good market opportunity. It's just that I've found domain expertise is mostly about seeing the nuance you miss when looking from 10k ft.
First you have to figure out what problem to attack. Research, training production models, and production inference all have very different needs on the software side. Then you have to work out what the decision tree is for your customers (which depends on who you are in this equation) and how you can solve some important problem for them. In all of this, for say training a big transformer, numpy isn't going to help you much, so it doesn't matter if it's faster for some small cases.

If you want to support a lot of model flexibility (for research and maybe training) then you need to do some combination of hand-writing chip-specific kernels and building a compiler that can do some or most of that automatically. Behind that door is a whole world of hardware-specific scheduling models, polyhedral optimization, horizontal and vertical fusion, sparsity, etc, etc, etc. It's a big and sustained engineering effort, not within the reach of hobby developers, so you come back to the question of who is paying for all this work and why.

Nvidia has clarity there and some answers that are working. Historically AMD has operated on the theory that deep learning is too early/small to matter, and that for big HPC deployments they can hand-craft whatever tools they need for those specific contracts (this is why ROCm seems so broken for normal people). Google built TensorFlow, XLA, Jax, etc. for their own workloads and the priorities reflect that (e.g. TPU support). For a long time the great majority of inference workloads ran on Intel CPUs, so Intel's software reflected that. Not sure what tiny corp's bet here is going to be.
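As a tiny concrete example of the vertical fusion piece (the API is real, the op chain and shapes are just illustrative): PyTorch 2's compiler captures a chain of elementwise ops and emits them as one fused kernel (via Triton on NVIDIA GPUs, generated C++ on CPU) instead of three separate memory-bound passes:

    import torch

    def f(x):
        # three elementwise ops that a fusing compiler can turn into one kernel
        return torch.relu(x * 2.0 + 1.0)

    compiled_f = torch.compile(f)   # TorchInductor fuses the chain
    x = torch.randn(1024, 1024)
    print(torch.allclose(f(x), compiled_f(x)))

That's the easy end of the spectrum; the sustained engineering effort is doing this well across fusion patterns, layouts, and a specific chip's scheduling constraints.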
The change in the landscape I see now is that the models are big enough and useful enough that the commercial appetite for inference is expanding rapidly, hardware supply will continue to be constrained, and so tools that can cut production inference cost by some percentage are becoming a straightforward sale (and thus justify the infrastructure investment). This is not based on any inside info, but when I look at companies like Modular and Octo that's a big part of why I think they'll probably have some success.