
> There’s a [Radeon RX 7900 XTX 24GB] already on the market. For $999, you get a 123 TFLOP card with 24 GB of 960 GB/s RAM. This is the best FLOPS per dollar today, and yet…nobody in ML uses it.

> I promise it’s better than the chip you taped out! It has 58B transistors on TSMC N5, and it’s like the 20th generation chip made by the company, 3rd in this series. Why are you so arrogant that you think you can make a better chip? And then, if no one uses this one, why would they use yours?

> So why does no one use it? The software is terrible!

> Forget all that software. The RDNA3 Instruction Set is well documented. The hardware is great. We are going to write our own software.

So why not just fix AMD accelerators in pytorch? Both ROCm and pytorch are open sourced. Isn't the point of the OSS community to use the community to solve problems? Shouldn't this be the killer advantage over CUDA? Making a new library doesn't democratize access to that 123 TFLOPS (FP16) accelerator. Fix pytorch and suddenly all the existing code has access to these accelerators, and millions of people get cheap compute. That puts significant pressure on Nvidia, because they can no longer corner the DL market. But it's a catch-22: the DL market is already mostly Nvidia, so Nvidia gets the priority. Isn't this EXACTLY where OSS is supposed to help? I get that Hotz wants to make money, and there's nothing wrong with that (it also complements his other company), but the arguments here read more like arguments for fixing ROCm, and specifically the pytorch implementation.

The mission is great, but AMD is in a much better position to compete with Nvidia. They've (mostly) caught up in the gaming market, but they have a long way to go for scientific work (which is where Nvidia is shifting its focus). This is realistically the only way to drive GPU prices down. Intel tried their hand (including in supercomputers) but failed too. I have to think there's a reason, not obvious to most of us, why this keeps happening.

Note 1:

I will add that supercomputers like Frontier (the current #1) do use AMD GPUs, and a lot of the hope has been that this will fund the optimization from two directions: 1) the DOE optimizing their own code, because that's the machine they have access to, and 2) AMD using the contract money to hire more devs. But this doesn't seem to be happening fast enough (I know some grad students working on ROCm).

Note 2:

There's a clear difference in how AMD and Nvidia measure TFLOPS. techpowerup shows AMD at 2-3x Nvidia, but performance is similar. Either AMD is crazy underutilized or something is wrong. Does anyone know the answer?



I know a fair amount about this problem; my last startup built a working prototype of a performance-portable deep learning framework that got good performance out of AMD cards. The compiler stack is way harder than most people appreciate, because scheduling operations for GPUs is very specific to the workload, the hardware, and the application's constraints. The two strongest companies I'm aware of working in this area now are Modular.AI and OctoML. On the new-chip side, Cerebras and Tenstorrent both look quite interesting. It's pretty hard to really beat NVIDIA for developer support though, they've invested a lot of work into the CUDA ecosystem over the years and it shows.


This. Modular and OctoML are building on top of MLIR and TVM respectively.

> It's pretty hard to really beat NVIDIA for developer support though, they've invested a lot of work into the CUDA ecosystem over the years and it shows.

Yup, strong CUDA community and dev support. That said, more ergonomic domain-specific languages like Mojo might finally give CUDA some competition - it's still a very high bar for sure.


There's also OpenAI Triton. People seem to miss that OpenAI is not using CUDA...


Yeah, also see AMD engineers working on Triton support here: https://github.com/openai/triton/issues/46


Triton outputs PTX which still requires CUDA to be installed.


Sure, but the point is that Triton doesn't depend on the CUDA language or frontend. Triton emits PTX, but it does so through LLVM's NVPTX backend. The devil is in the details, but at a very high level, Triton could be ported to AMD by doing s/NVPTX/AMDGPU/. Given this, people should think again when they say NVIDIA has a CUDA moat.
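
To make that concrete, here's roughly what a Triton kernel looks like (a minimal vector-add sketch in the style of the official Triton tutorial; exact APIs can shift between releases). There's no CUDA C++ anywhere - the kernel is Python that Triton JIT-compiles and lowers through LLVM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                            # which block this program instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices for this block
        mask = offsets < n_elements                            # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    assert torch.allclose(out, x + y)

The backend that turns this into PTX (or, in principle, AMDGPU ISA) sits underneath, and the kernel author never touches it.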


I thought this was a good overview of the idea that Triton can circumvent the CUDA moat: https://www.semianalysis.com/p/nvidiaopenaitritonpytorch

It also looks like they added an MLIR backend to Triton, though I wonder if Mojo has an advantage since it was designed with MLIR in mind? https://github.com/openai/triton/pull/1004


I hadn't looked at Triton before; I took a quick look at it and at how it's getting used in PyTorch 2. My read is it really lowers the barrier to doing new hardware ports - I think a team of around five people within a chip vendor could maintain a high-quality port of PyTorch for a non-NVIDIA platform. That's less than it used to be, very cool. The approach would not be to use any of the PTX stuff, but to bolt on support for, say, the vendor's supported flavor of Vulkan.
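
For reference, this is the integration point I mean: in PyTorch 2, torch.compile lowers ordinary model code through TorchInductor, which emits Triton kernels on NVIDIA GPUs today, and a vendor backend would plug into that same entry point. A minimal sketch (assumes a CUDA build; the tolerance just absorbs compiler numerics):

    import torch

    def gelu_mm(x, y):
        # A small fusion-friendly graph: matmul followed by an elementwise activation
        return torch.nn.functional.gelu(x @ y)

    compiled = torch.compile(gelu_mm)  # TorchInductor is the default backend

    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")
    print(torch.allclose(gelu_mm(a, b), compiled(a, b), atol=1e-3, rtol=1e-3))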


This seems pretty reasonable and matches my suspicions. It is not hard for me to believe that CUDA has a lot of momentum behind it, not just in users but in optimization and development. And thanks, I'll look more at Octo. As for Modular, aren't they CPU-only right now? I'm not impressed by their results, as their edge over PyTorch isn't strong, especially at scale. A big reason this is surprising to me is simply how much faster numpy functions are than torch for workloads like this. Just speed-test np.sqrt(np.random.random((256, 1024))) vs torch.sqrt(torch.rand(256, 1024)). Hell, np.sqrt(x) is also a lot slower than math.sqrt(x) for a scalar x. It just seems like there's a lot of room left for optimization, but I'm sure there are trade-offs.
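
A rough sketch of the kind of microbenchmark I mean (CPU-side, modest array, so per-call framework overhead dominates; absolute numbers will vary by machine and build):

    import math
    import timeit

    import numpy as np
    import torch

    x_np = np.random.random((256, 1024))
    x_t = torch.rand(256, 1024)

    # Same elementwise op, different per-call overhead
    print("np.sqrt    :", timeit.timeit(lambda: np.sqrt(x_np), number=1000))
    print("torch.sqrt :", timeit.timeit(lambda: torch.sqrt(x_t), number=1000))

    # Scalar case: math.sqrt skips the array machinery entirely
    print("math.sqrt  :", timeit.timeit(lambda: math.sqrt(3.14), number=100000))
    print("np scalar  :", timeit.timeit(lambda: np.sqrt(3.14), number=100000))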

When we're presented with problems where the two potential answers are "it's a lot harder than it looks" and "the people working on it are idiots", I tend to lean towards the former. But hey, when it is the latter, there's usually a good market opportunity. It's just that I've found domain expertise is largely about seeing the nuance you miss when looking from 10,000 ft.


First you have to figure out what problem to attack. Research, training production models, and production inference all have very different needs on the software side. Then you have to work out what the decision tree is for your customers (which depends on who you are in this equation) and how you can solve some important problem for them. In all of this, for, say, training a big transformer, numpy isn't going to help you much, so it doesn't matter if it's faster for some small cases.

If you want to support a lot of model flexibility (for research and maybe training), then you need some combination of hand-writing chip-specific kernels and building a compiler that can do some or most of that automatically. Behind that door is a whole world of hardware-specific scheduling models, polyhedral optimization, horizontal and vertical fusion, sparsity, etc, etc, etc. It's a big and sustained engineering effort, not within the reach of hobby developers, so you come back to the question of who is paying for all this work and why.

Nvidia has clarity there and some answers that are working. Historically AMD has operated on the theory that deep learning is too early/small to matter, and that for big HPC deployments they can hand-craft whatever tools they need for those specific contracts (this is why ROCm seems so broken for normal people). Google built TensorFlow, XLA, Jax, etc. for their own workloads, and the priorities reflect that (e.g. TPU support). For a long time the great majority of inference workloads ran on Intel CPUs, so Intel's software reflected that. Not sure what tiny corp's bet here is going to be.
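
To give a flavor of what the "vertical fusion" above buys you, here's a CPU-side illustration using numexpr (purely illustrative; a deep learning compiler does the analogous thing by emitting one GPU kernel instead of three):

    import numpy as np
    import numexpr as ne

    x = np.random.random(10_000_000)
    y = np.random.random(10_000_000)

    # Unfused: three separate passes over memory (two sqrts, one add),
    # plus two full-size temporaries.
    unfused = np.sqrt(x) + np.sqrt(y)

    # Fused: the whole expression is evaluated in one blocked pass with no
    # materialized intermediates - the same idea a DL compiler applies when
    # it fuses elementwise ops into a single kernel.
    fused = ne.evaluate("sqrt(x) + sqrt(y)")

    assert np.allclose(unfused, fused)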

The change in the landscape I see now is that the models are big enough and useful enough that the commercial appetite for inference is expanding rapidly, hardware supply will continue to be constrained, and so tools that can reduce production inference cost by a percentage are becoming a straightforward sale (and thus justify the infrastructure investment). This isn't based on any inside info, but when I look at companies like Modular and Octo, that's a big part of why I think they will probably have some success.


> So why not just fix AMD accelerators in pytorch? Both ROCm and pytorch are open sourced. Isn't the point of the OSS community to use the community to solve problems?

Because there's no real evidence that AMD cares about this problem, and without them caring, your efforts may well be replaced by whatever AMD does next in the space. Their Brook+ language[1] is abandoned, OpenCL doesn't compare well, and ROCm is like the SharePoint of GPU APIs (it ticks boxes but doesn't actually work very well).

> So why not just fix AMD accelerators in pytorch

Why not just buy NVidia? They care deeply about the space, will actually help you if you have trouble, etc etc.

Even using Google TPUs is better: Google will help you too.

While everyone using NVidia isn't great for the market as a whole, as an individual company or person it makes a lot of sense.

Read "The Red Team (AMD)" section in the linked article:

> The software is called ROCm, it’s open source, and supposedly it works with PyTorch. Though I’ve tried 3 times in the last couple years to build it, and every time it didn’t build out of the box, I struggled to fix it, got it built, and it either segfaulted or returned the wrong answer. In comparison, I have probably built CUDA PyTorch 10 times and never had a single issue.

This is geohot. He knows how to build software, and how to fix problems.

Note that "Our short term goal is to get AMD on MLPerf using the tinygrad framework."

> There's a clear difference in how AMD and Nvidia measure TFLOPS. techpowerup shows AMD at 2-3x Nvidia, but performance is similar. Either AMD is crazy underutilized or something is wrong. Does anyone know the answer?

From the linked article:

> That’s the kernel space, the user space isn’t better. The compiler is so bad that clpeak only gets half the max possible FLOPS. And clpeak is a completely contrived workload attempting to maximize FLOPS, never mind how many FLOPS you get on a real program

[1] https://en.wikipedia.org/wiki/BrookGPU


> This is geohot. He knows how to build software, and how to fix problems.

This is a non sequitur. It feels like when my uncle learns that I know how to program and asks me to build a website. These are two different things. I do ML and scientific computing; I'm not your guy. Hotz is a wiz kid but why should we expect his talents to be universal? Generalists don't exist.

And we're talking about the guy who tweeted that he believes the integers and the reals have the same cardinality, right? Between that and his tweets on quantum mechanics, we have strong evidence that his jailbreaking skills don't translate to math or physics.

He's clearly good at what he does. There's no doubt about that. But why should I believe that his skills translate to other domains?

STOP MAKING GODS OUT OF MEN. Seriously, can we stop this? What does stanning accomplish? It's creepy. It's creepy whether it's BTS, Bieber, Elon, Robert Downey Jr, or Hotz.

> Read "The Red Team (AMD)" section in the linked article:

Clearly I did; I quoted from it. You quoted from the next section ("So why does no one use it?").


geohot wrote tinygrad. This is not about believing his skills translate to other domains. It is his domain.

You definitely shouldn't trust what geohot says about infinitary mathematics or (god forbid) quantum mechanics. On the other hand, you generally should trust what he says about the machine learning software stack.


Tinygrad isn't a big selling point. I'd expect most people to be able to build something similar after watching Karpathy's micrograd tutorial. Tinygrad doesn't imply expertise in ML, and it similarly doesn't imply expertise in accelerator programming. I wouldn't expect a front-end developer to understand template metaprogramming, and I wouldn't expect an engineer who writes acoustic simulations to be good at front end. You act like there are actually fullstack developers and not just people who do both poorly.
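
To be concrete about how small that core is, here's a minimal sketch of a micrograd-style scalar autograd engine (illustrative only; add and mul, reverse-mode accumulation over a topological order):

    class Value:
        """A tiny scalar autograd node, in the spirit of micrograd."""
        def __init__(self, data, _parents=()):
            self.data = data
            self.grad = 0.0
            self._parents = _parents
            self._backward = lambda: None

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            out = Value(self.data + other.data, (self, other))
            def _backward():
                self.grad += out.grad
                other.grad += out.grad
            out._backward = _backward
            return out

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            out = Value(self.data * other.data, (self, other))
            def _backward():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._backward = _backward
            return out

        def backward(self):
            # Build a topological order, then run the chain rule in reverse
            topo, seen = [], set()
            def build(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        build(p)
                    topo.append(v)
            build(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()

    x, y = Value(3.0), Value(4.0)
    z = x * y + x
    z.backward()
    print(x.grad, y.grad)  # 5.0 3.0

The autograd bookkeeping is the easy part; the hard part is the accelerator side, which is the next paragraph's point.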

This project isn't even primarily about skill in ML, which is where the misunderstanding lies. The project requires writing accelerator code. Go learn CUDA and tell me how different it is. It isn't something you're going to pick up in a weekend, or a month, and realistically not even a year. A lot of people can write kernels; not a lot of people can do it well.


> You act like there are actually fullstack developers and not just people who do both poorly.

If you haven't worked with someone who's smarter and more motivated than you are, then I can see how you'd draw that conclusion, but if you have, then you'd know that there are full stack developers out there who do both better than you. It's humbling to code in their repos. I've never worked with geohot so I don't know if he is such a person, but they're out there.


> Hotz is a wiz kid but why should we expect his talents to be universal?

No, of course not. But this is literally his field of expertise, and there are plenty of reasons to think he knows what he is doing. Specifically, the combination of reverse engineering and writing ML libraries means I'd certainly expect he's had reasonable experience compiling things.


It's often less work to start from scratch than to fix an extremely complex broken stack. Of course people also say this when they just want to start from scratch.

RDNA 3 has dual-issue that basically isn't used by the compiler, so half the FPUs sit idle.
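
That dual-issue factor is exactly where the headline number comes from. A back-of-the-envelope check, using approximate public spec values for the 7900 XTX (treat the exact clock as an assumption):

    shaders = 6144      # stream processors (96 CUs x 64)
    fma = 2             # FLOPs per fused multiply-add
    dual_issue = 2      # RDNA 3 VOPD dual-issue - counted in the spec, rarely hit by the compiler
    clock_ghz = 2.5     # roughly the boost clock

    fp32_tflops = shaders * fma * dual_issue * clock_ghz / 1000
    fp16_tflops = 2 * fp32_tflops   # packed FP16 doubles throughput
    print(fp32_tflops, fp16_tflops)  # ~61 and ~123, the marketing numbers

Drop the dual-issue factor and you're back to roughly half of that, which is why the quoted TFLOPS look so out of line with real-world performance.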


> So why not just fix AMD accelerators in pytorch?

It doesn't fit the business model. I mean sure, they'll sell AMD computers now, like a bootleg Puget Systems. But why buy from the bootleg when I can just buy the real thing (or use AWS) and run tinygrad on it if I want?

So the play is, get people using your framework (tinygrad), then pivot to making AI chips for it:

> In the limit, it’s a chip company, but there’s a lot of intermediates along the way.

Seems far-fetched, but good luck to them.


Nvidia has stuff like hardware sparsity support. Modern methods (RigL) can let you train sparse for a 2X speedup.

Memory bandwidth (sparsity helps there too) and networking (Nvidia bought Mellanox and other networking companies) are important as well. Consumer cards also spend a lot of die space on raytracing hardware, which the datacenter versions presumably don't carry.


The Intel Arc A770 16 GB at $349 has about 40 TFLOPS of FP16, which is close to the 7900 XTX on the FLOPS/$ scale.

Intel's software is much better (MKL, VTune, etc. for GPU) and getting better.



