Does it not work for them? Where can I learn why?

tucnak · on April 9, 2024

Just go have a look around Github issues in their ROCm repositories on Github. A few months back the top excuse re: AMD was that we're not supposed to use their "consumer" cards, however the datacenter stuff is kosher. Well, guess what, we have purchased their datacenter card, MI50, and it's similarly screwed. Too many bugs in the kernel, kernel crashes, hangs, and the ROCm code is buggy / incomplete. When it works, it works for a short period of time, and yes HBM memory is kind of nice, but the whole thing is not worth it. Some say MI210 and MI300 are better, but it's just wishful thinking as all the bugs are in the software, kernel driver, and firmware. I have spent too many hours troubleshooting entry-level datacenter-grade Instinct cards with no recourse from AMD whatsoever to pay 10+ thousands for MI210 a couple-year old underpowered hardware, and MI300 is just unavailable.

Not even from cloud providers which should be telling enough.

JonChesterfield · on April 9, 2024

We absolutely hammered the MI50 in internal testing for ages. Was solid as far as I can tell.

Rocm is sensitive to matching kernel version to driver version to userspace version. Staying very much on the kernel version from a official release and using the corresponding driver is drastically more robust than optimistically mixing different components. In particular, rocm is released and tested as one large blob, and running that large blob on a slightly different kernel version can go very badly. Mixing things from GitHub with things from your package manager is also optimistic.

Imagine it as huge ball of code where cross version compatibility of pieces is totally untested.

tucnak · on April 9, 2024

I would run simple llama.cpp batch jobs for 10 minutes when it would suddenly fail, and require a restart. Random VM_L2_PROTECTION_FAULT in dmesg, something having to do with doorbells. I did report this, never heard back from them.

JonChesterfield · on April 10, 2024

Did you run on the blessed Ubuntu version with the blessed kernel version and the blessed driver version? As otherwise you really are in a development branch.

If you can point me to a repro I'll add it to my todo list. You can probably tag me in the github issue if that's where you reported it.

FeepingCreature · on April 15, 2024

At least the one I run into, which also says stuff with L2 and doorbells, is https://github.com/ROCm/ROCm/issues/2196 fwiw.

Aissen · on April 10, 2024

> blessed Ubuntu version with the blessed kernel version

To an SRE, this is a nightmare to read. Cuda is bad in this regard (can often prevent major kernel version updates), but this is worse.

latchkey · on April 10, 2024

I feel like this goes both ways. You also don't want to have to run bleeding edge for everything because there are so many bugs in things. You kind of want known stable versions to at least base yourself off of.

latchkey · on April 9, 2024

Would you like to share the model of GPU and versions of various software used?

George has a nice explanation of doorbells:

https://youtu.be/AqPIOtUkxNo?feature=shared&t=968

FeepingCreature · on April 10, 2024

Same here with SD on 7900XTX. Most of the time for me it's sufficient to reset the card with rocm-smi --gpureset -d 0.

michaelt · on April 10, 2024

Only "most of the time" ? :(

You'd hope at $15,000+ per unit, you wouldn't have to reset it at all...

JonChesterfield · on April 10, 2024

It's $1000 per, no? This is one of the gaming cards.

FeepingCreature · on April 10, 2024

Yep, bought for $1000.

At which price point to be honest, it still shouldn't be needed.

AMD are lucky everyone expects this nowadays, or people might consider sueing.

Workaccount2 · on April 9, 2024

It's seriously impressive how well AMD has been able to maintain their incredible software deficiency for over a decade now.

alexey-salmin · on April 9, 2024

They deeply care about the tradition of ATI kernel modules from 2004

sumtechguy · on April 10, 2024

more like 1998 :)

amirhirsch · on April 9, 2024

Buying Xilinx helped a lot here.

fpgamlirfanboy · on April 10, 2024

it's so true it hurts

tucnak · on April 10, 2024

Hey man have seen you around here, very knowledgeable, thanks for your input!

What's your take on projects like https://github.com/corundum/corundum I'm trying to get better at FPGA design, perhaps learn PCIe and some such but Vivado is intimidating (as opposed to Yosys/nextpnr which you seem to hate) should I just get involved with a project like this to acclimatise somewhat?

fpgamlirfanboy · on April 10, 2024

> Vivado is intimidating (as opposed to Yosys/nextpnr which you seem to hate)

i never said i hated yosys/nextpnr? i said somewhere that yosys makes the uber strange decision to use C++ as effectively a scripting language ie gluing and scheduling "passes" together - like they seemed to make the firm decision to diverge from tcl but diverged into absurd territory. i wish yosys were great because it's open source and then i could solve my own problems as they occurred. but it's not great and i doubt it ever will be because building logic synthesis, techmapping, timing analysis, place and route, etc. is just too many extremely hard problems for OSS.

all the vendor tools suck. it's just a fact that both big fpga manufacturers have completely shit software devs working on those tools. the only tools that i've heard are decent are the very expensive suites from cadence/siemenns/synopsis but i have yet to be in a place that has licenses (neither school nor day job - at least not in my team). and mind you, you will still need to feed the RTL or netlist or whatever that those tools generate into vivado (so you're still fucked).

so i don't have advice for you on RTL - i moved one level up (ISA, compilers, etc.) primarily because i could not effectively learn by myself i.e., without going to "apprentice" under someone that just has enough experience to navigate around the potholes (because fundamentally if that's what it takes to learn then you're basically working on ineluctably entrenched tech).

jmward01 · on April 9, 2024

Yeah, this has stopped me from trying anything with them. They need to lead with their consumer cards so that developers can test/build/evaluate/gain trust locally and then their enterprise offerings need to 100% guarantee that the stuff developers worked on will work in the data center. I keep hoping to see this but every time I look it isn't there. There is way more support for apple silicon out there than ROCm and that has no path to enterprise. AMD is missing the boat.

JonChesterfield · on April 9, 2024

In fairness it wasn't Apple who implemented the non-mac uses of their hardware.

AMD's driver is in your kernel, all the userspace is on GitHub. The ISA is documented. It's entirely possible to treat the ASICs as mass market subsidized floating point machines and run your own code on them.

Modulo firmware. I'm vaguely on the path to working out what's going on there. Changing that without talking to the hardware guys in real time might be rather difficult even with the code available though.

imtringued · on April 10, 2024

You are ignoring that AMD doesn't use an intermediate representation and every ROCm driver is basically compiling to a GPU specific ISA. It wouldn't surprise me that there are bugs they have fixed for one ISA that they didn't bother porting to the others. The other problem is that most likely their firmware contains classic C bugs like buffer overflows, undefined behaviour, or stuff like deadlocks.

JonChesterfield · on April 10, 2024

This is sort of true. Graphics compiles to spir-v, moves that around as deployment, then runs it through llvm to create the compiled shaders. Compute doesn't bother with spir-v (to the distress of some of our engineers) and moves llvm IR around instead. That goes through the llvm backend which does mostly the same stuff for each target machine. There probably are some bugs that were fixed on one machine and accidentally missed on another - the compiler is quite branchy - but it's nothing like as bad as a separate codebase per ISA. Nvidia has a specific ISA per card too, they just expose PTX and SASS as abstractions over it.

I haven't found the firmware source code yet - digging through confluence and perforce tries my patience and I'm supposed to be working on llvm - but I hear it's written in assembly, where one of the hurdles to open sourcing it is the assembler is proprietary. I suspect there's some common information shared with the hardware description language (tcl and verilog or whatever they're using). To the extent that turns out to be true, it'll be immune to C style undefined behaviour, but I wouldn't bet on it being free from buffer overflows.

latchkey · on April 9, 2024

You are right, AMD should do more with consumer cards, but I understand why they aren't today. It is a big ship, they've really only started changing course as of last Oct/Nov, before the release of MI300x in Dec. If you have limited resources and a whole culture to change, you have to give them time to fix that.

That said, if you're on the inside, like I am, and you talk to people at AMD (just got off two separate back to back calls with them), rest assured, they are dedicated to making this stuff work.

Part of that is to build a developer flywheel by making their top end hardware available to end users. That's where my company Hot Aisle comes into play. Something that wasn't available before outside of the HPC markets, is now going to be made available.

jmward01 · on April 9, 2024

I look forward to seeing it. NVIDIA needs real competition for their own benefit if not the market as a whole. I want a richer ecosystem where Intel, AMD, NVIDIA and other players all join in with the winner being the consumer. From a selfish point of view I also want to do more home experimentation. LLMs are so new that you can make breakthroughs without a huge team but it really helps to have hardware to make it easier to play with ideas. Consumer card memory limitations are hurting that right now.

latchkey · on April 9, 2024

> I want a richer ecosystem where Intel, AMD, NVIDIA and other players all join in with the winner being the consumer.

This is exactly the void I'm trying to fill.

Hugsun · on April 14, 2024

I can't find your company, what are you trying to do to solve this?

latchkey · on April 14, 2024

Thanks for asking, it is right there in my profile... click on my username.

tucnak · on April 9, 2024

[flagged]

latchkey · on April 9, 2024

https://news.ycombinator.com/newsguidelines.html

Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

cavisne · on April 10, 2024

Yeah, I think AMD will really struggle with the cloud providers.

Even Nvidia GPU's are tricky to sandbox, and it sounds like the AMD cards are really easy for the tenant to break (or at least force a restart of the underlying host).

AWS does have a Gaudi instance which is interesting, but overall I don't see why Azure, AWS & Google would deploy AMD or Intel GPU's at scale vs their own chips.

They need some competitor to Nvidia to help negotiate, but if its going to be a painful software support story suited to only a few enterprise customers, why not do it with your own chip?

latchkey · on April 9, 2024

[flagged]

Scaevolus · on April 9, 2024

Effectively everyone that has attempted to use AMD hardware for ML comes away with these opinions, the main difference is how angrily they express it.

latchkey · on April 9, 2024

We are the 4th non-hyperscaler business on the planet to even get access to MI300x and we just got it in early March. From what I understand, hyperscalers have had fantastic uptake of this hardware.

I find it hard to believe "everyone" comes away with these opinions.

https://www.evp.cloud/post/diving-deeper-insights-from-our-l...

sangnoir · on April 9, 2024

So who's buying all the MI300s? Groq seems to be fine with AMD.

tucnak · on April 9, 2024

Groq doesn't use AMD afaik, they had designed hardware of their own, which is actually 1000s of SRAM chips in a trench-coat.

latchkey · on April 9, 2024

Speaking of SRAM, I found this relevant comment insightful:

https://news.ycombinator.com/item?id=39966620

latchkey · on April 9, 2024

> So who's buying all the MI300s?

We are. Just closed additional funding and getting quotes now.

buildbot · on April 9, 2024

Probably not, I have had the same experience with a 780m and a Mi60...