What Is AMD ROCm? (threedots.ovh)
99 points by todsacerdoti on Nov 25, 2021 | 56 comments



I tried ROCm. I bought a supported card (RX570/RX580 series). Within 12 months, AMD dropped support. Newer versions of ROCm didn't work with the card. Older versions didn't actually work either, since all other tooling assumed newer versions. Dependency hell. When things kinda started working in one context, where I could use old tooling (not the one I wanted to use ROCm in), CuPy was slower than the CPU, and then hard crashed my computer randomly. I read a web page claiming the card can either act as a HIP compute device or as a graphics card, but not both at the same time. I have no idea if that's right, but if it is, it's dumb.
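
For context, here's the kind of comparison I mean, as a minimal sketch; it assumes a working CuPy install and nothing beyond numpy:

    # Rough CPU-vs-GPU matmul timing; a sanity check, not a real benchmark.
    import time
    import numpy as np
    import cupy as cp

    a_cpu = np.random.rand(4096, 4096).astype(np.float32)
    a_gpu = cp.asarray(a_cpu)

    t0 = time.perf_counter()
    np.matmul(a_cpu, a_cpu)
    print(f"numpy (CPU): {time.perf_counter() - t0:.3f}s")

    cp.matmul(a_gpu, a_gpu)            # warm-up; first call compiles kernels
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    cp.matmul(a_gpu, a_gpu)
    cp.cuda.Device().synchronize()     # GPU calls are async; wait before timing
    print(f"cupy (GPU): {time.perf_counter() - t0:.3f}s")

On working hardware the GPU side should win easily at this size; on that card it didn't.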

AMD had no support. Card maker said this didn't fall under warranty. I got burned over and over.

I bought NVidia. It just worked.

I'm working on a potentially major piece of infrastructure, and AMD support is accumulating debt. If it had worked out of the gate, I imagine we would have kept support. Within 6 more months, we'll be NVidia-specific. AMD will be that much further in the hole for support.

I'd love for ROCm to win, since I think open is critical here. On the other hand, I can't imagine it will. AMD would need to run this as a loss leader for a while, and engineer this at a level to get this competitive with NVidia.

A half-baked product like ROCm seems like a money hole for everyone involved. Customers get burned, and I can't imagine AMD comes out positive.

In the meantime, NVidia is minting gold here.


Yep. This is the same lesson that I learned in 2013 and again in 2015 with OpenCL, both for professional apps and for programming. On the professional app side, there was a lot of "support" for OpenCL that had caveats severe enough to make it unusable. Sure, blender supports OpenCL! You just have to make a custom build and then the GPU target is slower than the CPU target. Sure, Adobe apps support OpenCL -- but any GPU render contexts are solid black boxes and forum posts indicate it has been that way for a year and nobody cared to fix it. Same thing with programming and debugging: there were slides suggesting feature parity with CUDA on a bunch of fronts that just weren't implemented or locked up the computer if you tried to actually use them. It got so bad that I walked away from my sunk costs, sold my AMD cards, ate the ebay tax, ate the nvidia tax, and bought cards that actually worked.

Now I'm in "twice bitten, once shy" mode with AMD. I hate paying the green tax as much as the next guy and I desperately want to have a second source of professional GPUs, but I'm not going to be the guinea pig. Not again. Not for the 3rd time. I want to see someone else successfully using AMD cards for common ML workflows and for blender before I even consider risking it again.


I'm not nearly as invested as you, but for my first "real" GPU compute project of any significant size and impact, I shocked my colleagues and picked OpenCL. All our hardware is nVidia, but I thought I'd make an effort to fight that vendor lock-in. And I find OpenCL quite pleasant! But… my god. OpenCL is a second-class citizen (at best!) on all three of the major platforms. The situation is dire. But the solution can't be to leave the world to CUDA.


> AMD would need to run this as a loss leader for a while, and engineer this at a level to get this competitive with NVidia.

Their problem has been that as little as five years ago they were a dying company. They didn't have the resources to do this right.

That's no longer the case, but once you have the money there is still a lag between then and when the release funded by that money comes out. And even then they're fighting an uphill battle against the perceptions created during their dark age.

Probably the biggest thing they have working for them is Nvidia's behavior. Proprietary everything and single vendor lock in makes everybody chafe, so as soon as they can produce something usable, everyone will want to use it.


I'd take "usable." If AMD was half the performance of NVidia, but open, stable, compatible, and robust, that would reach that bar for me, and I think for a lot of people.

I think that will be an increasingly hard bar to clear, as software becomes coupled to CUDA, though. AMD will be chasing a losing race. They won once with Intel, but this one feels harder....


Yeah, Nvidia is so far ahead at this point that I wouldn't really risk it on anything else. The problem is that all that troubleshooting adds up fast, and the whole DL golden years were built on CUDA. TensorFlow and PyTorch both "support" ROCm and HIP, but you run into weird issues very often. A lot of public repositories for recent architectures also come with their own CUDA kernels that you need to compile, so the vendor lock-in is very strong in my opinion.
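
For example, a quick way to tell what a given PyTorch build actually targets; a minimal sketch, relying on the fact that ROCm builds set torch.version.hip and expose HIP through the CUDA API surface:

    import torch

    # On ROCm/HIP builds torch.version.hip is a version string and
    # torch.version.cuda is None; on CUDA builds it's the reverse.
    print("CUDA version:", torch.version.cuda)
    print("HIP version:", torch.version.hip)

    # torch.cuda.is_available() returns True on ROCm too, since the HIP
    # backend masquerades as CUDA; part of why "support" gets murky.
    print("GPU available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))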

And even if AMD's offering were not an absolute dumpster fire, Google, Microsoft and Amazon all have their own accelerators that are maturing and will be more cost-effective in the long run.


Intel has done pretty much the same thing with their Xeon Phi series. Promising on paper, but without long-term support and dedication from the mothership, such efforts are doomed to fail. NVidia really gets it: people care about results, not necessarily how they get there or whether it's done in the most 'pure' way possible; if it 'just works', that's good enough.


Khronos's SYCL is the open alternative to CUDA. It's what Intel is using for their oneAPI; there's a CUDA backend and even a HIP backend.


It is not a real alternative, as it lacks the polyglot ecosystem and tooling from CUDA.

It is the usual Khronos pattern: define the base stuff and hope for the best regarding their partners.


Nice try, but there's still no way to compare with CUDA. Nvidia is so far ahead that it's hopeless to catch up.


It's something that has very little uptake because it's not supported on mainline GPUs?

I want to use it for compute on something like an RX 6800, and to my knowledge I can't.


I was going to post exactly the same. And if you look at the GitHub issues of their project, you will see that very often it looks like outsourced support teams comment on these issues with the standard "we will discuss this internally and get back to you" kind of response that enterprise Twitter support usually gives.

Not really what you expect from quality engineering. At the end of the day, these kinds of companies don't understand the value of developer and engineering clients as customers.

It's unfortunate really.

EDIT: here's an example:

   ROCmSupport commented on Feb 22 •

   Hi @powderluv
   Thanks for reaching us. I can not comment on RDNA2  support right now.
   We are working on adding a few more new hardware into ROCm environment.
   Please stay tuned via our documentation.
   Thank you.
   
   @ROCmSupport ROCmSupport closed this on Feb 22 

https://github.com/RadeonOpenCompute/ROCm/issues/1390#issuec...


Sounds like a race among support staff to see how many tickets they can close, to make themselves look good for the PM at the next review.


On the other hand, AMD has a fraction of the engineering resources Intel and Nvidia have. They need to make choices, and looking back at the last few years, it seems their choice to focus their efforts on hardware and gaming paid off.


ATI/AMD has always had finicky drivers and engineering decisions IMO.

I guess on the plus side they at least have a more open driver than NVidia (AFAIK nouveau doesn't get any support from them, at least AMD tries to maintain their open source driver on some level.)

And yet, every time I've tried an ATI/AMD Card, the driver experience even in windows has been pretty off-putting, and while I suppose we are finally at a point where one is less likely to be impacted by their issues with 768p overscan on TVs, I wonder what zany quirk they'll come up with next.


> AMD tries to maintain their open source driver on some level.

I think you somehow mistyped "AMD has open source drivers of absofuckinglutely excellent quality supporting hardware of the last ten years or so". No really, they are great. For graphics, that is.


On the flip side, I think the fact that they haven't focused on a compelling compute story means that anyone doing anything other than pure gaming is better served by an Nvidia card.


Yeah, I bought an AMD card a few years ago, when they released the new architecture, to support them. I ended up grabbing an Nvidia card because I don't play many games but I want to be able to run TensorFlow etc., and after a year AMD still had little support for any machine learning.


This is why I'm excited about Intel getting into the market.


That's not a valid reason to close a ticket. You close a ticket because the matter is resolved. We work with an internal rule with respect to comments on documents: if you open it, the default is that you are going to be the one to close it, when you are satisfied your concerns have been addressed. This sometimes gets overruled, but that definitely isn't the norm.


It's not officially supported, but I think it would work if you installed the official ROCm 4.5 packages. The RX 6800 is listed as gfx1030 [1], which has been shipping in most libraries since ROCm 4.2. I've heard there were a few bugs, but I've been using it for months without encountering any issues myself.
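
If you want to verify what gfx target your install actually detects, something like this works; a sketch assuming the rocminfo utility that ships with ROCm is on your PATH:

    # Print the gfx ISA targets ROCm reports; an RX 6800 should show gfx1030.
    import subprocess

    out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    targets = sorted({tok for line in out.splitlines()
                      for tok in line.split() if tok.startswith("gfx")})
    print("Detected targets:", targets)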

(I work for AMD on ROCm. All opinions are my own.)

[1]: https://llvm.org/docs/AMDGPUUsage.html#processors


Can you please emphasise to your management chain how important less terrible support and developer relations vis-à-vis GitHub are. Closing support questions is one thing, but basically anything in these repos gets closed as fast as possible, even feature requests and other things that should be left open.

I doubt they have the funding to meaningfully impact the overall hardware and software support matrix, but it would go a long way if they could just make the GitHub repos feel less like my days working at a call centre, raising tickets to a second-level support team in a foreign country whose only business KPI was tickets closed per day.


I agree with you. The communication between AMD and the community has been less than ideal.

I think it's worth noting, though, that it's not always as bad as the example in the sibling comment. The RadeonOpenCompute/ROCm repo catches a lot of questions about big features and the future direction of the project. Those are particularly difficult to answer as an engineer. As much as I'd like to, I can't make a product announcement in a GitHub issue.

If you have a specific technical problem and you open an issue on the repo for the corresponding component, you'll probably have a better experience. Some teams are more responsive than others, but that will at least maximize your odds of successful resolution.


It's worse because it has no intermediate state: there is no guarantee of forward compatibility (backward compatibility is also kinda broken). Shipping anything with HIP will be a pain.

With CUDA you simply target a specific CUDA version and there is full forward and backwards compatibility on any hardware that supports that version.
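
Concretely, these are the two facts you pin against on the CUDA side, and they stay meaningful across driver and hardware updates (a minimal sketch):

    import torch

    # The CUDA toolkit version this PyTorch build was compiled against.
    print("Built for CUDA:", torch.version.cuda)

    # The device's compute capability, e.g. (8, 6) for Ampere; binaries
    # built for an older capability keep running on newer hardware.
    if torch.cuda.is_available():
        print("Device capability:", torch.cuda.get_device_capability(0))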


AMD is much smaller in comparison, and their main focus with ROCm is to get PyTorch and TensorFlow to work with enterprise GPUs. Everything else is long tail in terms of scale.


This doesn't mean anything to potential customers. If I were in the market for GPU compute, Nvidia would be the only real option. AMD has had some PR person commenting on GitHub that they have "big news coming very soon" for Navi users, but it's been over a year now and still no news.


Sadly this is why I (and many others) keep begrudgingly choosing Nvidia cards instead.

Once the developers are familiar with CUDA, what are the chances you'd choose ROCm for deployment? Yeah, not great.

Consumer cards' ROCm support is strategic.


If this is the goal, they are better off working with Google on MLIR, since they are also focusing on hardware acceleration of PyTorch and TensorFlow code.


AMD's support - not just for new or old GPUs, but in terms of all sorts of compatibility changes, usability parity and regressions - is staggeringly bad.

Given that compute is important, and many people use their GPUs for compute of some sort, I simply can not understand how that part is so poorly executed on the part of AMD that, in terms of actual application, you might call it entirely absent.

I mean, this is a company that produces compute cards, which supposedly someone in the world must buy and use... but who? Why? I have never seen anyone, and for good reason. And it seems like AMD just... doesn't care?

Like, it's just not a part of their organizational strategy... Compute is on the powerpoint slides, but... no one (can) use it?

It's been going on for years now and I don't get it.


Especially not with NVidia leaving the front door wide open with their licensing terms and other oversights. But as long as AMD doesn't care enough NVidia will grow and become more and more entrenched. It would take a small miracle to displace them by now.


I think the main workhorse supercomputer GPUs are not out yet.

They announced the MI200 GPU with 128GB of memory, and two supercomputers (Setonix, Frontier) are supposed to include them, but both will only be launched next year https://www.tomshardware.com/news/setonix-supercomputer-mi20...


Also announced very recently: the MI250 with 47 FP32 TFLOPS? Just the morning before the GTC keynote? NVIDIA Lovelace is supposed to arrive in 2022 (I think?), so it's a good time to ship GPGPU hardware. But wait and see whether AMD can ship them, at what prices, and whether their library ecosystem (not even thinking ROCm or HIP... even BLAS and MIOpen would be enough to start...) has fully optimized support.


There's also no Windows support, so you can't use it to make consumer programs, or on your gaming rig without dual-booting. It's made for data centers with bespoke software, not really for distribution.


I have been using ROCm for 2+ years. The investment in this infrastructure was a big mistake. The biggest burden was the need to do a clean install on each new ROCm release. Clean here means manually finding and deleting all traces of the previous ROCm version, plus recompiling apps like PyTorch. Good upgrades took hours, bad ones days... Finally I settled on freezing the system and not touching it anymore until the cards are retired, hopefully soon.


Could you describe what is involved in installing a ROCm release from scratch? (I've mostly stayed in CUDA-land, but I'm curious and intrigued about the idea of ROCm, and AMD GPUs, though your experience suggests there is much room for improvement...)


This is a problem with Nvidia too. I just made all my infra easy to reprovision and start clean, and ran workloads as containers.


Interesting, can you describe this in a bit more detail? It runs completely counter to my experience, so far NVidia for me has just been a long string of 'boring' in that it just works. Even applications written for older cards and older versions of CUDA have continued to work just fine.


It was so bad that we just moved to immutable GPU infrastructure, regardless of physical or virtual. When a new release of all the Nvidia stuff comes out, we re-image the machine and install it.

CUDA on Linux with ML/GPU workloads is still kind of a hot mess, and I'd say we're far from finding a winner like some suggest here.

It's gotten better... but it's still far easier to treat it like a mess and start fresh with any install.


Is this the 3rd or 4th GPGPU programming framework to come out of AMD? They will never get any adoption at this rate.


https://www.reddit.com/r/Amd/comments/r1gb05/radeon_6600xt_c... shows some numbers on ROCm performance.


Having to choose between Steam support and ROCm drivers is a pain - it stops tinkering.

Almost everyone on Linux will have experience of breaking their drivers at some point, and installing another alternate set is a big risk.

It seems silly to not have OpenCL and HIP access without having to use this alternate stack.


The ROCm and AMDGPU PRO stacks were unified with ROCm 4.5 and AMDGPU 21.40. I would expect Steam to work. That was just a couple weeks ago, but have you tried it out?


Is there some reason AMD seems uninterested (at best) in supporting an open source compute stack? Not everyone is running one of the distros supported by ROCm which will discourage more than a few users.


AMD is not a software company. They barely support their current products. Everything is slow and full of bugs. AFAIK they had a thing where they fired GPU driver developers before deciding to rewrite everything from scratch, losing expertise and institutional knowledge, and the new drivers were slower for a long time. They even managed to fuck up the AMD Platform !Security! Processor (PSP) chipset driver, allowing attackers to dump random memory pages.


AMD isn't even interested in supporting their best selling cards with ROCm. The whole thing seems to be illogical and against their own interests.


Isn't 4.5 also the version that kicked Vega64 to the curb?

So even if I wanted, I can't. Sincerely, I'm fed up with AMD's attitude towards compute.


> Isn't 4.5 also the version that kicked Vega64 to the curb?

ROCm 4.5 is the _last_ version to support the Vega10 ASIC (MI25, Vega56, Vega64).

https://github.com/RadeonOpenCompute/ROCm/#amd-instinct-mi25...

The next ROCm release after 4.5 is sometime in Q1 next year, so its planned death is really soon.

It is transitioning to _that_ comical AMD "enabled in the codebase but not tested and not supported" state, rotting slowly like Polaris support did.


The MI25 line predates when I joined AMD, so I don't know much about it. Were they for sale to the general public?

I was concerned about that as well, but I don't personally know anyone who owns a gfx900 card. I'm a little unclear on what impact it will have on the community.


> Were they for sale to the general public?

The MI25 wasn't targeted at the general public, but it wasn't hard to buy one. And the consumer products using that same die, Vega 56 and 64, sold quite a bit.

What affected the community severely might be the combination of both GFX8 and gfx900 going away, leaving only the MI50 (Vega20, also used in the Radeon VII consumer variant) and no support (only unofficial enablement) for the later consumer products, because those went GFX10/Navi.


Thank you. I found your post helpful for better understanding the situation.


Shouldn't it work on top of the open source drivers?


It does, at least in my case (AMD RX580 with the distro amdgpu driver).
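
A quick way to confirm the compute stack is visible on top of the open driver; a sketch assuming pyopencl is installed:

    # List OpenCL platforms and devices; a ROCm or Mesa OpenCL install
    # should show the card here.
    import pyopencl as cl

    for platform in cl.get_platforms():
        print("Platform:", platform.name)
        for device in platform.get_devices():
            print("  Device:", device.name)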


Add an indication as to where 'ROCm' comes from. Wikipedia redirects to [0]. Still have no clue where the 'm' comes from (is it the last letter of platfor'm'?)

[0] https://en.wikipedia.org/wiki/GPUOpen#Radeon_Open_Compute_(R...


It used to be "Radeon Open Compute platform" with the m coming from platform. It's now "Radeon Open Ecosystem" with the c and the m from ecosystem [1].

The old expansion still shows up in a few places where it's difficult to remove.

[1]: https://github.com/ROCmSoftwarePlatform/rocBLAS#rocblas


Ah so (R)adeon (O)pen E(c)osyste(m) stylized ROCm. I can see why it's not often spelled out, thanks.


There should be some Rust based generic GPU programming solution that's not tied to any specific GPU. That should be able to replace CUDA and Co. in the long run.



