What Is AMD ROCm? (threedots.ovh)
99 points by todsacerdoti on Nov 25, 2021 | 56 comments



I tried ROCm. I bought a supported card (RX570/RX580 series). Within 12 months, AMD dropped support. Newer versions of ROCm didn't work with the card. Older versions didn't actually work either, since all other tooling assumed newer versions. Dependency hell. When things kinda started working in one context, where I could use old tooling (not the one I wanted to use ROCm in), CuPy was slower than the CPU, and then hard crashed my computer randomly. I read a web page claiming the card can either act as a HIP compute device or as a graphics card, but not both at the same time. I have no idea if that's right, but if it is, it's dumb.
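
For context, here's the kind of comparison I mean, as a minimal sketch; it assumes a working CuPy install and nothing beyond numpy:

    # Rough CPU-vs-GPU matmul timing; a sanity check, not a real benchmark.
    import time
    import numpy as np
    import cupy as cp

    a_cpu = np.random.rand(4096, 4096).astype(np.float32)
    a_gpu = cp.asarray(a_cpu)

    t0 = time.perf_counter()
    np.matmul(a_cpu, a_cpu)
    print(f"numpy (CPU): {time.perf_counter() - t0:.3f}s")

    cp.matmul(a_gpu, a_gpu)            # warm-up; first call compiles kernels
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    cp.matmul(a_gpu, a_gpu)
    cp.cuda.Device().synchronize()     # GPU calls are async; wait before timing
    print(f"cupy (GPU): {time.perf_counter() - t0:.3f}s")

On working hardware the GPU side should win easily at this size; on that card it didn't.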

AMD had no support. Card maker said this didn't fall under warranty. I got burned over and over.

I bought NVidia. It just worked.

I'm working on a potentially major piece of infrastructure, and AMD support is accumulating debt. If it had worked out of the gate, I imagine we would have kept support. Within 6 more months, we'll be NVidia-specific. AMD will be that much further in the hole for support.

I'd love for ROCm to win, since I think open is critical here. On the other hand, I can't imagine it will. AMD would need to run this as a loss leader for a while, and engineer this at a level to get this competitive with NVidia.

A half-baked product like ROCm seems like a money hole for everyone involved. Customers get burned, and I can't imagine AMD comes out positive.

In the meantime, NVidia is minting gold here.


Yep. This is the same lesson that I learned in 2013 and again in 2015 with OpenCL, both for professional apps and for programming. On the professional app side, there was a lot of "support" for OpenCL that had caveats severe enough to make it unusable. Sure, blender supports OpenCL! You just have to make a custom build and then the GPU target is slower than the CPU target. Sure, Adobe apps support OpenCL -- but any GPU render contexts are solid black boxes and forum posts indicate it has been that way for a year and nobody cared to fix it. Same thing with programming and debugging: there were slides suggesting feature parity with CUDA on a bunch of fronts that just weren't implemented or locked up the computer if you tried to actually use them. It got so bad that I walked away from my sunk costs, sold my AMD cards, ate the ebay tax, ate the nvidia tax, and bought cards that actually worked.

Now I'm in "twice bitten, once shy" mode with AMD. I hate paying the green tax as much as the next guy and I desperately want to have a second source of professional GPUs, but I'm not going to be the guinea pig. Not again. Not for the 3rd time. I want to see someone else successfully using AMD cards for common ML workflows and for blender before I even consider risking it again.


I'm not nearly as invested as you, but for my first "real" GPU compute project of any significant size and impact, I shocked my colleagues and picked OpenCL. All our hardware is nVidia, but I thought I'd make an effort to fight that vendor lock-in. And I find OpenCL quite pleasant! But… my god. OpenCL is a second-class citizen (at best!) on all three of the major platforms. The situation is dire. But the solution can't be to leave the world to CUDA.


> AMD would need to run this as a loss leader for a while, and engineer this at a level to get this competitive with NVidia.

Their problem has been that as little as five years ago they were a dying company. They didn't have the resources to do this right.

That's no longer the case, but once you have the money there is still a lag between then and when the release funded by that money comes out. And even then they're fighting an uphill battle against the perceptions created during their dark age.

Probably the biggest thing they have working for them is Nvidia's behavior. Proprietary everything and single vendor lock in makes everybody chafe, so as soon as they can produce something usable, everyone will want to use it.


I'd take "usable." If AMD was half the performance of NVidia, but open, stable, compatible, and robust, that would reach that bar for me, and I think for a lot of people.

I think that will be an increasingly hard bar to clear, as software becomes coupled to CUDA, though. AMD will be chasing a losing race. They won once with Intel, but this one feels harder....


Yeah, Nvidia is so far ahead at this point that I wouldn't really risk it on anything else. The problem is that all that troubleshooting adds up fast, and the whole DL golden years were built on CUDA. TensorFlow and PyTorch both "support" ROCm and HIP, but you run into weird issues very often. A lot of public repositories for recent architectures also come with their own CUDA kernels that you need to compile, so the vendor lock-in is very strong in my opinion.
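
For example, a quick way to tell what a given PyTorch build actually targets; a minimal sketch, relying on the fact that ROCm builds set torch.version.hip and expose HIP through the CUDA API surface:

    import torch

    # On ROCm/HIP builds torch.version.hip is a version string and
    # torch.version.cuda is None; on CUDA builds it's the reverse.
    print("CUDA version:", torch.version.cuda)
    print("HIP version:", torch.version.hip)

    # torch.cuda.is_available() returns True on ROCm too, since the HIP
    # backend masquerades as CUDA; part of why "support" gets murky.
    print("GPU available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))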

And even if AMD's offering were not an absolute dumpster fire, Google, Microsoft and Amazon all have their own accelerators that are maturing and will be more cost-effective in the long run.


Intel has done pretty much the same thing with their Xeon Phi series. Promising on paper, but without long-term support and dedication from the mothership, such efforts are doomed to fail. NVidia really gets it: people care about results, not necessarily how they get there or whether it's done in the most 'pure' way possible; if it 'just works', that's good enough.


Khronos's SYCL is the open alternative to CUDA. It's what Intel is using for their oneAPI; there's a CUDA backend and even a HIP backend.


It is not a real alternative, as it lacks the polyglot ecosystem and tooling from CUDA.

It is the usual Khronos pattern: define the base stuff and hope for the best regarding their partners.


Nice try, but there's still no way to compare with CUDA. Nvidia is so far ahead that it's hopeless to catch up.


It's something that has very little uptake because it's not supported on mainline GPUs?

I want to use it for compute on something like an RX 6800, and to my knowledge I can't.


I was going to post exactly the same. And if you look at the GitHub issues of their project, you will see that very often it looks like outsourced support teams comment on these issues with the standard "we will discuss this internally and get back to you" kind of response that enterprise Twitter support usually gives.

Not really what you expect from quality engineering. At the end of the day, these kinds of companies don't understand the value of developer and engineering clients as customers.

It's unfortunate really.

EDIT: here's an example:

   ROCmSupport commented on Feb 22 •

   Hi @powderluv
   Thanks for reaching us. I can not comment on RDNA2  support right now.
   We are working on adding a few more new hardware into ROCm environment.
   Please stay tuned via our documentation.
   Thank you.
   
   @ROCmSupport ROCmSupport closed this on Feb 22 

https://github.com/RadeonOpenCompute/ROCm/issues/1390#issuec...


Sounds like a race among support staff to see how many tickets they can close, to make themselves look good for the PM at the next review.


On the other hand, AMD has a fraction of the engineering resources Intel and Nvidia have. They need to make choices, and looking back at the last few years, it seems their choice to focus their efforts on hardware and gaming paid off.


ATI/AMD has always had finicky drivers and engineering decisions IMO.

I guess on the plus side they at least have a more open driver than NVidia (AFAIK nouveau doesn't get any support from them, at least AMD tries to maintain their open source driver on some level.)

And yet, every time I've tried an ATI/AMD Card, the driver experience even in windows has been pretty off-putting, and while I suppose we are finally at a point where one is less likely to be impacted by their issues with 768p overscan on TVs, I wonder what zany quirk they'll come up with next.


> AMD tries to maintain their open source driver on some level.

I think you somehow mistyped "AMD has open source drivers of absofuckinglutely excellent quality supporting hardware of the last ten years or so". No really, they are great. For graphics, that is.


On the flip side, I think the fact that they haven't focused on a compelling compute story means that anyone doing anything other than pure gaming is better served by an Nvidia card.


Yeah, I bought an AMD card a few years ago, when they released the new architecture, to support them. I ended up grabbing an Nvidia card because I don't play many games but I want to be able to run TensorFlow etc., and after a year AMD still had little support for any machine learning.


This is why I'm excited about Intel getting into the market.


That's not a valid reason to close a ticket. You close a ticket because the matter is resolved. We work with an internal rule with respect to comments on documents: if you open it, the default is that you are going to be the one to close it, when you are satisfied your concerns have been addressed. This sometimes gets overruled, but that definitely isn't the norm.


It's not officially supported, but I think it would work if you installed the official ROCm 4.5 packages. The RX 6800 is listed as gfx1030 [1], which has been shipping in most libraries since ROCm 4.2. I've heard there were a few bugs, but I've been using it for months without encountering any issues myself.
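
If you want to verify what gfx target your install actually detects, something like this works; a sketch assuming the rocminfo utility that ships with ROCm is on your PATH:

    # Print the gfx ISA targets ROCm reports; an RX 6800 should show gfx1030.
    import subprocess

    out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    targets = sorted({tok for line in out.splitlines()
                      for tok in line.split() if tok.startswith("gfx")})
    print("Detected targets:", targets)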

(I work for AMD on ROCm. All opinions are my own.)

[1]: https://llvm.org/docs/AMDGPUUsage.html#processors


Can you please emphasise to your management chain how important less terrible support and developer relations vis-à-vis GitHub are. Closing support questions is one thing, but basically anything in these repos gets closed as fast as possible, even feature requests and other things that should be left open.

I doubt they have the funding to meaningfully impact the overall hardware and software support matrix, but it would go a long way if they could just make the GitHub repos feel less like my days working at a call centre, raising tickets to a second-level support team in a foreign country whose only business KPI was tickets closed per day.


I agree with you. The communication between AMD and the community has been less than ideal.

I think it's worth noting, though, that it's not always as bad as the example in the sibling comment. The RadeonOpenCompute/ROCm repo catches a lot of questions about big features and the future direction of the project. Those are particularly difficult to answer as an engineer. As much as I'd like to, I can't make a product announcement in a GitHub issue.

If you have a specific technical problem and you open an issue on the repo for the corresponding component, you'll probably have a better experience. Some teams are more responsive than others, but that will at least maximize your odds of successful resolution.


It's worse because it has no intermediate state: there is no guarantee of forward compatibility (backward compatibility is also kinda broken). Shipping anything with HIP will be a pain.

With CUDA you simply target a specific CUDA version and there is full forward and backwards compatibility on any hardware that supports that version.
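
Concretely, these are the two facts you pin against on the CUDA side, and they stay meaningful across driver and hardware updates (a minimal sketch):

    import torch

    # The CUDA toolkit version this PyTorch build was compiled against.
    print("Built for CUDA:", torch.version.cuda)

    # The device's compute capability, e.g. (8, 6) for Ampere; binaries
    # built for an older capability keep running on newer hardware.
    if torch.cuda.is_available():
        print("Device capability:", torch.cuda.get_device_capability(0))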


AMD is much smaller in comparison, and their main focus with ROCm is to get PyTorch and TensorFlow to work with enterprise GPUs. Everything else is long tail in terms of scale.


This doesn't mean anything to potential customers. If I were in the market for GPU compute, Nvidia would be the only real option. AMD has had some PR person commenting on GitHub that they have "big news coming very soon" for Navi users, but it's been over a year now and still no news.


Sadly this is why I (and many others) keep begrudgingly choosing Nvidia cards instead.

Once the developers are familiar with CUDA, what are the chances you'd choose ROCm for deployment? Yeah, not great.

Consumer cards' ROCm support is strategic.


If this is the goal, they are better off working with Google on MLIR, since they are also focusing on hardware acceleration of PyTorch and TensorFlow code.


AMD's support - not just for new or old GPUs, but in terms of all sorts of compatibility changes, usability parity and regressions - is staggeringly bad.

Given that compute is important, and many people use their GPUs for compute of some sort, I simply can not understand how that part is so poorly executed on the part of AMD that, in terms of actual application, you might call it entirely absent.

I mean, this is a company that produces compute cards, which supposedly someone in the world must buy and use... but who? Why? I have never seen anyone, and for good reason. And it seems like AMD just... doesn't care?

Like, it's just not a part of their organizational strategy... Compute is on the powerpoint slides, but... no one (can) use it?

It's been going on for years now and I don't get it.


Especially not with NVidia leaving the front door wide open with their licensing terms and other oversights. But as long as AMD doesn't care enough NVidia will grow and become more and more entrenched. It would take a small miracle to displace them by now.


I think the main workhorse supercomputer GPUs are not out yet.

They announced the MI200 GPU with 128GB of memory, and two supercomputers (Setonix, Frontier) are supposed to include them, but both will only be launched next year https://www.tomshardware.com/news/setonix-supercomputer-mi20...


Also announced very recently: the MI250 with 47 FP32 TFLOPS? Just the morning before the GTC keynote? NVIDIA Lovelace is supposed to arrive in 2022 (I think?), so it's a good time to ship GPGPU hardware. But wait and see whether AMD can ship them, at what prices, and whether their library ecosystem (not even thinking ROCm or HIP... even BLAS and MIOpen would be enough to start...) has fully optimized support.


There's also no Windows support, so you can't use it to make consumer programs, or on your gaming rig without dual-booting. It's made for data centers with bespoke software, not really for distribution.


I have been using ROCm for 2+ years. The investment in this infrastructure was a big mistake. The biggest burden was the need to do a clean install on each new ROCm release. Clean here means manually finding and deleting all traces of the previous ROCm version, plus recompiling apps like PyTorch. Good upgrades took hours, bad ones days... Finally I settled on freezing the system and not touching it anymore until the cards are retired, hopefully soon.


Could you describe what is involved in installing a ROCm release from scratch? (I've mostly stayed in CUDA-land, but I'm curious and intrigued about the idea of ROCm, and AMD GPUs, though your experience suggests there is much room for improvement...)


This is a problem with Nvidia too. I just made all my infra easy to reprovision and start clean, and ran workloads as containers.


Interesting, can you describe this in a bit more detail? It runs completely counter to my experience, so far NVidia for me has just been a long string of 'boring' in that it just works. Even applications written for older cards and older versions of CUDA have continued to work just fine.


It was so bad that we just moved to immutable GPU infrastructure, regardless of physical or virtual. When a new release of all the Nvidia stuff comes out, we re-image the machine and install it.

CUDA on Linux with ML/GPU workloads is still kind of a hot mess, and I'd say we're far from finding a winner like some suggest here.

It's gotten better... but it's still far easier to treat it like a mess and start fresh with any install.


Is this the 3rd or 4th GPGPU programming framework to come out of AMD? They will never get any adoption at this rate.


https://www.reddit.com/r/Amd/comments/r1gb05/radeon_6600xt_c... shows some numbers on ROCm performance.


Having to choose between Steam support and ROCm drivers is a pain - it stops tinkering.

Almost everyone on Linux will have experience of breaking their drivers at some point, and installing another alternate set is a big risk.

It seems silly to not have OpenCL and HIP access without having to use this alternate stack.


The ROCm and AMDGPU PRO stacks were unified with ROCm 4.5 and AMDGPU 21.40. I would expect Steam to work. That was just a couple weeks ago, but have you tried it out?


Is there some reason AMD seems uninterested (at best) in supporting an open source compute stack? Not everyone is running one of the distros supported by ROCm which will discourage more than a few users.


AMD is not a software company. They barely support their current products. Everything is slow and full of bugs. AFAIK they had a thing where they fired GPU driver developers before deciding to rewrite everything from scratch, losing expertise and institutional knowledge, and the new drivers were slower for a long time. They even managed to fuck up the AMD Platform !Security! Processor (PSP) chipset driver, allowing attackers to dump random memory pages.


AMD isn't even interested in supporting their best selling cards with ROCm. The whole thing seems to be illogical and against their own interests.


Isn't 4.5 also the version that kicked Vega64 to the curb?

So even if I wanted, I can't. Sincerely, I'm fed up with AMD's attitude towards compute.


> Isn't 4.5 also the version that kicked Vega64 to the curb?

ROCm 4.5 is the _last_ version to support the Vega10 ASIC (MI25, Vega56, Vega64).

https://github.com/RadeonOpenCompute/ROCm/#amd-instinct-mi25...

The next ROCm release after 4.5 is sometime in Q1 next year, so its planned death is really soon.

It is transitioning to _that_ comical AMD "enabled in the codebase but not tested and not supported" state, rotting slowly like Polaris support did.


The MI25 line predates when I joined AMD, so I don't know much about it. Were they for sale to the general public?

I was concerned about that as well, but I don't personally know anyone who owns a gfx900 card. I'm a little unclear on what impact it will have on the community.


> Were they for sale to the general public?

The MI25 wasn't targeted at the general public, but it wasn't hard to buy one. And the consumer products using that same die, Vega 56 and 64, sold quite a bit.

What affected the community severely might be the combination of both GFX8 and gfx900 going away, leaving only the MI50 (Vega20, also used in the Radeon VII consumer variant) and no support (only unofficial enablement) for the later consumer products, because those went GFX10/Navi.


Thank you. I found your post helpful for better understanding the situation.


Shouldn't it work on top of the open source drivers?


It does, at least in my case (AMD RX580 with the distro amdgpu driver).
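
A quick way to confirm the compute stack is visible on top of the open driver; a sketch assuming pyopencl is installed:

    # List OpenCL platforms and devices; a ROCm or Mesa OpenCL install
    # should show the card here.
    import pyopencl as cl

    for platform in cl.get_platforms():
        print("Platform:", platform.name)
        for device in platform.get_devices():
            print("  Device:", device.name)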


Add an indication as to where 'ROCm' comes from. Wikipedia redirects to [0]. Still have no clue where the 'm' comes from (is it the last letter of platfor'm'?)

[0] https://en.wikipedia.org/wiki/GPUOpen#Radeon_Open_Compute_(R...


It used to be "Radeon Open Compute platform" with the m coming from platform. It's now "Radeon Open Ecosystem" with the c and the m from ecosystem [1].

The old expansion still shows up in a few places where it's difficult to remove.

[1]: https://github.com/ROCmSoftwarePlatform/rocBLAS#rocblas


Ah so (R)adeon (O)pen E(c)osyste(m) stylized ROCm. I can see why it's not often spelled out, thanks.


There should be some Rust based generic GPU programming solution that's not tied to any specific GPU. That should be able to replace CUDA and Co. in the long run.



