
I would run simple llama.cpp batch jobs for about 10 minutes before they would suddenly fail and require a restart. I'd see random VM_L2_PROTECTION_FAULT errors in dmesg, something having to do with doorbells. I did report this, but never heard back from them.



Did you run on the blessed Ubuntu version with the blessed kernel version and the blessed driver version? Otherwise you really are on a development branch.

If you can point me to a repro I'll add it to my todo list. You can probably tag me in the GitHub issue if that's where you reported it.


FWIW, the one I run into, which also mentions L2 and doorbells, is https://github.com/ROCm/ROCm/issues/2196.


> blessed Ubuntu version with the blessed kernel version

To an SRE, this is a nightmare to read. CUDA is bad in this regard (it can often prevent major kernel version updates), but this is worse.


I feel like this goes both ways. You also don't want to have to run bleeding edge for everything because there are so many bugs in things. You kind of want known stable versions to at least base yourself off of.


Would you like to share the model of GPU and versions of various software used?

George has a nice explanation of doorbells:

https://youtu.be/AqPIOtUkxNo?feature=shared&t=968


Same here with SD on 7900XTX. Most of the time for me it's sufficient to reset the card with rocm-smi --gpureset -d 0.
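For anyone who wants to automate that workaround, here is a minimal sketch, not something I'd call robust. It assumes the job signals the failure through a nonzero exit code and that resetting device 0 with the rocm-smi command above is the right recovery. The function name, the retry count, and the injectable `run` parameter are all made up for illustration.

```python
import subprocess

# The reset command quoted above; device index 0 is an assumption
# that only holds if the failing card is the first one.
RESET_CMD = ["rocm-smi", "--gpureset", "-d", "0"]


def run_with_gpu_reset(job_cmd, max_attempts=3, run=subprocess.run):
    """Run job_cmd, resetting the GPU and retrying when it exits nonzero.

    Returns the attempt number that succeeded. Raises RuntimeError if
    every attempt fails. `run` is injectable so the retry logic can be
    exercised without a GPU (hypothetical helper, not part of ROCm).
    """
    for attempt in range(1, max_attempts + 1):
        result = run(list(job_cmd))
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Best effort: the reset's own exit code is ignored here.
            run(RESET_CMD)
    raise RuntimeError(f"job still failing after {max_attempts} attempts")
```

Calling `run_with_gpu_reset([...])` with your own job command line would retry up to three times. Since the reset only works "most of the time", the final RuntimeError is where you'd fall back to a reboot.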


Only "most of the time"? :(

You'd hope at $15,000+ per unit, you wouldn't have to reset it at all...


It's $1000 per, no? This is one of the gaming cards.


Yep, bought for $1000.

To be honest, even at that price point it still shouldn't be needed.

AMD are lucky everyone expects this nowadays, or people might consider suing.



