I would run simple llama.cpp batch jobs for 10 minutes when it would suddenly fail, and require a restart. Random VM_L2_PROTECTION_FAULT in dmesg, something having to do with doorbells. I did report this, never heard back from them.
Did you run on the blessed Ubuntu version with the blessed kernel version and the blessed driver version? As otherwise you really are in a development branch.
If you can point me to a repro I'll add it to my todo list. You can probably tag me in the github issue if that's where you reported it.
I feel like this goes both ways. You also don't want to have to run bleeding edge for everything because there are so many bugs in things. You kind of want known stable versions to at least base yourself off of.