The most common elephant foot gun in the room is buggy processes letting queues ...

pas · on Sept 19, 2024

can you elaborate on the details? I have some memories about running OpenStack where Rabbit "was slow", but we never figured out why. mnesia is the storage layer?

nurettin · on Sept 19, 2024

Yes, it was using mnesia as the storage layer, and if I had a few dozen queues with a few hundred messages each, it caused timeouts in some clients (celery/kombu is an example).

I decided to add expiry policies to each queue so that the system cleans itself of stale messages, and that fixed all the message-dropping issues.
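
For anyone wanting to try the same thing, here is a rough sketch of what such a policy could look like when set through the management HTTP API. It assumes "expiry" means a per-message TTL (the `message-ttl` policy key, in milliseconds), and the host, policy name, pattern and one-hour TTL are all made-up placeholders, not what I actually ran. The function only builds the request; sending it (with `urllib.request.urlopen` plus credentials) is left out.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_ttl_policy_request(base_url, vhost, name, pattern, ttl_ms):
    """Build (but do not send) a PUT request defining a message-TTL policy.

    All argument values are supplied by the caller; nothing here is taken
    from a real deployment.
    """
    # The vhost is percent-encoded in the path, so the default "/" becomes "%2F".
    url = f"{base_url}/api/policies/{quote(vhost, safe='')}/{quote(name, safe='')}"
    body = {
        "pattern": pattern,                     # regex matched against queue names
        "definition": {"message-ttl": ttl_ms},  # messages older than this are dropped
        "apply-to": "queues",
    }
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical values: local broker, default vhost, every queue, 1 hour TTL.
req = build_ttl_policy_request(
    "http://localhost:15672", "/", "expire-stale", ".*", 3_600_000
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A policy is handy here because it applies to existing queues matching the pattern, without redeclaring them from the client side.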

The 4.0 changelogs state that they are switching to a new k/v store (moving it from experimental to default).

pas · on Sept 19, 2024

Thanks for the details!

Yep, similar symptoms. (OpenStack's services are also written in Python, or at least were back then, so probably similar to Celery.) We had regular problems with RMQ restarting. (Unfortunately I can't recall whether it was due to OOM or just some BEAM timeout.)

A few hundred messages in a few dozen queues seem ... inconsequential. I mean whatever on-disk / in-memory data structure mnesia has should be able to handle ~100K stale messages ... but, well, of course there's a reason they switched to a new storage component :)

rhodin · on Sept 19, 2024

Mnesia is _not_ the storage layer for messages (except for delayed messages).

Mnesia stores vhosts, users, permissions, queue definitions and more. This is being transitioned to Khepri, which improves a lot of things (maybe most importantly netsplits) but not directly message speeds.