You may disagree with this take, but it's not uninformed. Many LLMs use self-supervised pretraining followed by RL-based fine-tuning, but that's essentially it: it's fine-tuning.
I think you're seriously underestimating the impact of the RL steps on LLM performance.
Also, how do you think the most successful RL models have worked? AlphaGo and AlphaZero both rely on neural networks for policy and value estimation, which are the central mechanism of those models.