Great question. Language models are definitely beating the older tools now. Take a look at Gemini, for example.
Doctly runs a tournament-style judge. It runs multiple generations across several LLMs and picks the best one, which outperforms both single-generation and single-model approaches.
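
To make the idea concrete, here is a rough sketch of what a tournament-style judge could look like. This is not Doctly's actual code: the function names, the stub "models", and the length-based judge are all hypothetical stand-ins. In practice the generators would call real LLM APIs and the judge would likely be another model comparing two candidates against the source document.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a generator turns a document into one candidate output,
# and a judge returns whichever of two candidates it prefers.
Generator = Callable[[str], str]
Judge = Callable[[str, str, str], str]


def generate_candidates(
    document: str, generators: Sequence[Generator], runs_per_generator: int = 2
) -> List[str]:
    """Collect several generations from every generator (e.g. one per LLM)."""
    return [gen(document) for gen in generators for _ in range(runs_per_generator)]


def run_tournament(document: str, candidates: Sequence[str], judge: Judge) -> str:
    """Single-elimination bracket: pair candidates, keep each winner, repeat."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        next_round = []
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(document, current[i], current[i + 1]))
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # odd candidate out gets a bye
        current = next_round
    return current[0]


# Stub "models" (assumptions for illustration; real ones would call LLM APIs).
def stub_model_a(document: str) -> str:
    return " ".join(document.split()[:-1])  # simulates dropping the last word


def stub_model_b(document: str) -> str:
    return document  # simulates a faithful extraction


def toy_judge(document: str, left: str, right: str) -> str:
    # Toy rule: prefer the longer candidate. A real judge would be an LLM call.
    return left if len(left) >= len(right) else right


source = "Invoice 42 total 100 USD"
candidates = generate_candidates(source, [stub_model_a, stub_model_b])
winner = run_tournament(source, candidates, toy_judge)
print(winner)
```

Pairwise comparison is used here because asking a judge "which of these two is better?" tends to be an easier question than scoring each candidate in isolation, but that is a design assumption of this sketch, not something stated about Doctly.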