I bet automating this part will be simple. In general, LLMs that have a given semantic ability "X" at some task tend to have greater-than-X ability at checking, among N replies to that same task, which reply is best, especially via a binary tournament like RAInk did (it was posted here a few weeks ago). There is also the possibility of using agreement among different LLMs. I'm surprised Gemini 2.5 Pro was not used here; in my experience it is the most powerful LLM for this kind of thing.
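
The tournament idea is basically: pair up the N replies, ask the model which of each pair is better, and repeat with the winners until one is left. A minimal sketch of that loop (the judge_better helper is a placeholder for an actual LLM comparison prompt, not anything taken from RAInk):

    import random

    def judge_better(a: str, b: str) -> str:
        # Placeholder judge: in practice this would prompt an LLM with
        # both replies ("which of these two is better?") and parse the
        # answer. Here we just keep the longer reply as a stand-in.
        return a if len(a) >= len(b) else b

    def binary_tournament(replies: list[str]) -> str:
        # Single-elimination bracket: pair replies, keep the judged
        # winner of each pair, repeat until one reply remains.
        pool = replies[:]
        random.shuffle(pool)  # avoid position bias from the original order
        while len(pool) > 1:
            winners = []
            for i in range(0, len(pool) - 1, 2):
                winners.append(judge_better(pool[i], pool[i + 1]))
            if len(pool) % 2 == 1:
                winners.append(pool[-1])  # odd one out gets a bye
            pool = winners
        return pool[0]

    print(binary_tournament(["draft A", "a longer draft B", "draft C"]))

That's only N-1 pairwise comparisons, which is why it scales better than asking the model to rank all N replies at once.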