If you have a judge system, and Mistral performs well on other tests, wouldn't you want to include it so if it scores the highest by your judges ranking it would select the most accurate result? Or are you saying that mistral's image markdown would score higher on your judge score?
We'll definitely be doing more tests, but the results I got on the complex tests would result in a lower score and might not be worth the extra cost of the judgement itself.
In our current setup Gemini wins most often. We enter multiple generations from each model into the 'tournament', sometimes one generation from gemini could be at the top while another in the bottom, for the same tournament.