It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, ideally from Mistral themselves.
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated and aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even the Y.ZW% -- is just noise; see the back-of-the-envelope sketch after this list.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out precisely the data points their model struggles with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding that it doesn't work, deducing that it's probably bad data, and removing it.)
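To make point 2 concrete, here's a rough back-of-the-envelope sketch (my own illustration, not anything from Mistral's report): the standard error of an accuracy measured on n samples is sqrt(p(1-p)/n), so the trailing digits of XY.ZW% only carry information once n gets fairly large. The 95% figure and the sample sizes below are assumptions for illustration.

  import math

  def accuracy_std_error(p: float, n: int) -> float:
      """Standard error, in percentage points, of an accuracy p measured on n samples."""
      return 100 * math.sqrt(p * (1 - p) / n)

  for n in (200, 1_000, 10_000, 100_000):
      se = accuracy_std_error(0.95, n)  # assume a reported ~95% accuracy
      print(f"n={n:>7,}: ~95% accuracy is roughly +/-{1.96 * se:.2f} pts (95% CI)")

  # n=    200: +/-3.02 pts -> even the units digit is shaky
  # n=  1,000: +/-1.35 pts -> the Y in XY.ZW% is borderline
  # n= 10,000: +/-0.43 pts -> the .Z starts to be meaningful
  # n=100,000: +/-0.14 pts -> only here does .ZW% begin to matter

So unless each benchmark has on the order of tens of thousands of documents, differences in the second decimal place tell us essentially nothing.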
[1] https://medium.com/towards-data-science/digit-significance-i...