It's disappointing to see that the benchmark results are so opaque. I hope we see reproducible results soon, ideally from Mistral themselves.
1. We don't know what the evaluation setup is. It's very possible that the ranking would be different with a bit of prompt engineering.
2. We don't know how large each dataset is (or even how the metrics are calculated and aggregated). The metrics are all reported as XY.ZW%, but it's very possible that the .ZW% -- or even the Y.ZW% -- is just noise; see the back-of-the-envelope sketch after this list.[1]
3. We don't know how the datasets were mined or filtered. Mistral could have (even accidentally!) filtered out precisely the data points their model struggles with. (E.g., imagine a well-meaning engineer testing a document with Mistral OCR first, finding that it doesn't work, deducing that it's probably bad data, and removing it.)
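To make point 2 concrete, here's a rough back-of-the-envelope sketch (my own illustration, not anything from Mistral's report): the standard error of an accuracy measured on n samples is sqrt(p(1-p)/n), so the trailing digits of XY.ZW% only carry information once n gets fairly large. The 95% figure and the sample sizes below are assumptions for illustration.

  import math

  def accuracy_std_error(p: float, n: int) -> float:
      """Standard error, in percentage points, of an accuracy p measured on n samples."""
      return 100 * math.sqrt(p * (1 - p) / n)

  for n in (200, 1_000, 10_000, 100_000):
      se = accuracy_std_error(0.95, n)  # assume a reported ~95% accuracy
      print(f"n={n:>7,}: ~95% accuracy is roughly +/-{1.96 * se:.2f} pts (95% CI)")

  # n=    200: +/-3.02 pts -> even the units digit is shaky
  # n=  1,000: +/-1.35 pts -> the Y in XY.ZW% is borderline
  # n= 10,000: +/-0.43 pts -> the .Z starts to be meaningful
  # n=100,000: +/-0.14 pts -> only here does .ZW% begin to matter

So unless each benchmark has on the order of tens of thousands of documents, differences in the second decimal place tell us essentially nothing.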
[1] https://medium.com/towards-data-science/digit-significance-i...