
This is a much better leaderboard: https://evalplus.github.io/leaderboard.html

I've seen the CanAiCode leaderboard several times (and used many of the models listed), but I wouldn't use it to pick a model. It's not a bad list, but the benchmark is too limited, and the results aren't accurately ranked from best to worst.

For example, the DeepSeek 33b model is ranked five spots lower than the 6.7b model, even though the 33b model is definitely better. WizardCoder 15b sits near the top while WizardCoder 33b is ranked 26 spots lower, which is a wildly inaccurate ranking.

It's worth noting that those 33b models score in the 70s for HumanEval and HumanEval+ while the 15b model scores in the 50s.
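For anyone curious where those HumanEval/HumanEval+ numbers come from, here's a rough sketch using the evalplus package (pip install evalplus), based on its README; the generate_completion function is a hypothetical stand-in for whatever model you're testing, not part of the library:

    # Sketch: scoring a model on HumanEval+ with the evalplus package.
    # generate_completion() is a placeholder for your own model call;
    # the rest follows the evalplus README.
    from evalplus.data import get_human_eval_plus, write_jsonl

    def generate_completion(prompt: str) -> str:
        # Placeholder: call your model (API or local inference) and
        # return the completed solution for the given prompt.
        raise NotImplementedError

    samples = [
        dict(task_id=task_id, solution=generate_completion(problem["prompt"]))
        for task_id, problem in get_human_eval_plus().items()
    ]
    write_jsonl("samples.jsonl", samples)

    # Then score pass@1 against both the base and plus test suites:
    #   evalplus.evaluate --dataset humaneval --samples samples.jsonl

The "plus" score is typically a few points lower than plain HumanEval because HumanEval+ adds many extra test cases that catch solutions the original tests let through.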

Thanks