I've seen the CanAiCode leaderboard several times (and used many of the models listed), but I wouldn't use it to pick a model. It's not a bad list, but the benchmark is too limited. The results are not accurately ranked from best to worst.
For example, the DeepSeek 33b model is ranked 5 spots lower than the 6.7b model, even though the 33b model is clearly stronger. Similarly, WizardCoder 15b sits near the top while WizardCoder 33b is ranked 26 spots lower, which is a wildly inaccurate ordering.
It's worth noting that both of those 33b models score in the 70s on HumanEval and HumanEval+, while the 15b model scores in the 50s.