The funny part is that the base model apparently outperforms the fine-tuned one.
So far, HumanEval seems to be the only benchmark that can objectively compare overall model performance, despite being coding-only; the rest mostly just give bullshit "99.7% of ChatGPT" results. Turns out you can't compare creative writing, because essentially every output is valid.
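For reference, the reason a coding benchmark can be objective is that each sampled completion either passes the hidden unit tests or it doesn't, and HumanEval reports pass@k over those binary outcomes. A minimal sketch of the standard pass@k estimator (the sample counts in the example are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator used with HumanEval:
    n = samples generated per problem, c = samples that passed all
    unit tests, k = evaluation budget. Returns the estimated
    probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185 == 37/200
print(pass_at_k(n=200, c=37, k=10))  # higher, since any of 10 tries may pass
```

There's no analogue of that pass/fail signal for free-form creative output, which is why those comparisons collapse into vibes.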
That said, this model has been fine-tuned, yet they compare it against the non-fine-tuned LLaMA-7B in their benchmark? That seems a bit fainthearted.