
I welcome new models! The more, the merrier.

That said, this model has been fine-tuned, yet they compare it against a non-finetuned LLaMA-7B in their benchmarks? That seems a bit fainthearted.



> HumanEval: InternLM-7B: 10.4, LLaMA-7B: 14.0

The funny part is that the base model apparently outperforms the fine-tune.

So far HumanEval seems to be the only benchmark that can objectively compare overall model performance, despite being coding-only; the rest mostly just give bullshit "99.7% of ChatGPT" results. Turns out you can't compare creative writing, because basically every output is valid.
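For context, the reason HumanEval scores are comparable at all is that each completion is simply executed against unit tests, so a sample either passes or it doesn't. Roughly something like this sketch (toy harness and made-up problem, not the paper's sandboxed runner):

  import math

  def run_candidate(candidate_code, test_code, entry_point):
      """Run one generated completion against its unit tests.
      Toy harness: the real HumanEval harness sandboxes execution
      and enforces timeouts, but the pass/fail idea is the same."""
      ns = {}
      try:
          exec(candidate_code, ns)        # defines the candidate function
          exec(test_code, ns)             # defines check(candidate)
          ns["check"](ns[entry_point])    # raises AssertionError on failure
          return True
      except Exception:
          return False

  def pass_at_k(n, c, k):
      """Unbiased pass@k estimator from the HumanEval paper:
      1 - C(n-c, k) / C(n, k), given n samples with c correct."""
      if n - c < k:
          return 1.0
      return 1.0 - math.comb(n - c, k) / math.comb(n, k)

  # Hypothetical toy problem, not from the real dataset.
  candidate = "def add(a, b):\n    return a + b\n"
  tests = "def check(f):\n    assert f(2, 3) == 5\n    assert f(-1, 1) == 0\n"
  ok = run_candidate(candidate, tests, "add")
  print(ok, pass_at_k(n=1, c=int(ok), k=1))   # True 1.0

The real dataset has 164 such problems, and the reported score is pass@1 averaged over all of them.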



