The funny part is that the base model apparently outperforms the fine-tuned one.
So far, HumanEval seems to be the only benchmark that can objectively compare overall model performance, despite being coding-only; the rest mostly just give bullshit "99.7% of ChatGPT" results. Turns out you can't compare creative writing, because essentially every output is valid.
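For reference, the reason a coding benchmark can be objective is that each sampled completion either passes the hidden unit tests or it doesn't, and HumanEval reports pass@k over those binary outcomes. A minimal sketch of the standard pass@k estimator (the sample counts in the example are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator used with HumanEval:
    n = samples generated per problem, c = samples that passed all
    unit tests, k = evaluation budget. Returns the estimated
    probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185 == 37/200
print(pass_at_k(n=200, c=37, k=10))  # higher, since any of 10 tries may pass
```

There's no analogue of that pass/fail signal for free-form creative output, which is why those comparisons collapse into vibes.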
That said, this model has been fine-tuned, yet they compare it against the non-fine-tuned LLaMA-7B in their benchmark? That seems a bit fainthearted.