I believe it's more likely that they used Anthropic's human preference data [1] or something similar, and with it the Anthropic (progressive-American) notion of helpful-honest-harmless behavior. Accordingly, I've seen such models misgeneralize towards prudish finger-wagging. For example, they parse bad words like "beat", "abuse", or "steal" in morally neutral contexts ("beat a benchmark" and the like) as signifiers of serious transgression and spiral into lecturing me that, as language models, they insist it's never okay to... etc. That attitude was strikingly reminiscent of American models, even though other failure modes, like hallucinations, don't seem so similar.
Papers like Tulu [2] suggest that LLaMA-65b is indeed an appropriate baseline, given reasonable prompting. Instruct datasets only convey a flavor of responses, and for a strong foundation model that can infer the intended flavor on its own, naive finetuning seems to be detrimental. GPT-4 was reportedly much more powerful before it was finetuned, if accounts from early testers and researchers are to be believed.
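To make "reasonable prompting" concrete, here's a minimal sketch of eliciting assistant-style behavior from a base (non-instruct) model with nothing but a plain role preamble. The checkpoint id and template are my own illustrative assumptions, not the exact setup from the Tulu paper:

```python
# Sketch: prompting a base model to act like an assistant without any finetuning.
# Checkpoint id and preamble are illustrative assumptions, not the Tulu setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-65b"  # assumed checkpoint; any strong base model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# A plain role-style preamble is often enough for a strong foundation model
# to infer the intended "assistant" flavor on its own.
prompt = (
    "Below is a conversation between a helpful assistant and a user.\n\n"
    "User: Summarize the causes of the French Revolution in two sentences.\n"
    "Assistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Print only the newly generated continuation, not the prompt itself.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The point is that the template carries the "flavor" an instruct dataset would otherwise bake in, which is why a well-prompted base model is a fair baseline.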
1. https://huggingface.co/datasets/Anthropic/hh-rlhf
2. https://arxiv.org/abs/2306.04751