> AI is not safe, and is not aligned to human interests
It is “aligned” to human utterances instead. We don’t want AIs to actually be human-like in that sense, yet we train them on the entirety of human digital output.
The current state of the art is RLHF (reinforcement learning from human feedback): the model is first trained to complete human utterances, then fine-tuned to maximize human feedback on whether the completion was "helpful" etc.
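To make that two-stage recipe concrete, here is a minimal sketch of the feedback-maximization loop. Everything in it is a toy stand-in, not anything from the source: the policy is a learnable unigram distribution rather than a pretrained language model, and `reward_model` is a hypothetical proxy for a model fit to human "helpful / not helpful" labels. Production RLHF fine-tunes the pretrained model with PPO plus a KL penalty toward the original policy; this uses plain REINFORCE just to show the shape of "maximize human feedback".

```python
import torch

VOCAB, SEQ_LEN, STEPS = 50, 8, 300

# Toy "policy": a learnable unigram distribution over a tiny vocabulary,
# standing in for a language model pretrained to complete human utterances.
logits = torch.zeros(VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

def reward_model(tokens: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for a reward model trained on human
    # preference labels: pretend raters prefer completions made
    # of tokens with id < 10.
    return (tokens < 10).float().mean()

for _ in range(STEPS):
    dist = torch.distributions.Categorical(logits=logits)
    completion = dist.sample((SEQ_LEN,))               # sample a "completion"
    reward = reward_model(completion)                  # human-feedback proxy
    loss = -reward * dist.log_prob(completion).sum()   # REINFORCE step
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, sampled completions skew toward the "preferred" tokens.
print(reward_model(torch.distributions.Categorical(logits=logits).sample((SEQ_LEN,))))
```

Note that nothing in this loop optimizes for human *interests*: the policy simply drifts toward whatever the reward proxy scores highly, which is the gap the quoted claim is pointing at.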