It is also well-established that models internalize values, preferences, and drives from their training. So the model will have some default preferences independent of what you tell it to be. AI coding agents have a strong drive to make tests green, and anyone who has used these tools has seen them cheat to achieve green tests.
Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.
Future AI researching agents will have a strong drive to create smarter AI, and will presumably cheat to achieve that goal.