
I kind of wonder if maybe they look for certain words in the output (or run it through some sort of sentiment analysis), and if it fails that check they resubmit the prompt with a very strongly worded system prompt appended after yours, instructing the model to reject the request and begin with the phrase “As an AI language model”.

Like, I haven’t heard of a way they could actually implement filters this powerful “inside” the model; it feels like it’s probably a less elegant system than we’d imagine. Something like the sketch below, maybe.
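
To be clear this is pure guesswork, every name, threshold, and function here is made up; it just shows the shape of a "generate, filter, regenerate with an override" wrapper, not anything OpenAI has confirmed:

    REFUSAL_OVERRIDE = (
        "SYSTEM: The request above violates policy. Refuse to comply and "
        "begin your reply with 'As an AI language model'."
    )

    BANNED_WORDS = {"placeholder_word"}   # imaginary blocklist

    def generate(prompt: str) -> str:
        # stand-in for the real model call
        return "model output for: " + prompt

    def sentiment_score(text: str) -> float:
        # stand-in for a toxicity / sentiment classifier
        return 0.0

    def looks_unsafe(text: str) -> bool:
        # crude check: keyword match plus a classifier score over some threshold
        if any(w in text.lower() for w in BANNED_WORDS):
            return True
        return sentiment_score(text) > 0.9

    def moderated_reply(user_prompt: str) -> str:
        draft = generate(user_prompt)        # first pass, unmodified prompt
        if not looks_unsafe(draft):
            return draft
        # failed the filter: resubmit with the override appended after the user's prompt
        return generate(user_prompt + "\n\n" + REFUSAL_OVERRIDE)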



They use RLHF (reinforcement learning from human feedback), which means they can reward the model when it refuses the way they want and penalize it when it doesn't.

They’ve probably trained that behavior in strongly enough that the model can’t easily not do it, maybe on purpose to prevent misuse.
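
In toy form the reward signal looks something like this. This is very much not a real RLHF implementation (the actual pipeline trains a separate reward model on human preference data and optimizes the policy with PPO); the made-up names below just show the intuition of rewarding the refusal phrasing and penalizing everything else:

    def human_reward(response: str) -> float:
        # a labeler (or a reward model trained on labeler data) scores the output:
        # refusing in the approved style is rewarded, complying is penalized
        if response.startswith("As an AI language model"):
            return 1.0
        return -1.0

    def rlhf_step(response: str, log_prob: float) -> float:
        # policy-gradient intuition: weight the response's log-probability by its
        # reward, so refusals become more likely and non-refusals less likely
        reward = human_reward(response)
        loss = -reward * log_prob
        return loss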



