
"No, I will not tell you how I did it. Learn to respect the unknown unknowns."



Which, given his general mission of making sure hostile AI DOESN'T take over the world, is a bit self-defeating. The easiest way to inoculate yourself against a persuasive technique is to be aware of it ahead of time. If you want to keep an AI in the box, you should absolutely release every successful log.


No, the idea is that the AI box is fundamentally flawed. I believe he advocates engineering the AI to be fundamentally safe, such that no box is required.

Personally, I think an AI would need far more intelligence and complexity than anything we're getting for the foreseeable future to be capable of breaking out of the box, or even of possessing the "desire" to (the experiment assumes a motivated AI with almost limitless knowledge of the world and cleverness), so this thought experiment may not be so relevant. I agree with him that the boxing approach is not a robust one, though.


> The easiest way to inoculate yourself against a persuasive technique is to be aware of it ahead of time.

Or you can avoid being exposed to it. If you think you already know all the techniques an AI might use against you, you're less likely to avoid that exposure in the first place.

The point of the experiment isn't "let's work out how an AI might try to persuade us to let it out". It's "even a human intelligence can persuade people who think they could never be persuaded; do you really trust yourself to do better against a superhuman one?"

If you don't know why the gatekeeper failed, it's harder to come up with bullshit reasons why you would have succeeded in that position.


Not necessarily. As a lighter example, would it be beneficial to give a mass murderer in jail a communication channel to the outside world? What if he used it to publish the message "I'll give $10 million to anybody who breaks me out of jail", or something more sinister?

Edit: huh, downvotes? Yudkowsky thinks there are certain things that AIs could say that should not be known. I think that is why he doesn't want to publish the dialogues: it would give the AI a public communications channel. While the AI is fictional, it could talk about a hypothetical future real self... Instead of promising something to get itself out of jail, the fictional AI could say something to make you make it real. Anyway - if it is over your head, fine, but why downvote just because you don't understand something?

Edit2: Sometimes I wonder if I already have my personal Hacker News AI that automatically downvotes everything I write...


The AI won't be limited to techniques that you could think of, or techniques that Eliezer could think of. So you'd only get a false sense of security.

Besides, releasing a successful log might be a bad idea for other reasons. Think about how you'd play this game as an AI. You wouldn't go looking for a general-purpose mindfuck, because there's probably no such thing. Instead, you would probably spend about a month gathering real-life information about the gatekeeper's history, family, weaknesses, etc. You'd read books on manipulation and sales techniques, and pick the strongest ones you can find. You would brainstorm possible tactics and run tests. At the end of the month you'd have a 4-hour script of all the unfair moves you could use against that person, arranged in the most effective order. (That's why it's a bad idea to play this game with friends.) Do you really want that information to be released? And if you know ahead of time that it will be released, won't that limit your effectiveness?


So you reckon that, as the AI player, he blackmailed the gatekeeper player? "Let me out or I'll tell your friends/family/co-workers X about you" type of thing?


It's more about finding buttons to push. For example, Justin Corwin won one of his games against a religious woman by telling her that she shouldn't play God by keeping him locked up for a subjective eternity (it was more involved, but you get the point). You could come up with other tactics if you know the gatekeeper is divorced, or donates to charity, or is an immigrant, etc. Really, you'll be surprised by how much progress you can make on an "impossible" problem if you just spend five minutes thinking without flinching away.


"Homeopathy works. Learn to respect the unknown unknown"


Personally, I think it's highly suspicious that he hasn't released the logs. Maybe he did break the rules; who knows.


Isn't it by definition impossible to respect the unknown unknowns, on account of their being unknown?





