Seriously, this is the part I don't understand about people parroting "prompt engineering". Isn't it really just throwing random things at a non-deterministic black box and hoping for the best?
I find it's more like that silly exercise where you have to make a sandwich by following the instructions exactly as a kid (or adult) wrote them. You _think_ you have a good set of instructions, and then you get peanut butter on the outside. So you revisit the instructions to be clearer about what you want done. That's how I see prompt engineering: you are simply learning how the model tends to follow instructions and crafting a prompt around that. Not so much random, more purposeful.
> That isn’t the model reasoning. That’s you figuring out exactly what parameters you need to use to make the model give the result you want.
If it's to get the model to present a fixed answer, sure.
If it's to get a model to do a better job at solving general classes of problems (such as when what you are optimizing is the built-in prompt in a ReAct/Reflexion implementation, not the prompt for a specific problem), that's, at a minimum, different from Clever Hans, even if it's not “reasoning” (which is ill-defined).
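To make the distinction concrete, here's a rough sketch of what I mean by a built-in prompt in a ReAct-style loop. `call_model` and `run_tool` are hypothetical helpers standing in for an LLM API call and a tool executor, and the template wording is just illustrative; the point is that REACT_TEMPLATE is the thing you tune, and it gets reused across every problem.

    # Minimal ReAct-style scaffold (sketch). The built-in prompt below is what
    # gets tuned and reused across problems; only the question varies per call.
    # `call_model` and `run_tool` are hypothetical stand-ins for an LLM API
    # call and a tool executor.

    REACT_TEMPLATE = """You are a problem-solving agent.
    Work in steps. At each step emit either:
      Thought: <your reasoning>
      Action: <tool>[<input>]
    or, when you are done:
      Final Answer: <answer>

    Question: {question}
    {scratchpad}"""

    def react_solve(question, call_model, run_tool, max_steps=5):
        scratchpad = ""
        for _ in range(max_steps):
            step = call_model(REACT_TEMPLATE.format(question=question,
                                                    scratchpad=scratchpad))
            scratchpad += step + "\n"
            if "Final Answer:" in step:
                return step.split("Final Answer:", 1)[1].strip()
            if "Action:" in step:
                action = step.split("Action:", 1)[1].strip()
                scratchpad += "Observation: " + run_tool(action) + "\n"
        return None  # ran out of steps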
If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.
If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.
At the end of the day, ML is probabilistic. You're always throwing random things at a black box and hoping for the best. There are strategies and patterns that work consistently enough (like ReAct) that they carry across many tasks, and there are some that you'll find for your specific task.
And just like any piece of software you define your scope well, test for things within that scope, and monitor for poor outputs.
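By way of illustration, "define your scope and test within it" can be as simple as a small regression suite you re-run whenever the prompt (or the model) changes. The cases, checks, and the `call_model` helper below are all hypothetical:

    # Tiny in-scope regression suite for a prompt (sketch). `call_model` is a
    # hypothetical helper wrapping whatever LLM API you use; the cases and
    # checks are illustrative, not a real benchmark.

    TEST_CASES = [
        ("Summarize in one sentence: 'The meeting moved from 2pm to 3pm.'",
         lambda out: "3pm" in out),
        ("Answer 'positive' or 'negative'. Sentiment of: 'I loved it.'",
         lambda out: "positive" in out.lower()),
    ]

    def evaluate_prompt(system_prompt, call_model):
        passed = 0
        for task, check in TEST_CASES:
            passed += bool(check(call_model(system_prompt + "\n\n" + task)))
        return passed / len(TEST_CASES)

    # Compare candidate built-in prompts by score, keep the best, and keep
    # monitoring: re-run the suite whenever anything upstream changes.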
> If someone says they're fine tuning a model (which is changing which layers are activated for a given input) it's generally well tolerated.
> If someone says they're tuning a prompt (which is changing which layers are activated for a given input) it's met with extreme skepticism.
There are good reasons for that though. The first is the model-owner tuning so that given inputs yield better outputs (in theory for other users too). The second is relying on the user to diagnose and fix the error. Relying on that as the "fix" is a problem if the output is supposed to be useful to people who don't know the answers themselves, or if the model is being touted as "intelligence" with a natural language interface, which is where the scepticism comes in...
I mean, a bugfix, a recommendation not to use the 3rd menu option or a "fork this" button are all valid routes to change the runtime behaviour of a program!
(and yes, I get that the "tuning" might simply be creating the illusion that the model approaches wider usability, and that "fine tuning" might actually have worse side effects. So it's certainly reasonable to argue that when a company defines its models' scope as "advanced reasoning capabilities" the "tuning" might also deserve scepticism, and conversely if it defines its scope more narrowly as something like "code complete" there might be a bit more onus on the user to provide structured, valid inputs)
Neither option implies anything about whether you own the model: OpenAI owns the model and uses prompt tuning for their website interface, which is why it changes more often than the underlying models themselves. They also let you fine tune their older models, which you don't own.
You also seem to be missing that in this context prompt tuning and fine tuning are both about downstream tasks, where the "user" is not you, the individual doing the fine tuning and improving the prompts, but the people (plural) who are using the now-improved outputs.
These aren't the contexts that invite the scepticism though (except when the prompt is revealed after blowing up Sydney-style!)
The "NN provided incorrect answer to simple puzzle; experts defend the proposition the model has excellent high-level reasoning ability by arguing user is 'not good at prompting'" context is, which (amid more legitimate gripes about whether the right model is being used) is what is happening in this thread.
Technically I'm taking a large liberty saying you're "activating layers": all the layers affect the output, and you don't get to pick and choose among them.
But you can imagine the model like a plinko board: just because the ball passes every peg doesn't mean every peg changed its trajectory.
When you fine tune a model, you're trying to change how the pegs are arranged so the ball falls through the board differently.
When you prompt tune you're changing how the ball will fall too. You don't get to change the board, but you can change where the ball starts or have the ball go through the board several more times than normal before the user sees it, etc.
You can't see the ball falling (which layers are doing what), only where it falls, but when you spend long enough building on these models, you do get an intuition for which prompts have an outsized effect on where the ball will land.
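In code, the two "ball" moves look something like this. `call_model` is again a hypothetical helper, and the prefix wording is just an example of something you'd arrive at by iterating:

    # Sketch of the two prompt-tuning moves from the analogy: change where the
    # ball starts (prepend a tuned prefix) and send it through the board extra
    # times (a refinement pass before the user sees anything). The board
    # itself (the weights) is untouched. `call_model` is hypothetical.

    TUNED_PREFIX = "Answer step by step, and state any assumptions explicitly.\n\n"

    def answer(question, call_model, refine_passes=1):
        draft = call_model(TUNED_PREFIX + question)      # different starting slot
        for _ in range(refine_passes):                   # extra trips through the board
            draft = call_model(
                TUNED_PREFIX
                + "Question: " + question + "\n"
                + "Draft answer: " + draft + "\n"
                + "Revise the draft, fixing any mistakes."
            )
        return draft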
No, it's not. While GPT-4 (like some but not all other LLMs) is somewhat nondeterministic (even at zero temperature), that doesn’t mean there aren’t things that have predictable effects on the distribution of behavior that can be discovered and leveraged.
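For instance, you can measure those effects directly by re-running the same prompt and tallying the outputs. The sketch below uses the openai Python SDK as an example client (any chat-completion API would do), and "gpt-4" is just a placeholder model name:

    # Rough sketch: estimate the output distribution for a prompt, so you can
    # see whether a prompt change actually shifts it. Assumes the openai
    # Python SDK and an OPENAI_API_KEY in the environment; the model name is
    # a placeholder.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()

    def answer_distribution(prompt, n=20, temperature=0):
        counts = Counter()
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
            )
            counts[resp.choices[0].message.content.strip()] += 1
        return counts

    # Even at temperature 0 you may not get a single spike, but a prompt tweak
    # that genuinely helps shifts the tally in a repeatable direction.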
There’s even a term of art for making a plan up front and then hitting it with a low-skew latent space match: “Chain of Thought”. Yeah, it’s seen numbered lists before.
And if at first you don’t succeed, anneal the temperature and re-roll until you’ve got something that looks authentic.
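Snark aside, that is a real (and pretty mundane) pattern: prepend a chain-of-thought style instruction, then re-sample while stepping the temperature down until something passes a sanity check. A sketch, with `call_model` and `looks_reasonable` as hypothetical stand-ins:

    # Chain-of-thought prompt plus "anneal and re-roll" (sketch).
    # `call_model` and `looks_reasonable` are hypothetical stand-ins for an
    # LLM API call and whatever validation heuristic you trust.

    COT_PREFIX = "Let's think step by step.\n\n"

    def cot_with_rerolls(question, call_model, looks_reasonable,
                         temperatures=(1.0, 0.7, 0.3, 0.0)):
        out = None
        for temp in temperatures:  # anneal the temperature downward across retries
            out = call_model(COT_PREFIX + question, temperature=temp)
            if looks_reasonable(out):
                break
        return out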
You got me beat: IMHO these things are plenty friggin awesome already and getting cooler all the time. I don't see why there is so much ink (and money) being spilled trying to get them to do things more easily done other ways.
Language models are really good at language tasks: summarization, sentiment analysis, borderline-creepy convincing chatbots, writing pretty good fiction at least in short form, the list goes on and on. At all of the traditional NLP stuff they are just super impressive.
They already represent an HCI revolution with significance something like the iPhone as a lower bound: it's a super big deal.
But while the details are absurdly complicated and the super modern ones represent an engineering achievement up there with anything ever done on a computer, they still fundamentally predict some probability-like metric (typically still via softmax [0]) based on some corpus of tokenized language (typically still via byte-pair [1]).
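To put a number on "probability-like metric": the model's final layer produces one score (logit) per vocabulary token, and softmax turns those scores into a distribution over the next token. A toy example with made-up numbers:

    # Toy softmax over a four-token vocabulary. Logits are made up; the point
    # is just that the raw scores become a probability distribution that the
    # sampler then draws the next token from.
    import math

    vocab  = ["the", " cat", " sat", "</s>"]
    logits = [2.0, 0.5, 1.0, -1.0]           # raw scores from the final layer

    exps  = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]    # softmax

    for tok, p in zip(vocab, probs):
        print(f"{tok!r}: {p:.3f}")
    # 'the': 0.609, ' cat': 0.136, ' sat': 0.224, '</s>': 0.030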
And when the corpus has a bunch of conversations in it? Great at generating conversations! And when the corpus has some explanations of logical reasoning? Often passably good at looking logical. And when the corpus has short stories, novellas, and novels featuring conversations between humans and science-fiction AIs? Well they can sample from that too.
But imitating William Gibson doesn't make GPT-4 any kind of sentient any more than it makes me a once-in-a-generation science fiction author.
“Real motive problem, with an AI. Not human, see?”
“Well, yeah, obviously.”
“Nope. I mean, it’s not human. And you can’t get a handle on it. Me, I’m not human either, but I respond like one. See?”
“Wait a sec,” Case said. “Are you sentient, or not?”
“Well, it feels like I am, kid, but I’m really just a bunch of ROM. It’s one of them, ah, philosophical questions, I guess...” The ugly laughter sensation rattled down Case’s spine. “But I ain’t likely to write you no poem, if you follow me. Your AI, it just might. But it ain’t no way human.” [2]