A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?
As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
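That toy example is small enough to compute exactly. The sketch below assumes a character-level model where those four strings are the only possible outputs (prior 1/4 each, plus an end-of-string token) and a grammar checker with full lookahead; it compares retry-till-valid against greedy token masking:

```python
from fractions import Fraction

OUTPUTS = ["hello world", "food", "hello", "good day"]
PRIOR = {s: Fraction(1, 4) for s in OUTPUTS}
EOS = "\x00"  # sentinel end-of-string "token"

def valid(s):
    return " " in s  # the grammar: output must contain a space

def rejection_dist():
    """Retry-till-valid: keep only valid outputs, renormalize the prior."""
    ok = {s: p for s, p in PRIOR.items() if valid(s)}
    total = sum(ok.values())
    return {s: p / total for s, p in ok.items()}

def constrained_dist(prefix=""):
    """Greedy token masking: at each step, next-char mass comes from the
    prior over consistent strings; chars that cannot reach any valid
    string are masked and the rest renormalized."""
    live = [s for s in OUTPUTS if s.startswith(prefix)]
    steps = {}
    for s in live:
        ch = EOS if len(s) == len(prefix) else s[len(prefix)]
        steps[ch] = steps.get(ch, Fraction(0)) + PRIOR[s]
    def reachable(ch):
        if ch == EOS:
            return valid(prefix)
        return any(valid(s) for s in live if s.startswith(prefix + ch))
    allowed = {ch: p for ch, p in steps.items() if reachable(ch)}
    total = sum(allowed.values())
    dist = {}
    for ch, p in allowed.items():
        w = p / total
        if ch == EOS:
            dist[prefix] = dist.get(prefix, Fraction(0)) + w
        else:
            for s2, q in constrained_dist(prefix + ch).items():
                dist[s2] = dist.get(s2, Fraction(0)) + w * q
    return dist

print(rejection_dist())    # hello world: 1/2, good day: 1/2
print(constrained_dist())  # hello world: 2/3, good day: 1/3
```

The masked decoder never "pays" for the fact that the prefix "h" had a 50% chance of dead-ending in "hello", so "hello world" ends up twice as likely as "good day".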
The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (assuming it succeeds; my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is still fine and dandy, but correlated errors compound quickly in autoregressive models.
As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
In this case (multiple choice generation), if one of the possible outputs does not match the regex, you can just exclude it from generation.
I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless" might really cause a problem. But no luck so far. Does anyone have any ideas? This could potentially be an interesting research question.
An example from an earlier comment of mine on a different thread (assuming I've understood correctly):
> let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.
That said, you can use beam search to more or less solve this problem by evaluating the joint probability of all tokens in each branch of the grammar and picking the one with the highest probability (you might need some more nuance for free-form strings where the LLM can do whatever it wants and be "valid").
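A minimal sketch of that idea, with an invented character-level conditional model (`toy_cond_prob` and its probabilities are made up for illustration): score each grammar branch by its joint log-probability, then pick the best, instead of committing token by token:

```python
import math

def score_branch(branch, cond_prob):
    """Joint log-probability of emitting `branch` one character at a time.
    A real system would sum token-level logprobs from the LLM instead."""
    return sum(math.log(cond_prob(branch[:i], ch)) for i, ch in enumerate(branch))

def pick_branch(branches, cond_prob):
    # Evaluate every grammar branch jointly instead of committing greedily.
    return max(branches, key=lambda b: score_branch(b, cond_prob))

# Invented conditional model: it strongly "wants" to start with "v"
# (intending "very_healthy"), but puts almost no mass on "u" after "very_".
def toy_cond_prob(prefix, ch):
    if prefix == "":
        return {"v": 0.9, "m": 0.1}.get(ch, 0.0)
    if prefix == "very_":
        return {"h": 0.95, "u": 0.05}.get(ch, 0.0)
    return 1.0  # every other character is forced in this toy model

branches = ["very_unhealthy", "moderately_healthy"]
# Greedy decoding takes "v" (p=0.9) and gets pigeonholed into "very_unhealthy"
# (joint p = 0.9 * 0.05 = 0.045 vs 0.1); joint scoring avoids the trap.
print(pick_branch(branches, toy_cond_prob))  # moderately_healthy
```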
This is a concern of mine, as well as limiting the amount that an LLM can talk through a problem - sometimes to nothing. Getting them to work through things IMO dramatically improves their output.
My gut feeling is that taking the output and, if it's broken, then fixing it would have a better result - you could even then completely limit the output to only valid JSON. For your example, if it wrote "very_healthy" and was given an error message explaining that this wasn't an option and that it had to choose from "very_unhealthy" or "moderately_healthy", I would expect a halfway decent model to pick "moderately_healthy".
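A sketch of that repair loop, with `ask_llm` as a hypothetical stand-in for whatever chat API you're calling (the enum values are from the example above):

```python
import json

VALID = {"very_unhealthy", "moderately_healthy"}

def repair_loop(ask_llm, prompt, max_retries=3):
    """Generate freely, validate, and only re-prompt with a specific error
    message when the output breaks the schema. `ask_llm(prompt) -> str`
    is a placeholder for any LLM call."""
    reply = ask_llm(prompt)
    for _ in range(max_retries):
        try:
            value = json.loads(reply)["healthy"]
            if value in VALID:
                return value
            error = f'"{value}" is not an option; choose from {sorted(VALID)}'
        except (json.JSONDecodeError, KeyError) as e:
            error = f"invalid JSON: {e}"
        reply = ask_llm(
            f"{prompt}\nYour previous answer was rejected: {error}\nAnswer again."
        )
    raise ValueError("no valid answer after retries")
```

The point is the error message: the model sees *why* "very_healthy" was rejected, so it can pick the semantically closest valid option instead of being silently steered mid-token.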
This has the benefit of allowing you to use a more powerful model for reasoning (like GPT4) and a local model where you can do this kind of token probability manipulation for just fixing the data.
The multiple choice example was just for tractable computations and illustrative purposes. Pretend the LLM has characters===tokens and is doing autoregressive probability prediction as per usual -- "f"-25%, "h"-50%, "g"-25% to start with, and then appropriate probabilities thereafter to yield that multiple-choice example (plus an <end-of-string> token).
> I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless", which might really cause a problem. But to no luck. Anyone has any idea? This could potentially be an interesting research question.
At one point in the past ChatGPT (at a model probability layer, not just because of the context window issue) was prone to truncating long JSON responses, and if that happened in a long string field then you'd see the observed behavior. An example application:
(-) You're asking the LLM to turn some written podcast description into something machine-readable. You chunk the input, feed each chunk into the model (somehow; ignore the details; they're not important), and turn paragraphs into {speaker_name: str, timestamp: str, content: str} blobs.
(1) The LLM is prone to turning long paragraphs into `{"content": "the beginning of the content...` patterns, using ellipses to indicate that there's more to that JSON object.
(2) If you actually retry till the LLM succeeds, it's leaps and bounds more likely to end that string with a quotation mark if the string has all the original input. I.e., output like `{"content": "the beginning of the content..."}` is comparatively rare.
(3) The article's technique, however, always morphs those truncated JSON blobs into valid JSON. Since the ellipsis is _valid_ at that point (it's inside a string), instead of the vast majority of inputs failing you instead end up with the vast majority succeeding while carrying an incorrect, ellipsis-truncated string.
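To make (3) concrete, here's roughly what "morphing a truncated blob into valid JSON" amounts to. The `force_close` helper is an invented simplification of the end result of grammar-constrained decoding when the model stops mid-string (real constrained decoders work token by token, and this quote-counting ignores escapes):

```python
import json

def force_close(truncated):
    """Append the cheapest characters that make the output grammatically
    valid: close the open string, then close the open braces."""
    out = truncated
    if out.count('"') % 2 == 1:  # inside an unterminated string
        out += '"'
    out += "}" * (out.count("{") - out.count("}"))
    return out

truncated = '{"content": "the beginning of the content...'
fixed = force_close(truncated)
print(fixed)  # {"content": "the beginning of the content..."}
# Parses without complaint -- the truncation is now silently invisible:
print(json.loads(fixed)["content"])
```

The retry approach would have rejected this output outright; the grammar approach hands you schema-valid JSON wrapped around truncated data.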
In general, the LLM does autoregressive completions. Imagine two prefixes P1 and P2, each of which can be completed by classes of data so that P1{G1} adheres to the grammar, P1{F1} fails to adhere to the grammar, P2{G2} succeeds, and P2{F2} fails. With retry-till-passing-grammar the weighted probabilities are:
P1{G1}: Chance[P1] Chance[G1 | P1]
P2{G2}: Chance[P2] Chance[G2 | P2]
Whereas the weighted probabilities produced by the technique are:
P1{G1}: Chance[P1]
P2{G2}: Chance[P2]
In both cases you'd need to divide by the total probability, but the conditional factors Chance[G | P] are both important and notably absent from the second pair. For very simple schemas like {sentiment: "positive"|"negative"|"neutral"} the results might be similar, but nothing in the idea of a greedy token filter forces that constraint.
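The two weighted distributions can be computed side by side. The numbers below are invented purely to show how far apart they drift when Chance[G1 | P1] is small:

```python
from fractions import Fraction as F

def retry_dist(prefixes):
    """Chance[prefix] * Chance[valid completion | prefix], renormalized --
    what retry-till-passing-grammar yields."""
    w = {p: pr * cond for p, (pr, cond) in prefixes.items()}
    total = sum(w.values())
    return {p: v / total for p, v in w.items()}

def masked_dist(prefixes):
    """Chance[prefix] alone, renormalized -- what greedy token masking
    yields, since every prefix is forced into *some* valid completion."""
    total = sum(pr for pr, _ in prefixes.values())
    return {p: pr / total for p, (pr, _) in prefixes.items()}

# prefix -> (Chance[prefix], Chance[valid completion | prefix])
prefixes = {"P1": (F(1, 2), F(1, 100)), "P2": (F(1, 2), F(1, 2))}
print(retry_dist(prefixes))   # P1: 1/51,  P2: 50/51
print(masked_dist(prefixes))  # P1: 1/2,   P2: 1/2
```

A prefix that almost never ends well (P1) goes from ~2% of retry-sampled outputs to a full half of masked outputs.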