Unfortunately the intuition and the math proofs so far suggest that autoregressi...

niemandhier · 2025-05-01T07:21:43 1746084103

This is tremendously interesting!

Could you point me to some literature? Especially regarding mathematical proofs of your intuition?

I’d like to recalibrate my priors to align better with current research results.

GistNoesis · 2025-05-01T11:32:31 1746099151

From the mathematical point of view the literature is about the distinction between a "filtering" distribution and a "smoothing" distribution. The smoothing distribution is strictly more powerful.

In theory intuitively the smoothing distribution has access to all the information that the filtering distribution has and some additional information therefore has a minimum lower than the filtering distribution.

In practice, because the smoothing input space is much bigger, keeping the same number of parameters we may not reach a better score because with diffusion we are tackling a much harder problem (the whole problem), whereas with autoregressive models we are taking a shortcut which happens to probably be one that humans are probably biased too (communication evolved so that it can be serialized to be exchanged orally).

pama · 2025-05-01T19:20:55 1746127255

Although what you say about smoothing vs filtering is true in principle, for conditional generation of the eventual joint distribution starting from the same condition and using an autoregresive vs diffusive LLM, it is the smoothing distribution that has less power. In other words, during inference starting from J tokens and writing token number K is of course better with diffusion if you also have some given tokens after token K and up to the maximal token N. However, if your input is fixed (tokens up to J) and you have to predict those additional tokens (from J+1 to N), you are solving a harder problem and have a lower joint probability at the end of the inference for the full generated sequence from J+1 up to N.

pama · 2025-05-01T09:19:41 1746091181

I am still jetlagged and not sure what the most helpful reference would be. Maybe start from the block diffusion paper I recommended in a parallel thread and trace your way up/down from there. The logic leading to Eq 6 is a special case of such a math proof.

https://openreview.net/forum?id=tyEyYT267x

kmacdough · 2025-05-01T08:33:48 1746088428

What are the barriers to mixed architecture models? Models which could seamlessly pass from autoregressive to diffusion, etc.

Humans can integrate multiple sensory processing centers and multiple modes of thought all at once. It's baked into our training process (life).

pama · 2025-05-01T09:10:14 1746090614

The human processing is still autoregressive, but using multiple parallel synchronized streams. There is no problem with such an approach and my best guess is that in the next year we will see many teams training models using such tricks for generating reasoning traces in parallel.

The main concern is taking a single probabilistic stream (eg a book) and comparing autoregressive modelling of it with a diffusive modelling of it.

Regarding mixing diffusion and autoregressive—I was at ICLR last week and this work is probably relevant: https://openreview.net/forum?id=tyEyYT267x

cchance · 2025-05-01T18:35:11 1746124511

Maybe diffusion for "thoughts" and autoregressive for output :S