Although what you say about smoothing vs. filtering is true in principle, for conditional generation of the eventual joint distribution, starting from the same condition with an autoregressive vs. a diffusion LLM, it is the smoothing distribution that has less power. In other words, during inference that starts from J given tokens, writing token number K is of course better with diffusion if you are also given some tokens after position K, up to the maximal length N. However, if your input is fixed (tokens up to J) and you have to predict all the additional tokens (from J+1 to N) yourself, you are solving a harder problem and end up with a lower joint probability, at the end of inference, for the full generated sequence from J+1 to N.
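
To make that concrete (notation is mine, just for illustration): infilling targets the smoothing conditional

$$p(x_K \mid x_{1:J}, x_{K+1:N}),$$

whose conditional entropy is no larger than that of the filtering conditional $p(x_K \mid x_{1:J})$, so conditioning on future tokens genuinely helps when those tokens are observed. For open-ended generation from a fixed prefix $x_{1:J}$, however, both model classes must ultimately produce the full conditional joint

$$p(x_{J+1:N} \mid x_{1:J}) = \prod_{k=J+1}^{N} p(x_k \mid x_{1:k-1}),$$

and the chain rule gives the same quantity whatever order the tokens are filled in: the future context that the smoothing conditionals exploit is now generated rather than given, so the typical log-probability of the completed sequence is governed by the (higher) entropy of this joint, not by any single smoothing conditional.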