Anybody able to get the "View Technical Report" button at the bottom to do anything? I was curious to glean more details but it doesn't work on either of my devices.
I'm curious what level of detail they're comfortable publishing around this, or are they going full secret mode?
>Instead of generating tokens one at a time, a dLLM produces the full answer at once. The initial answer is iteratively refined through a diffusion process, where a transformer suggests improvements for the entire answer at once at every step. In contrast to autoregressive transformers, the later tokens don’t causally depend on the earlier ones (leaving aside the requirement that the text should look coherent). For an intuition of why this matters, suppose that a transformer model has 50 layers and generates a 500-token reasoning trace, the final token of this trace being the answer to the question. Since information can only move vertically and diagonally inside this transformer and there are fewer layers than tokens, any computations made before the 450th token must be summarized in text to be able to influence the final answer at the last token. Unless the model can perform effective steganography, it had better output tokens that are genuinely relevant for producing the final answer if it wants the performed reasoning to improve the answer quality. For a dLLM generating the same 500-token output, the earlier tokens have no such causal role, since the final answer isn’t autoregressively conditioned on the earlier tokens. Thus, I’d expect it to be easier for a dLLM to fill those tokens with post-hoc rationalizations.
>Despite this, I don’t expect dLLMs to be a similarly negative development as Huginn or COCONUT would be. The reason is that in dLLMs, there’s another kind of causal dependence that could prove to be useful for interpreting those models: the later refinements of the output causally depend on the earlier ones. Since dLLMs produce human-readable text at every diffusion iteration, the chains of uninterpretable serial reasoning aren’t that deep. I’m worried about the text looking like gibberish at early iterations and the reasons behind the iterative changes the diffusion module makes to this text being hard to explain, but the intermediate outputs nevertheless have the form of human-readable text, which is more interpretable than long series of complex matrix multiplications.
Based solely on the above, my armchair analysis is that it seems like it's not strictly diffusion in the Langevin diffusion/denoising sense (since there are discrete iteration rounds), but instead borrows the idea of "iterative refinement". You drop the causal masking and token-by-token autoregressive generation, and instead start with a full draft and propose a set of edits to the whole thing at each step? On one hand, dropping the causal masking over the token sequence means you no longer have an objective that forces the LLM to learn a representation sufficient to "predict" in the usual sense, but on the flip side there is now a sort of causal masking over _time_, since each iteration depends on the previous one. It's a neat tradeoff.
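To make that contrast concrete, here's a toy Python sketch of the two decoding regimes. To be clear, this is not how their actual model works; `propose_token` and `propose_edits` are made-up stand-ins for the real networks, just to show where the causal dependence sits in each case.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "[MASK]"

def propose_token(prefix):
    # Stand-in for an autoregressive model: the next token is conditioned
    # only on the prefix generated so far.
    return random.choice(VOCAB)

def propose_edits(draft):
    # Stand-in for a dLLM-style denoiser: proposes a replacement for every
    # position at once, conditioned on the entire current draft.
    return [random.choice(VOCAB) if (t == MASK or random.random() < 0.2) else t
            for t in draft]

def autoregressive_decode(length=8):
    out = []
    for _ in range(length):
        out.append(propose_token(out))   # token i causally depends on tokens < i
    return out

def iterative_refinement_decode(length=8, steps=5):
    draft = [MASK] * length              # start from a fully masked/noisy draft
    for _ in range(steps):
        draft = propose_edits(draft)     # step t depends on step t-1,
                                         # but token i does not depend on token i-1
    return draft

print("AR:  ", " ".join(autoregressive_decode()))
print("dLLM:", " ".join(iterative_refinement_decode()))
```

So the "causal masking over time" is just the outer loop: each intermediate draft is human-readable text and the next refinement conditions on it, which is the interpretability hook the quoted post is pointing at.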