I agree with @ActivePattern and thank you for your help in answering.
Supplement for @f_devd:
During training, the K outputs share the stem feature from the NN blocks, so generating the K outputs costs only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore incurs a negligible cost and is not comparable to discarding K-1 MoE experts (which would be very expensive).
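A minimal numpy sketch of this shared-stem design (the names, shapes, and the per-pixel linear stand-ins for the convolutions are my own simplifications, not the DDN code): the K candidates all come from one cheap output head applied to a shared stem feature, and L2-distance sampling keeps only the closest candidate.

```python
import numpy as np

def ddl_forward(x, W_stem, W_heads, target, K):
    """Toy Discrete Distribution Layer step.

    x:       (C, H, W) input feature map
    W_stem:  (C, C)   stand-in for the heavy NN block (DDN uses 3x3 convs here)
    W_heads: (K*3, C) cheap 1x1-conv-style output layer making all K candidates
    target:  (3, H, W) training target
    """
    feat = np.einsum('dc,chw->dhw', W_stem, x)       # shared stem feature
    outs = np.einsum('kc,chw->khw', W_heads, feat)   # all K candidates at once
    _, H, W = outs.shape
    outs = outs.reshape(K, 3, H, W)
    # L2-distance sampling: keep the candidate closest to the target;
    # discarding the other K-1 candidates costs almost nothing.
    d = ((outs - target[None]) ** 2).reshape(K, -1).mean(axis=1)
    idx = int(d.argmin())
    return outs[idx], idx
```

The point of the sketch is the cost asymmetry: `W_stem` (the NN block) dominates the compute, while the K-way output head is a single cheap linear map.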
1. In DDN, 1×1 convolutions are used only in the output layers of the Discrete Distribution Layer (DDL). The NN blocks between DDLs, which supply the bulk of the computation and parameters, use standard 3×3 convolutions.
2. We provide the source code and weights, along with a Docker environment, to make the experimental results easy to reproduce. The EXPERIMENTS section of the original paper lists the hardware configuration (8× RTX 2080 Ti).
3. Answers or relevant explanations to any other questions can be found directly in the original paper (https://arxiv.org/abs/2401.00036), so I won't restate them here.
I believe it is the novelty. Here I would like to quote Reviewer r4YK’s original words:
> Many high rated papers would have been done by someone else if their authors never published them or were rejected. However, if this paper is not published, it is not likely that anyone would come up with this approach. This is real publication value. I am reminding again the original diffusion paper from 2015 (Sohl-Dickstein) that was almost not noticed for 5 years. Had it not been published, would we have had the amazing generative models we have today?
I believe DDN is exceptionally well-suited to the “generative models for discriminative tasks” paradigm for object detection.
Much like DiffusionDet, which applies diffusion models to detection, DDN can adopt the same philosophy.
I expect DDN to offer several advantages over diffusion-based approaches:
- Single forward pass to obtain results, no iterative denoising required.
- If multiple samples are needed (e.g., for uncertainty estimation), DDN can directly produce multiple outputs in one forward pass.
- Easy to impose constraints during generation due to DDN's Zero-Shot Conditional Generation capability.
- DDN supports more efficient end-to-end optimization, making it better suited to integration with discriminative models and reinforcement learning.
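To illustrate the multi-sample and zero-shot-conditioning bullets with a toy sketch (the candidates here are random stand-ins, not real DDN outputs): a single forward pass yields K candidates, their spread gives a free uncertainty estimate, and a condition can be imposed at sampling time simply by scoring the candidates against it.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 8, 16

# Stand-in for the K outputs of one DDN forward pass.
candidates = rng.normal(size=(K, D))

# Uncertainty estimate for free: per-dimension spread of the K candidates.
uncertainty = candidates.std(axis=0)

# Zero-shot conditional generation: no retraining; score each candidate
# against an external condition (here: match the first 4 dims of a reference).
condition = rng.normal(size=D)
mask = np.zeros(D)
mask[:4] = 1.0
scores = (((candidates - condition) * mask) ** 2).sum(axis=1)
chosen = candidates[scores.argmin()]
```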
Similarities:
- Both map data to a discrete latent space.
Differences:
- VQ-VAE needs an external prior over code indices (e.g., PixelCNN or a hierarchical prior) to model the data distribution. DDN builds its own hierarchical discrete distribution and can even act as the prior for a VQ-VAE-like system.
- DDN’s K outputs are features that change with the input; VQ-VAE’s codebook is a set of independent parameters (embeddings) that remain fixed regardless of the input.
- VQ-VAE produces a 2-D grid of code indices; DDN yields a 1-D/tree-structured latent.
- VQ-VAE needs the straight-through estimator to backpropagate through its codebook lookup; DDN's sampling requires no such gradient trick.
- DDN supports zero-shot conditional generation.
So I'd call them complementary rather than "80% the same." (See the paper's "Connections to VQ-VAE" section.)
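A toy contrast of the two discretizations (all names and shapes are hypothetical): the VQ-VAE codebook is one fixed table shared by every input, while DDN's K candidates are computed from the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 16, 8

# VQ-VAE: one fixed codebook of K embeddings, identical for every input.
codebook = rng.normal(size=(K, C))

def vq_quantize(z):
    """Nearest-codebook-entry lookup (training this needs a straight-through
    estimator, since argmin has no gradient)."""
    d = ((codebook - z) ** 2).sum(axis=1)
    k = int(d.argmin())
    return codebook[k], k

# DDN: K candidate outputs computed from the input, so they change per input.
W_heads = rng.normal(size=(K, C, C)) / np.sqrt(C)  # hypothetical candidate heads

def ddn_candidates(z):
    return np.einsum('kdc,c->kd', W_heads, z)
```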
The first version of DDN was developed in less than three months, almost entirely by one person. Consequently, the experiments were preliminary and the results far from SoTA.
Yes, it's absolutely possible: just as diffusion LLMs adapt diffusion models to language, we can do the same with DDN LLMs.
I made an initial attempt to combine [DDN with GPT](https://github.com/Discrete-Distribution-Networks/Discrete-D...), aiming to remove tokenizers and let LLMs model binary strings directly. In each forward pass, the model adaptively adjusts the byte length of the generated content based on generation difficulty (which naturally supports speculative sampling).
This is what I find most impressive: it's a natural hierarchical method that seems so general, yet is actually quite competitive. I feel like the machine learning community has been looking for that for a long time. Non-generative uses (like hierarchical embeddings, maybe? Making Dewey-Decimal-like embeddings for anything!) are even more exciting.
Exactly! The paragraph on Efficient Data Compression Capability in the original paper also highlights:
> To our knowledge, Taiji-DDN is the first generative model capable of directly transforming data into a semantically meaningful binary string which represents a leaf node on a balanced binary tree.
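A tiny sketch of that property (my own illustration, not code from the paper): with K = 2 candidates per layer, the index chosen at each of L layers contributes one bit, and the concatenated bits name a leaf of a depth-L balanced binary tree.

```python
def bits_to_leaf(bits):
    """Map a binary string (the sequence of per-layer choices, K = 2)
    to its leaf index in a balanced binary tree of depth len(bits)."""
    leaf = 0
    for b in bits:
        leaf = leaf * 2 + b   # descend left (0) or right (1)
    return leaf

# e.g. the choices [1, 0, 1] over 3 layers select leaf 5 of the 8 leaves
```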
I vouched for this comment. Your account seems to be shadow-banned, but your recent comments look fine to me, so you may want to email dang to revoke that status.