
I agree with @ActivePattern, and thank you for helping to answer.

Supplement for @f_devd:

During training, the K outputs share the stem feature from the NN blocks, so generating the K outputs costs only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore incurs a negligible cost and is not comparable to discarding K-1 MoE experts (which would be very expensive).
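To make that concrete, here is a minimal PyTorch sketch of the idea (illustrative only: the class name, argument names, and loss handling are my own, not code from the DDN repo):

```python
import torch
import torch.nn as nn

class DiscreteDistributionLayer(nn.Module):
    """Sketch of a DDL-style head: K candidates share one stem feature,
    so producing all K costs only one extra 1x1 convolution."""

    def __init__(self, stem_channels: int, out_channels: int, K: int):
        super().__init__()
        self.K, self.out_channels = K, out_channels
        # One cheap 1x1 conv emits all K candidate outputs at once.
        self.heads = nn.Conv2d(stem_channels, K * out_channels, kernel_size=1)

    def forward(self, stem_feat: torch.Tensor, target: torch.Tensor):
        B, _, H, W = stem_feat.shape
        cands = self.heads(stem_feat).view(B, self.K, self.out_channels, H, W)
        # L2 distance from each candidate to the training target.
        d = ((cands - target.unsqueeze(1)) ** 2).flatten(2).mean(dim=2)  # (B, K)
        idx = d.argmin(dim=1)                    # winning candidate per sample
        chosen = cands[torch.arange(B), idx]     # keep 1, discard K-1 (cheap)
        loss = d.gather(1, idx[:, None]).mean()  # loss only on the winner
        return chosen, idx, loss
```

The K-1 discarded tensors never had deep branches of their own, which is why dropping them is nothing like dropping K-1 MoE experts.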


In DDN, 1×1 convolutions are used only in the output layers of the Discrete Distribution Layer (DDL). The NN blocks between DDLs, which supply the fundamental computational power and parameter count, adopt standard 3×3 convolutions.

Was there a specific reason for this choice?

A 1×1 convolution is the most lightweight operator for transforming features into outputs.

A 3×3 convolution is the most common operator for providing basic computational power.
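A quick parameter count shows the gap (plain PyTorch, nothing DDN-specific):

```python
import torch.nn as nn

c = 64
conv3 = nn.Conv2d(c, c, kernel_size=3, padding=1)  # NN-block operator
conv1 = nn.Conv2d(c, c, kernel_size=1)             # DDL output-layer operator
n3 = sum(p.numel() for p in conv3.parameters())    # 64*64*9 + 64 = 36,928
n1 = sum(p.numel() for p in conv1.parameters())    # 64*64*1 + 64 = 4,160
print(n3 / n1)  # ~8.9: the 3x3 conv carries roughly 9x the parameters and FLOPs
```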


We provide the source code and weights along with a Docker environment to facilitate reproducing the experimental results. The original paper’s EXPERIMENTS section mentions the hardware configuration (8× RTX 2080 Ti).

Impressive setup :)

Thank you very much for your interest.

1. The comparison with GANs and the issue of mode collapse are addressed in Q2 at the end of the blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...

2. Regarding scalability, please see “Future Research Directions” in the same blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...

3. Answers to any other questions, along with relevant explanations, can be found directly in the original paper (https://arxiv.org/abs/2401.00036), so I won’t restate them here.


I believe it is the novelty. Here I would like to quote Reviewer r4YK’s original words:

> Many high rated papers would have been done by someone else if their authors never published them or were rejected. However, if this paper is not published, it is not likely that anyone would come up with this approach. This is real publication value. I am reminding again the original diffusion paper from 2015 (Sohl-Dickstein) that was almost not noticed for 5 years. Had it not been published, would we have had the amazing generative models we have today?

Quoted from: https://openreview.net/forum?id=xNsIfzlefG&noteId=Dl4bXmujh1

Besides, we compared DDN with other approaches, including VQ-VAE, in Table 1 of the original paper.


Thank you for your appreciation. I will post updates on future work on both GitHub and Twitter.

https://github.com/DIYer22 https://x.com/diyerxx


I believe DDN is exceptionally well-suited to the “generative models for discriminative tasks” paradigm for object detection.

Much like DiffusionDet, which applies diffusion models to detection, DDN can adopt the same philosophy. I expect DDN to offer several advantages over diffusion-based approaches (see the sketch after this list):

- Single forward pass to obtain results, no iterative denoising required.
- If multiple samples are needed (e.g., for uncertainty estimation), DDN can directly produce multiple outputs in one forward pass.
- Easy to impose constraints during generation due to DDN's Zero-Shot Conditional Generation capability.
- DDN supports more efficient end-to-end optimization, thus more suitable for integration with discriminative models and reinforcement learning.
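On the constraint point: Zero-Shot Conditional Generation boils down to replacing the training-time L2 selection rule with an arbitrary scoring rule. The helper below and the inpainting score are my own illustrative names, not API from the DDN repo:

```python
import torch

def zero_shot_conditional_select(cands: torch.Tensor, score_fn):
    """Pick, among one DDL's K candidates (K, C, H, W), the one that
    best satisfies an arbitrary condition (higher score = better)."""
    scores = torch.stack([score_fn(c) for c in cands])
    return cands[scores.argmax()]

# Example condition: agree with the known pixels of a partially observed image.
def inpainting_score(cand, observed, known_mask):
    return -(((cand - observed) * known_mask) ** 2).sum()

# Usage: pick = zero_shot_conditional_select(
#     cands, lambda c: inpainting_score(c, observed, known_mask))
```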


Yep, the mental model I have from a cursory read of the paper is "generative decision tree".

No, DDN and VQ-VAE are clearly different.

Similarities:

- Both map data to a discrete latent space.

Differences:

- VQ-VAE needs an external prior over code indices (e.g., PixelCNN or a hierarchical prior) to model the distribution. DDN builds its own hierarchical discrete distribution and can even act as the prior for a VQ-VAE-like system.
- DDN’s K outputs are features that change with the input; VQ-VAE’s codebook is a set of independent parameters (embeddings) that remain fixed regardless of the input.
- VQ-VAE produces a 2-D grid of code indices; DDN yields a 1-D, tree-structured latent.
- VQ-VAE needs a Straight-Through Estimator (see the sketch below); DDN does not.
- DDN supports zero-shot conditional generation.

So I’d call them complementary rather than “80% the same.” (See the paper’s “Connections to VQ-VAE” section.)
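For the STE point, here is the standard VQ-VAE quantization trick in miniature (a generic sketch, not code from either project). DDN needs nothing like this because its loss is applied directly to a selected network output:

```python
import torch

def vq_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (B, D) encoder outputs; codebook: (N, D) fixed embedding table."""
    idx = torch.cdist(z, codebook).argmin(dim=1)  # nearest code per vector
    z_q = codebook[idx]
    # Straight-through estimator: forward pass uses z_q, but the gradient
    # flows to z, because the argmin lookup itself is non-differentiable.
    return z + (z_q - z).detach(), idx
```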


The first version of DDN was developed in less than three months, almost entirely by one person. Consequently, the experiments were preliminary and the results far from SoTA.

The current research goal is scaling up. Here are some thoughts on future directions in the blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...


Yes, it's absolutely possible: just as diffusion LLMs work, we can do the same with DDN LLMs.

I made an initial attempt to combine [DDN with GPT](https://github.com/Discrete-Distribution-Networks/Discrete-D...), aiming to remove tokenizers and let LLMs directly model binary strings. In each forward pass, the model adaptively adjusts the byte length of generated content based on generation difficulty (naturally supporting speculative sampling).


This is what I find most impressive: it's a natural hierarchical method which seems so general, yet is actually quite competitive. I feel like the machine learning community has been looking for that for a long time. Non-generative uses (like hierarchical embeddings, maybe? Making Dewey Decimal-like embeddings for anything!) are even more exciting.

Exactly! The paragraph on Efficient Data Compression Capability in the original paper also highlights:

> To our knowledge, Taiji-DDN is the first generative model capable of directly transforming data into a semantically meaningful binary string which represents a leaf node on a balanced binary tree.

This property excites me just as much.
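A toy illustration of that property, assuming K = 2 at every DDL (variable names are mine):

```python
# With K = 2, the winner index at each layer is one bit; the whole
# sequence of choices is the path to a leaf of a balanced binary tree.
layer_choices = [0, 1, 1, 0, 1]                # argmin index at each of 5 DDLs
code = "".join(str(b) for b in layer_choices)  # "01101"
leaf = int(code, 2)                            # leaf 13 of 2**5 = 32 leaves
print(code, leaf)
```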


This sounds a bit like H-Net [1] or Byte Latent Transformer [2].

1: https://arxiv.org/abs/2507.07955

2: https://arxiv.org/abs/2412.09871


It does seem that way — we’re both trying to overcome the limitations imposed by LLM tokenization to achieve a truly end-to-end model.

And their work is far more polished; I’ve only put together a quick GPT+DDN proof of concept.

Thank you for sharing.


I vouched for this comment. Your account seems to be shadowbanned, but your recent comments look fine to me, so you may want to email dang to revoke that status.

Thanks. I sent an email.
