I agree with @ActivePattern and thank you for your help in answering.
Supplement for @f_devd:
During training, the K outputs share the stem feature from the NN blocks, so generating the K outputs costs only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore incurs a negligible cost and is not comparable to discarding K-1 MoE experts (which would be very expensive).
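A minimal numpy sketch of this shared-stem design (the names, shapes, and the per-pixel linear stand-ins for the convolutions are my own simplifications, not the DDN code): the K candidates all come from one cheap output head applied to a shared stem feature, and L2-distance sampling keeps only the closest candidate.

```python
import numpy as np

def ddl_forward(x, W_stem, W_heads, target, K):
    """Toy Discrete Distribution Layer step.

    x:       (C, H, W) input feature map
    W_stem:  (C, C)   stand-in for the heavy NN block (DDN uses 3x3 convs here)
    W_heads: (K*3, C) cheap 1x1-conv-style output layer making all K candidates
    target:  (3, H, W) training target
    """
    feat = np.einsum('dc,chw->dhw', W_stem, x)       # shared stem feature
    outs = np.einsum('kc,chw->khw', W_heads, feat)   # all K candidates at once
    _, H, W = outs.shape
    outs = outs.reshape(K, 3, H, W)
    # L2-distance sampling: keep the candidate closest to the target;
    # discarding the other K-1 candidates costs almost nothing.
    d = ((outs - target[None]) ** 2).reshape(K, -1).mean(axis=1)
    idx = int(d.argmin())
    return outs[idx], idx
```

The point of the sketch is the cost asymmetry: `W_stem` (the NN block) dominates the compute, while the K-way output head is a single cheap linear map.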
1. In DDN, 1×1 convolutions are used only in the output layers of the Discrete Distribution Layer (DDL). The NN blocks between DDLs, which supply the bulk of the computation and parameters, use standard 3×3 convolutions.
2. We provide the source code and weights, along with a Docker environment, to make the experimental results easy to reproduce. The EXPERIMENTS section of the original paper lists the hardware configuration (8× RTX 2080 Ti).
3. Answers or relevant explanations to any other questions can be found directly in the original paper (https://arxiv.org/abs/2401.00036), so I won't restate them here.
I believe it is the novelty. Here I would like to quote Reviewer r4YK’s original words:
> Many high rated papers would have been done by someone else if their authors never published them or were rejected. However, if this paper is not published, it is not likely that anyone would come up with this approach. This is real publication value. I am reminding again the original diffusion paper from 2015 (Sohl-Dickstein) that was almost not noticed for 5 years. Had it not been published, would we have had the amazing generative models we have today?
I believe DDN is exceptionally well-suited to the “generative models for discriminative tasks” paradigm for object detection.
Much like DiffusionDet, which applies diffusion models to detection, DDN can adopt the same philosophy.
I expect DDN to offer several advantages over diffusion-based approaches:
- Single forward pass to obtain results, no iterative denoising required.
- If multiple samples are needed (e.g., for uncertainty estimation), DDN can directly produce multiple outputs in one forward pass.
- Easy to impose constraints during generation due to DDN's Zero-Shot Conditional Generation capability.
- DDN supports more efficient end-to-end optimization, making it better suited to integration with discriminative models and reinforcement learning.
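To illustrate the multi-sample and zero-shot-conditioning bullets with a toy sketch (the candidates here are random stand-ins, not real DDN outputs): a single forward pass yields K candidates, their spread gives a free uncertainty estimate, and a condition can be imposed at sampling time simply by scoring the candidates against it.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 8, 16

# Stand-in for the K outputs of one DDN forward pass.
candidates = rng.normal(size=(K, D))

# Uncertainty estimate for free: per-dimension spread of the K candidates.
uncertainty = candidates.std(axis=0)

# Zero-shot conditional generation: no retraining; score each candidate
# against an external condition (here: match the first 4 dims of a reference).
condition = rng.normal(size=D)
mask = np.zeros(D)
mask[:4] = 1.0
scores = (((candidates - condition) * mask) ** 2).sum(axis=1)
chosen = candidates[scores.argmin()]
```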
Similarities:
- Both map data to a discrete latent space.
Differences:
- VQ-VAE needs an external prior over code indices (e.g., PixelCNN or a hierarchical prior) to model the data distribution. DDN builds its own hierarchical discrete distribution and can even act as the prior for a VQ-VAE-like system.
- DDN’s K outputs are features that change with the input; VQ-VAE’s codebook is a set of independent parameters (embeddings) that remain fixed regardless of the input.
- VQ-VAE produces a 2-D grid of code indices; DDN yields a 1-D/tree-structured latent.
- VQ-VAE needs the straight-through estimator to backpropagate through its codebook lookup; DDN's sampling requires no such gradient trick.
- DDN supports zero-shot conditional generation.
So I'd call them complementary rather than "80% the same." (See the paper's "Connections to VQ-VAE" section.)
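A toy contrast of the two discretizations (all names and shapes are hypothetical): the VQ-VAE codebook is one fixed table shared by every input, while DDN's K candidates are computed from the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 16, 8

# VQ-VAE: one fixed codebook of K embeddings, identical for every input.
codebook = rng.normal(size=(K, C))

def vq_quantize(z):
    """Nearest-codebook-entry lookup (training this needs a straight-through
    estimator, since argmin has no gradient)."""
    d = ((codebook - z) ** 2).sum(axis=1)
    k = int(d.argmin())
    return codebook[k], k

# DDN: K candidate outputs computed from the input, so they change per input.
W_heads = rng.normal(size=(K, C, C)) / np.sqrt(C)  # hypothetical candidate heads

def ddn_candidates(z):
    return np.einsum('kdc,c->kd', W_heads, z)
```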
The first version of DDN was developed in less than three months, almost entirely by one person. Consequently, the experiments were preliminary and the results far from SoTA.
Yes, it's absolutely possible: just as diffusion LLMs adapt diffusion models to language, we can do the same with DDN LLMs.
I made an initial attempt to combine [DDN with GPT](https://github.com/Discrete-Distribution-Networks/Discrete-D...), aiming to remove tokenizers and let LLMs model binary strings directly. In each forward pass, the model adaptively adjusts the byte length of the generated content based on generation difficulty (which naturally supports speculative sampling).
This is what I find most impressive: it's a natural hierarchical method that seems so general, yet is actually quite competitive. I feel like the machine learning community has been looking for that for a long time. Non-generative uses (like hierarchical embeddings, maybe? Making Dewey-Decimal-like embeddings for anything!) are even more exciting.
Exactly! The paragraph on Efficient Data Compression Capability in the original paper also highlights:
> To our knowledge, Taiji-DDN is the first generative model capable of directly transforming data into a semantically meaningful binary string which represents a leaf node on a balanced binary tree.
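A tiny sketch of that property (my own illustration, not code from the paper): with K = 2 candidates per layer, the index chosen at each of L layers contributes one bit, and the concatenated bits name a leaf of a depth-L balanced binary tree.

```python
def bits_to_leaf(bits):
    """Map a binary string (the sequence of per-layer choices, K = 2)
    to its leaf index in a balanced binary tree of depth len(bits)."""
    leaf = 0
    for b in bits:
        leaf = leaf * 2 + b   # descend left (0) or right (1)
    return leaf

# e.g. the choices [1, 0, 1] over 3 layers select leaf 5 of the 8 leaves
```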
I vouched for this comment. Your account seems to be shadow-banned, but your recent comments look fine to me, so you may want to email dang to revoke that status.