
I think this approach isn't ideal because you're representing pixel coordinates as 150x150 = 22,500 unique bins. With only 71k fonts, it's likely that a lot of these bins are never used, especially at the corners. Since you're quantizing anyway, you might as well use a convnet and then trace its output, which would take better advantage of the 2D nature of the pixel data.
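
Just to make the binning concern concrete, here's a minimal sketch of the joint scheme as I understand it (the 150x150 grid is from the post; the helper names are mine):

    GRID = 150  # quantization grid size, taken from the post

    def joint_token(x, y, grid=GRID):
        # One token per point: vocabulary of grid * grid = 22,500 ids.
        # With only ~71k fonts, many ids (especially the ones near the
        # corners) may never appear in the training data.
        return y * grid + x

    def from_joint_token(t, grid=GRID):
        # Invert the mapping back to a quantized (x, y) point.
        return t % grid, t // grid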

This kind of reminds me of DALL-E 1, where the image is represented as 256 image tokens and then generated one token at a time. That approach is the most direct way to adapt a causal-LM architecture, but it clearly didn't make a lot of sense because images don't have a natural top-to-bottom, left-to-right order.

For vector graphics, the closest analogue to pixel-wise convolution would be the Minkowski sum. I wonder if a Minkowski-sum-based diffusion model would work for SVG images.



Thank you for the suggestion. A couple of ML engineers I've spoken with since publishing the blog post also suggested that I try representing the x and y coordinates as separate tokens.
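
If I understand the idea, the separate-token scheme would look roughly like this (just an illustrative sketch, not what the model currently does):

    GRID = 150

    def split_tokens(x, y, grid=GRID):
        # Two tokens per point, each drawn from a vocabulary of only
        # `grid` ids, so every id is seen often, at the cost of a 2x
        # longer sequence. The y ids are offset so the two axes don't
        # share token ids.
        return [x, grid + y]

    def from_split_tokens(tx, ty, grid=GRID):
        # Recover the quantized (x, y) point from the two tokens.
        return tx, ty - grid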


How would the Minkowski sum be used in the diffusion model? Is the idea to look at the Minkowski sum of the prediction and label?


In pixel space, a convnet uses pixel-wise convolutions with a pixel kernel. If you represent a vector image as a polygon, the direct equivalent of a convolution would be the Minkowski sum of the vector image with a polygon kernel.

You could start off with a random polygon and the reverse diffusion process would slowly turn it into a text glyph.
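
To make the analogy concrete, here's a rough sketch of the Minkowski sum of two convex polygons using the standard edge-merging construction (purely illustrative, nothing from the post; the glyph and kernel shapes are made up):

    import numpy as np

    def reorder(poly):
        # Rotate the vertex list so it starts at the bottom-most,
        # left-most vertex (vertices assumed counter-clockwise).
        poly = np.asarray(poly, dtype=float)
        start = np.lexsort((poly[:, 0], poly[:, 1]))[0]
        return np.roll(poly, -start, axis=0)

    def minkowski_sum(P, Q):
        # Minkowski sum of two convex polygons: merge their edge
        # vectors in order of polar angle.
        P, Q = reorder(P), reorder(Q)
        nP, nQ = len(P), len(Q)
        P = np.vstack([P, P[:2]])  # wrap around for cyclic indexing
        Q = np.vstack([Q, Q[:2]])
        out, i, j = [], 0, 0
        while i < nP or j < nQ:
            out.append(P[i] + Q[j])
            d1, d2 = P[i + 1] - P[i], Q[j + 1] - Q[j]
            cross = d1[0] * d2[1] - d1[1] * d2[0]
            if cross >= 0 and i < nP:
                i += 1
            if cross <= 0 and j < nQ:
                j += 1
        return np.array(out)

    # A 4x6 rectangle "glyph" dilated by a 2x2 square polygon kernel
    glyph = np.array([[0, 0], [4, 0], [4, 6], [0, 6]])
    kernel = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])
    print(minkowski_sum(glyph, kernel))  # -> a 6x8 rectangle

This only covers the convolution analogue; how the noising/denoising steps would act on polygons is the open question.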



