
This is a very bad idea for image models. They pick up and amplify imperceptible distortions in images that no human reviewer would catch... to say nothing of the big ones, when the output is straight-up erroneous.

This may apply to text too.

Partially or fully synthetic data is OK when finetuning existing LLMs. I personally discovered it's not OK for finetuning ESRGAN. Not sure about diffusion models.
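
To make the amplification point concrete, here's a toy sketch (my own, purely illustrative: a Gaussian stands in for the data distribution, not any actual image model). Refitting to samples from the previous generation's fit compounds estimation errors that would be invisible in any single generation:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0   # stand-in for the real data distribution
    n = 100                # finite training sample per generation

    for gen in range(1, 301):
        # each generation trains only on the previous generation's output
        samples = rng.normal(mu, sigma, n)
        # refitting these two summary stats plays the role of "training"
        mu, sigma = samples.mean(), samples.std()
        if gen % 50 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f}  sigma={sigma:.3f}")

    # sigma tends to shrink generation over generation and mu takes a random
    # walk away from 0; no single generation's error would look wrong on its own.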



> Not sure about diffusion models.

Diffusion models are still approximate density estimators, not explicit ones. They lose information because you don't have a unique mapping to the subsequent step. You have to think about the relationship between your image and its preimage.

So while they have better distribution coverage than GANs, they still aren't reliable for dataset synthesis. They are better than GANs for it, though (GANs are very mean-focused, which is why we got such high-quality images out of them, but also why we see huge diversity issues and amplification of biases).
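
A minimal numerical sketch of the non-unique-preimage point (my own illustration, assuming a standard DDPM-style forward step x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps; the numbers are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1024        # pretend this is a flattened image
    a_bar = 0.5     # cumulative noise-schedule term at some step t

    x0_a = rng.normal(0, 1, d)   # "image" A
    x0_b = rng.normal(0, 1, d)   # a different "image" B

    eps_a = rng.normal(0, 1, d)
    x_t = np.sqrt(a_bar) * x0_a + np.sqrt(1 - a_bar) * eps_a  # noised version of A

    # the very same x_t is reachable from B with this (legitimate) noise draw:
    eps_b = (x_t - np.sqrt(a_bar) * x0_b) / np.sqrt(1 - a_bar)

    print(np.allclose(np.sqrt(a_bar) * x0_b + np.sqrt(1 - a_bar) * eps_b, x_t))  # True
    print(eps_b.std())  # ~1.7: a less typical draw than eps_a, but not impossible

    # many preimages are consistent with x_t, so the reverse model can only learn
    # an approximate density over them, not an exact inverse mapping.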


> Not sure about diffusion models.

Human-curated synthetic data is commonly used in finetuning (or LoRA training) for SD. I doubt that uncurated synthetic data would be very usable. There might be use cases where curating synthetic data with some kind of vision model would be valuable, but my intuition is that it would be largely hit-or-miss and hard to predict.
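
For what it's worth, a crude version of that vision-model curation could look like the sketch below (my own, assuming the Hugging Face transformers CLIP implementation; the model name and keep_fraction are arbitrary choices): score each synthetic image against its own prompt and keep only the best-aligned slice before finetuning on it.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, prompt: str) -> float:
        # higher logit = better image/text alignment according to CLIP
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        return out.logits_per_image.item()

    def curate(pairs, keep_fraction=0.5):
        # pairs: list of (PIL image, generation prompt); keep the best-scoring slice
        scored = sorted(pairs, key=lambda p: clip_score(*p), reverse=True)
        return scored[: max(1, int(len(scored) * keep_fraction))]

A filter like this catches obvious prompt/image mismatches, but not the subtle artifacts discussed upthread, which is roughly where the hit-or-miss caveat bites.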



