
And further down the page:

"The camera follows behind a white vintage SUV with a black roof": The letters clearly wobble inconsistently.

"A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast": The woman in the white dress in the bottom left suddenly splits into multiple people like she was a single cell microbe multiplying.



Sure, but think what it will be capable of two papers ahead :)


Progress in this field has not been linear, though. So it's quite possible that two papers ahead we are still in the same place.


On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that videos and images are first compressed into abstracted concepts in a latent space, and that latent representation is then cut into patches; the patches act as the tokens, so a transformer can operate on them directly. The point is that there is plenty of room for optimization following the first demonstration of a new architecture. (A rough sketch of the patch step is below the links.)

Edit: sorry, it’s not the first diffusion transformer. That would be [2].

[1] https://openai.com/research/video-generation-models-as-world...

[2] https://arxiv.org/abs/2212.09748
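Here's a rough sketch of the patch step, assuming a simple cubic spacetime patch. All shapes and names are my guesses for illustration, not OpenAI's actual implementation:

    # Illustrative "patchification": a compressed latent video is cut into
    # spacetime patches, and each patch is flattened into one token for the
    # transformer. Shapes here are assumptions, not OpenAI's actual code.
    import torch

    def patchify(latent: torch.Tensor, patch: int = 2) -> torch.Tensor:
        """Turn a latent video (C, T, H, W) into a sequence of patch tokens.

        Each token is a flattened patch x patch x patch spacetime cube, so the
        sequence length scales with duration and resolution.
        """
        c, t, h, w = latent.shape
        assert t % patch == 0 and h % patch == 0 and w % patch == 0
        x = latent.reshape(c, t // patch, patch, h // patch, patch, w // patch, patch)
        # Group the three patch-grid axes together, then flatten each cube.
        x = x.permute(1, 3, 5, 0, 2, 4, 6)      # (T', H', W', C, p, p, p)
        return x.reshape(-1, c * patch ** 3)    # (num_tokens, token_dim)

    # E.g. a 16-frame, 32x32 latent with 4 channels -> 8*16*16 = 2048 tokens.
    tokens = patchify(torch.randn(4, 16, 32, 32))
    print(tokens.shape)  # torch.Size([2048, 32])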



I think that explanation is misleading: the role of the diffusion network is completely absent from it.
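Roughly, the transformer replaces the U-Net as the denoiser inside an otherwise ordinary diffusion loop: it takes noisy patch tokens plus a timestep and predicts the noise. A minimal sketch of one training step, assuming a standard DDPM-style objective (the model and noise schedule are illustrative placeholders):

    # Minimal sketch of the diffusion side, assuming a standard DDPM-style
    # objective: the transformer is the denoiser, operating on patch tokens.
    import torch

    def diffusion_training_step(model, tokens: torch.Tensor, num_steps: int = 1000):
        """One DDPM-style training step on a (num_tokens, dim) token sequence."""
        betas = torch.linspace(1e-4, 0.02, num_steps)      # linear noise schedule
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

        t = torch.randint(0, num_steps, (1,))              # random timestep
        noise = torch.randn_like(tokens)
        # Forward process: mix clean tokens with Gaussian noise at level t.
        x_t = alphas_cumprod[t].sqrt() * tokens + (1 - alphas_cumprod[t]).sqrt() * noise

        pred_noise = model(x_t, t)                         # transformer as denoiser
        return torch.nn.functional.mse_loss(pred_noise, noise)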


Hold on to your papers~


It’s not perfect, for sure. But maybe this isn’t the final pinnacle of the tech?



