And further down the page: "The camera follows behind a white vintage SUV wi...

Yiin · on Feb 16, 2024

Sure, but think what it will be capable of two papers ahead :)

csomar · on Feb 16, 2024

Progress in this field has not been linear, though. So it's quite possible that two papers ahead we are still in the same place.

dr_dshiv · on Feb 16, 2024

On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that videos and images are tokenized into patches, through a process that compresses the video/images into abstracted concepts in latent space. Those patches (image/video concepts in latent space) can then be used with transformers (because patches are the tokens). The point is that there is plenty of room for optimization following the first demonstration of a new architecture.

Edit: sorry, it’s not the first diffusion transformer. That would be [2]

[1] https://openai.com/research/video-generation-models-as-world...

[2] https://arxiv.org/abs/2212.09748
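
To make the "patches are the tokens" idea concrete, here is a minimal sketch of cutting a small video into spacetime patches and flattening each one into a token vector. It is only an illustration under stated assumptions: the `patchify` function and the patch sizes are hypothetical, the toy input stands in for a compressed latent representation rather than raw pixels, and the compression network and the diffusion transformer itself are left out entirely.

```python
def patchify(video, pt, ph, pw):
    """Split a video (nested lists indexed [frame][row][col]) into
    flattened spacetime patches of size pt x ph x pw.

    Hypothetical helper for illustration; not taken from any library.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    tokens = []
    # Walk over the top-left-front corner of every patch.
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Flatten the patch into one vector: this is one "token".
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                tokens.append(patch)
    return tokens


# Toy stand-in for a latent video: 4 frames of 4x4 values,
# where each value encodes its own (frame, row, col) position.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
         for t in range(4)]

tokens = patchify(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

In a real model each flattened patch would presumably be projected to an embedding and given positional information before reaching the transformer; that part is assumed here, not shown.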

koconder · on Feb 16, 2024

Here is an explainer https://towardsdatascience.com/explaining-openai-soras-space...

dr_dshiv · on Feb 18, 2024

I think it is misleading. The role of the diffusion network is completely absent from this explanation.

fennecbutt · on Feb 18, 2024

Hold on to your papers~

brookst · on Feb 16, 2024

It’s not perfect, for sure. But maybe this isn’t the final pinnacle of the tech?