I think the achievement here is mostly about generating the image dynamics. For example, if there is a cat in an image, the model understands that cats need to breathe, so the dynamics show the lungs contracting. The paper then covers how to translate the image dynamics and the image itself into a seamless video. I could be wrong though.