
Let's hold our horses. Those are specifically crafted, hand-picked videos, where the only requirement was "write a generic prompt and pick something that looks good." That's very different from the actual process, where you have a very specific idea and want the machine to make it happen.

The DALL-E presentation also looked cool, and everyone was stoked about it. Now that we know its limitations and oddities? YMMV, but I'd say not so much - Stable Diffusion is still the go-to solution. I strongly suspect the same will happen with Sora.



The examples are most certainly cherry-picked. But the problem is there are 50 of them. And even if you gave me 24-hour full access to SVD 1.1/Pika/Runway (anything out there that I can use), I wouldn't be able to get 5 examples that match these in quality (temporal consistency/motion/prompt following) and, more importantly, in length. Maybe I am overly optimistic, but this seems too good.


Credit to OpenAI for including some videos with failures (extra limbs, etc.). I also wonder how closely any of these videos might match one from the training set. Maybe they chose prompts that lined up pretty closely with a few videos that were already in there.


https://twitter.com/sama/status/1758200420344955288

They're literally taking requests and doing them in 15 minutes.


Cool, but see the drastic difference in quality ;)


Lack of quality in the details, yes, but the fact that characters and scenes depict consistent, real movement and evolution, as opposed to the cinemagraph and frame-morphing stuff we have had so far, is still remarkable!


That particular example seems to have more of a "cheap 3D" style to it, but the actual synthesis seems on par with the examples. If the prompt had specified a different style, it'd have that style instead. This kind of generation isn't like actual animating; a "cheap 3D" style and a "realistic cinematic" style take roughly the same amount of work to look right.


Drastic difference in the quality of the prompts, too. The ones used in the OP are mostly quite detailed.


There are absolutely example videos on their website which have worse quality than that.


It has a comedy-like quality lol

But all that said, it is no less impressive after this new demo.


Depends on the quality of the prompts.


The output speed doesn't disprove possible cherry-picking, especially with batch generation.


Who cares? If it can be generated in 15 minutes then it's commercially useful.


Especially if you consider that afterwards you can get feedback and try again... 15 minutes later you have a new one... try again... etc.


What is your point? That they make multiple ones and pick out the best ones? Well duh? That’s literally how the model is going to be used.


Please make your substantive points without swipes. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.


OpenAI people running these prompts have access to way more resources than any of us will through the API.


Looks ready for _Wishbone_


The year is 2030.

Sarah is a video sorter; this was her life. She graduated top of her class in film, and all she could find was the monotonous job of selecting videos that looked just real enough.

Until one day, she couldn't believe it. It was her: a video of her in that very moment, sorting. She went to pause the video, but stopped when her doppelganger did the same.



I got reminded of an even older sci-fi story: https://qntm.org/responsibility


Man, I was looking for this story for a year or so... thanks for sharing


Seems like in about two years I’ll be able to stuff this saved comment into a model and generate this full episode of Black Mirror


> Stable Diffusion is still the go-to solution. I strongly suspect the same thing with Sora.

Sure, for people who want detailed control with AI-generated video, workflows built around SD + AnimateDiff, Stable Video Diffusion, MotionDiff, etc., are still going to beat Sora for the immediate future, and OpenAI's approach structurally isn't as friendly to developing a broad ecosystem adding power on top of the base models.

OTOH, the basic prompt-to-video capability of Sora is now good enough for some uses, and where detailed control is not essential that space is going to keep expanding. One question is how much their plans for safety checking (which they state will apply both to the prompt and to every frame of output) will cripple this versus alternatives, and how much the regulatory environment will or won't make it possible to compete with that.


I suspect given equal effort into prompting both, Sora probably provides superior results.


> I suspect given equal effort into prompting both, Sora probably provides superior results

Strictly on prompting, probably, just as is the case with DALL-E 3 vs., say, SDXL.

The thing is, there’s a lot more that you can do than just tweaking prompting with open models, compared to hosted models that offer limited interaction options.


Generate stock video bits I think.


It doesn't matter if they're cherry-picked when you can't match this quality with SD or Pika regardless of how much time you have.

And I still prefer DALL-E 3 to SD.


In the past the examples tweeted by OpenAI have been fairly representative of the actual capabilities of the model. i.e. maybe they do two or three generations and pick the best, but they aren't spending a huge amount of effort cherry-picking.


Stable Diffusion is not the go-to solution; it's still behind Midjourney and DALL-E.


Would love to see hand-picked videos from competitors that can hold their own against what Sora is capable of.


Look at Sam Altman's Twitter, where he made videos on demand from what people prompted him.


Wrong, this is the first time I've seen an astronaut with a knit cap.


They're not fantastic either if you pay close attention:

There are mini-people in the 2060s market, and in the cat one an extra paw comes out of nowhere.


The woman’s legs move all weirdly too


While Sora might be able to generate short 60-90 second videos, how well it would scale with a larger prompt or a longer video remains to be seen. And the general approach of having the model do 90% of the work for you and then editing what is required might be harder with video.


60 seconds at a time is much more than enough.

Most fictional long-form video (whether live-action movies, cartoons, etc.) is composed of many shots, most of them much shorter than 7 seconds, let alone 60.

I think the main factor in being able to generate a whole movie will be passing some reference images of the characters/places/objects so they remain congruent between generations.

You could already write a whole book with GPT-3 by running a series of one-short-chapter-at-a-time generations and passing in the summary/outline of what's happened so far, roughly as sketched below. (I know I did, in a time that feels like ages ago but was just early last year.)

Why would this be different?
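
For anyone curious, here is a minimal sketch of that chapter-at-a-time loop. It assumes today's OpenAI Python SDK rather than the original GPT-3 completions API, and the model name, outline, word counts, and chapter count are illustrative placeholders, not what the parent actually used:

    # Minimal chapter-at-a-time loop: each call sees only the outline plus a
    # rolling summary, so the context stays small however long the book gets.
    # Assumes the current OpenAI Python SDK; model name, outline, and chapter
    # count are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    MODEL = "gpt-4o-mini"  # placeholder model name

    outline = "A detective story set in a city where every video is synthetic."
    summary = "Nothing has happened yet."
    chapters = []

    for i in range(1, 11):  # ten short chapters, one generation per chapter
        prompt = (
            f"Outline: {outline}\n"
            f"Story so far (summary): {summary}\n"
            f"Write chapter {i} (800-1200 words)."
        )
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system",
                 "content": "You are writing a novel one short chapter at a time."},
                {"role": "user", "content": prompt},
            ],
        )
        chapters.append(resp.choices[0].message.content)

        # Refresh the rolling summary so the next chapter stays consistent
        # with everything written so far; the same idea as passing reference
        # images to keep characters/places congruent between video shots.
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": f"Summarize the story so far in under 300 words:\n"
                           f"{summary}\n\nLatest chapter:\n{chapters[-1]}",
            }],
        )
        summary = resp.choices[0].message.content

    print("\n\n".join(chapters))

The design point is the rolling summary: the prompt never has to carry the full text of everything generated so far, only a compressed state, which is exactly the slot reference images would fill in the video case.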


> I think the main factor in being able to generate a whole movie will be passing some reference images of the characters/places/objects so they remain congruent between generations.

I partly agree with this. The congruency, however, needs to extend to more than two generations. If a single scene is composed of multiple shots, then those shots need to be part of the same world the scene is being shot in. If you check the video with the title `A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.`, the surroundings do not seem to make sense: the view starts with a market, spirals around a point, and then ends with a bridge that does not fit into the market. Unless the different shots the model generates fit together seamlessly, trying to make them fit together is where the difficulty comes in. However, I do not have any experience in video editing, so this is just speculation.


The CGI industry is about to be turned upside down. They charge hundreds of thousands per minute, and it takes them forever to produce the finished product.


You do realize virtually all movies are made up of shots, often lasting no longer than 10 seconds, edited together. Right?


The best films have long takes. Children of Men or Stalker come to mind.


The Copacabana tracking shot in Goodfellas.



