
All the vids have that instantly recognizable GenAI "sheen", for lack of a better word. Also, I think the most obvious giveaway is all the micro-variations that happen along the edges, which give a fuzzy artifact.


I assure you that's not enough. These are high quality videos. Once they get uploaded to social media, compression mostly makes imperfections go away. And it's been shown that when people are not expecting AI content, they are much less likely to realize they are looking at AI. I would 100% believe most of these videos were real if caught off guard.


A friend who lives in North Carolina sent me a video of the raging floodwaters in his state - at least that's what the superimposed text claimed it was. When I looked closer, it was clearly an Indian city filled with Indian people and Indian cars. He hadn't noticed anything except the flood water. It reminded me of that famous selective attention test video[1]. I won't ruin it for those who haven't seen it, but it's amazing what details we can miss when we aren't looking for them. I suspect this is made even worse when we're casually viewing videos in a disjointed way, as on social media, and we're not even giving one part of the video our full attention.

1. https://www.youtube.com/watch?v=vJG698U2Mvo


For the entire duration of the Russia/Ukraine war, "combat footage" that is actually from the video game ARMA 3 has gone viral fairly regularly, and now exactly the same thing is happening with Israel/Iran.


And which YouTube happily promotes straight to the top, of course -- thanks to the efforts of its rocket-science algorithm team. (Not sure whether the ones I've been seeing were generated by that particular platform, but YT does seem to promote obviously fake and deceptively labelled "combat" footage with depressing regularity).


The willingness of people to believe that combatants are wearing cinematic body cams for no tactical reason can only be matched by their willingness to assume people meticulously record every minute of their lives just so they can post a once-in-a-lifetime event on TikTok.

Who even needs AI generated videos when you can just act out absurdity and pretend it's real?


As far as I know, most of the viral stuff has been active air defence, CIWS fire and the like, which can be hard to discern.

There's a morbid path from the grainy, shaky footage of the Iraq war and earlier, through IS propaganda (which at the time was basically the most intense combat footage ever released), to the Ukraine war, which took it to the morbid end conclusion of endless drone-kill videos and 30+ minute edited clips covering day-long engagements and defences.

And yes, to answer your belief that there is none: there is loads of cinematic body cam footage out there now.


Thousands of combatants are wearing bodycams, and pretty regularly, there are videos released by Russians of a dead Ukrainian's last moments taken from their corpse and the same happens vice versa.


Dude, I clicked on some random YouTube accounts that were streaming the World Cup live, and it took me a while to realize that they were actually just streaming a video-game replica of the actual match (at least, I think they were simulating the actual match with a video game, but I'm not sure as I didn't compare closely).


I've seen that a bunch of times; there are CGI highlights of most football matches.

I still don't know if it's autogenerated from the original video or recreated manually but yeah it's pretty realistic for the first few seconds.


Someone once did the opposite - streamed a real pay-per-view UFC match on Twitch and pretended it was a game he was playing. It actually worked for a while before the Twitch mods realized what was going on.

https://www.theverge.com/2017/12/4/16732912/ufc-video-stream...


It's kind of sad that we don't even need AI to create misinformation; the bar for what people will fall for is really low.


I've shown videos I made in DCS World to idiots at airport bars and they believed I was the Ghost of Kyiv lmao


People believe false things easily if it confirms their priors. Confirmation bias is strong.

Fake images play into that, but they don't need to be AI generated for that to be true; it's been true forever.


And let's not forget the paper that goes with the video, which has a stellar title: http://www.chabris.com/Simons1999.pdf


hmmm... Maybe it's because I knew it was testing me, but I noticed it right away and got the count right.

I could see it being pretty shocking if I hadn't, but I honestly can't imagine how I'd miss that.


It probably doesn't work if you're primed to look for hidden details. I took the test along with my Psychology 101 class of about 30 people and no one noticed anything amiss.


Once you see it you indeed can't imagine how you could have missed it. Some people see it the first time, but it's a small number of people. This video just demonstrates that humans can only focus on one thing at a time, and when we're multitasking, we're actually doing little parts of different tasks one at a time but very quickly after each other, kind of like a single CPU core. And if we tightly focus our attention on one point, we are not aware of other things that might be relatively close to that point.

That is also how magicians work, drawing your attention to one particular thing, hiding the secret of the trick from you, sometimes even in plain sight, like in the video.

Or pickpockets, who might bump into you and pick your pocket at the same time: your attention is focused on the sudden impact, keeping it away from your wallet being taken.


> hmmm... Maybe it's because I knew it was testing me, but I noticed it right away and got the count right.

> I could see it being pretty shocking if I hadn't, but I honestly can't imagine how I'd miss that.

The point of the video wasn't to count correctly, but to see the gorilla


99% chance the person was playing along for the rest of us, so we get a chance to enjoy the video as intended.


cool, he noticed it right away


I believe them. Why would people lie on the internet?


> I noticed it right away


I was focused on counting. I counted very wrong, but caught the gorilla right away.


If you see text accompanying some content, you can de-prime yourself by saying "nuh-uh, that's exactly what it's fscking not."


I do not see how the examples you mentioned are related to the topic. What does selective attention have to do with the video looking AI generated in all the frames?


Their argument is that if someone is affected by confirmation bias, they likely won’t notice these kinds of details.

Essentially, send me a video of something I care about and I will only look for that thing. Most people are not detectives, and even most would-be detectives aren’t yet experts.


probably people will soon develop a habit of verifying every detail in videos of interest haha


Cause people are well known for verifying every detail in most other forms of media already right?


Verifying what?


Indeed, watching Reels or TikTok videos is an exercise in testing your bullshit meter and commenting accordingly to let the uninformed know: hey, this is most likely fake.


Facebook is mostly this now too. Long comment threads of boomers thanking AI images for their military service or congratulating it on a long marriage.


> it's been shown that when people are not expecting AI content, they are much less likely to realize they are looking at AI.

At this point, looking at a big tech SoMe feed I would expect that everything is, or at least could be, gen AI content.


I regularly catch my kids watching AI generated content and they don't know it.


It's kind of an interesting phenomenon. I read something on this. Basically being born between ~1980 and ~1990 is a huge advantage in tech.


The only generation that ever knew how to set the clock on a VCR: our parents needed our help; our kids won't have even seen a VCR outside of a museum, much less used one.


Very interesting point. I wonder about the generation before and what skills they had to share with their parents, who were most likely traumatised by a world war or two. I remember setting the VCR clock and tuning the new TV with the remote. I'm sure the adults could have figured it out, but they probably got more from seeing their 'smart' kids figuring it out in time for the cartoons!


The parents of those of us who grew up in the 80's and 90's invented the VCR, they could use it just fine.


The Zoomers have the advantage that the bar is pretty low these days.


A surprising amount of it is really popular too. I recently figured out that the Movie Recaps channel was all AI when the generated voice slipped and mispronounced a word in a really unnatural way. They post videos almost daily and they get millions of views. Must be making bank.


A group I follow about hobby/miniatures (as in wargaming miniatures and dioramas) recently shared an "awesome" image of a diorama from another "hobby" group.

The image had all the telltale signs of being AI generated (too much detail, the lights & shadows were the wrong scale, the focus of the lens was odd for the kind of photo, etc). I checked that other group, and sure enough, they claim to be about sharing "miniature dioramas" but all they share is AI-generated crap.

And in the original group, which I'm a member of and is full of people who actually create dioramas -- let's say they are "subject matter experts" -- nobody suspected anything! To them, who are unfamiliar with AI art, the photo was of a real hand-made diorama.


I was watching UFC recaps on Youtube and the algorithm got me onto AI generated MMA content, I watched for a while before realizing it. They were using old videos which were "enhanced" using AI and had an AI narrator. I only realized it when the fight footage got so old, and the AI had to do so much work to touch it up, that artifacts started appearing in the video. Once I realized it I rewatched the earlier clips in the video and could see the artifacts there too, but not until I was looking for them.


There are already rabbit holes of fake MMA fighting you can fall into online? Even if you're a "fan" and relatively aware of what to look for... still difficult to spot? Horribly, I had the same sensation while watching UFC at a bar. "Haven't I seen this match where they fall on the ground and hug for hours before?" Mostly empty background audience with limited reactions.

Somebody took AI video editing and, within a year or two, we're already at entire MMA rabbit holes of fake videos.

Commenting mostly as a personal, anecdotal reference for how crazy the World Wide Web has gotten.


Most probably they employ overseas, underpaid workers with non-standard English accents, and so they include text-to-speech in the production process to smooth out the end result.

I won't argue whether text-to-speech qualifies as AI, but I agree they must be making bank.


I wonder if they are making bank. Seems like a race to the bottom, there’s no barrier to entry, right?


Right, content creators are in a race to the bottom.

But the people who position themselves to profit from the energy consumption of the hardware will profit from all of it: the LLMs, the image generators, the video generators, etc. See discussion yesterday: https://news.ycombinator.com/item?id=41733311

Imagine the number of worthless images being generated as people try to find one they like. Slop content creators iterate on a prompt, or maybe create hundreds of video clips hoping to find one that gets views. This is a compute-intensive process that consumes an enormous amount of energy.

The market for chips will fragment, margins will shrink. It's just matrix multiplication and the user interface is PyTorch or similar. Nvidia will keep some of its business, Google's TPUs will capture some, other players like Tenstorrent (https://tenstorrent.com/hardware/grayskull) and Groq and Cerebras will capture some, etc.

But at the root of it all is the electricity demand. That's where the money will be made. Data centers need baseload power, preferably clean baseload power.

Unless hydro is available, the only clean baseload power source is nuclear fission. As we emerge from the Fukushima bear market where many uranium mining companies went out of business, the bottleneck is the fuel: uranium.


You spent a lot of words to conclude that energy is the difference maker between modern western standards of living and whatever else there is and has been.


Ok, too many words. Here's a summary:

Trial and error content-creation using generative AI, whether or not it creates any real-world value, consumes a lot of electricity.

This electricity demand is likely to translate into demand for nuclear power.

When this demand for nuclear power meets the undersupply of uranium post-Fukushima, higher uranium prices will result.


Continuing that thought: higher uranium prices and real demand will lead to unshuttering and exploiting known, proven deposits that are currently idle, and will increase exploration activity on known resources to advance their status to measured and modelled for economic feasibility, along with revisiting radiometric maps to flag raw prospects for basic investigation.

More supply and lower prices will result.

Not unlike the recent few years in (say) lithium: anticipated demand surged exploration and development, actual demand didn't meet anticipated demand, and a number of developed, economically feasible resources were shuttered... still waiting in the wings for a future pickup in demand.


Spend a few months studying existing demand (https://en.wikipedia.org/wiki/List_of_commercial_nuclear_rea...), existing supply (mines in operation, mines in care and maintenance, undeveloped mines), and the time it takes to develop a mine. Once you know the facts we can talk again.

Look at how long NexGen's Rook 1 Arrow is taking to develop (https://s28.q4cdn.com/891672792/files/doc_downloads/2022/03/...). Spend an hour listening to what Cameco said in its most recent conference call. Look at Kazatomprom's persistent inability to deliver the promised pounds of uranium, their sulfuric acid shortages and construction delays.

Uranium mining is slow and difficult. Existing demand and existing supply are fully visible. There's a gap of 20-40 million pounds per year, with nothing to fill the gap. New mines take a decade or more to develop.

It is not in the slightest like lithium.


> Spend a few months studying existing demand

Would two decades in global exploration geophysics and being behind the original incarnation of https://www.spglobal.com/market-intelligence/en/industries/m... count?

> Once you know the facts we can talk again.

Gosh - that does come across badly.


Apologies.

When someone compares uranium to lithium, I know I'm not talking to a uranium expert.

All the best to you, and I'll try to be more polite in the future.


Weird... and to think I spent several million line kms on radiometric surveys, worked multiple uranium mines, made bank on the 2007 price spike, and that we published the definitive industry uranium resource maps in 2006-2010.

Clearly you're a better expert.

> when someone compares uranium to lithium, I know I'm not talking to a uranium expert.

It's about boom bust and shuttering cycles that apply in all resource exploration and production domains.

Perhaps you're a little too literal for analogies? Maybe I'm thinking in longer time cycles than you and don't see a few years of lag as anything other than a few years.


Once again, allow me to offer my sincere apologies.

You are well-prepared to familiarize yourself with the current supply/demand situation. It's time to "make bank", just like you did in 2007... only more so. The 2007 spike was during an oversupplied uranium market and mainly driven by financial actors.

I invite you to begin by listening to any recent interview with Mike Alkin.

Good night and enjoy your weekend.


> Most probably they employ overseas, underpaid workers with non-standard English accents, and so they include text-to-speech in the production process to smooth out the end result.

Might also be an AI voice-changer (i.e. speech2speech) model.

These models are most well-known for being used to create "if [famous singer] performed [famous song not by them]" covers — you sing the song yourself, then run your recording through the model to convert the recording into an equivalent performance in the singer's voice; and then you composite that onto a vocal-less version of the track.

But you can just as well use such a model to have overseas workers read a script, and then convert that recording into an "equivalent performance" in a fluent English speaker's voice.

Such models just slip up when they hit input phonemes they can't quite understand the meaning of.

(If you were setting this up for your own personal use, you could fine-tune the speech2speech model like a translation model, so it understands how your specific accent should map to the target. [I.e., take a bunch of known sample outputs, and create paired inputs by recording your own performances of them.] This wouldn't be tenable for a big low-cost operation, of course, as the recordings would come from temp workers all over the world with high churn.)
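(A toy sketch of that paired fine-tuning idea, with a stand-in network and random tensors in place of a real speech2speech model and real mel-spectrogram pairs; purely illustrative of the shape of the data, not an actual S2S system:)

```python
# Toy sketch of "fine-tune on paired recordings": a stand-in network trained on
# (your-accent, target-voice) feature pairs. Real speech2speech models are far
# more complex; the data here is random and only shows the training loop shape.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Paired mel-spectrogram frames: your recording of a line vs. the same line
# rendered in the target speaker's voice (batch, frames, mel bins).
source = torch.rand(8, 200, 80)
target = torch.rand(8, 200, 80)

for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(source), target)
    loss.backward()
    optimizer.step()
```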


Can you identify any of these models?


I think it's unusual to assume they are based in the US and employ/underpay foreigners. A lot of people making the content are just from somewhere else.


But it uses AI only for audio, right? The script for the vid seems to be written by a human, given this channel's unusual type of humor. I started watching this channel some time ago.


It's hard to tell whether they use AI for script generation. After having seen enough of those recaps, the humor seems rather mechanical, and basic humor is relatively easy to get from an LLM if prompted correctly. The video titles also seem as if they were generated.

That said, this channel was producing videos well before ChatGPT 3.5/4, so at the very least they probably started with human-written scripts.


I thought it was just text to speech when I happened to see some of those videos. And it seems to have been consistently similar since before ChatGPT etc. Why do you think the titles are AI generated?

I feel like it might actually be quite complex for AI to pull up the perfect clips and edit them together with the script, including timing and everything. Maybe it could be made automatic, but it would nonetheless be a complex process, and I don't think it was possible a few years ago. I know Gemini and possibly some others can analyze video if it's fed to them, but I'm still skeptical that this channel in particular would have done it, when they have always had this frequency of uploads and similar tone.

Also, I think there's far better TTS now with ElevenLabs and others, so it could be made much more human-like.


The way I see it, it won’t take long before human eyes won’t be able to distinguish AI generated content from original.

The only regret I have about that is losing video as a form of evidence. CCTV footage and the like are a valuable tool for solving crimes. That’s going to be out the window soon.


Trust can be preserved by adding PKI at the hardware level. What you said about CCTV is true; once the market realises and demand appears, camera manufacturers will start making camera modules that, e.g., sign each frame with the manufacturer's private key, enabling Joe Public to verify that that frame came from a camera made by that manufacturer. Reputational risk makes the manufacturer store the private key in the device in a secure, tamper-proof way (like TPMs do now), which (mostly) prevents those private keys from leaking.

Does this create difficulties if you want to modify the raw video data in any way? Yes it does, even if you just want to save it in a different lossy compression level or format. But these problems aren't insurmountable. Essentially, provenance info can be added for each modification, signed by the entity that made the change, and the end viewer can then decide if they trust the full certificate chain (just as they do now with HTTPS).
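A minimal sketch of the per-frame signing/verification idea, using Python's cryptography library; in practice the private key would live in the camera's secure element and the public key would be distributed via a certificate chain, so everything below is simplified and illustrative:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Stand-in for a key burned into the camera's tamper-proof hardware at the factory.
camera_key = ec.generate_private_key(ec.SECP256R1())
camera_pubkey = camera_key.public_key()  # published via the manufacturer's cert chain

def sign_frame(frame: bytes) -> bytes:
    # The camera signs each raw frame (or a hash of it) as it is captured.
    return camera_key.sign(frame, ec.ECDSA(hashes.SHA256()))

def verify_frame(frame: bytes, signature: bytes) -> bool:
    # Anyone holding the public key can check the frame is untouched.
    try:
        camera_pubkey.verify(signature, frame, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False

frame = b"\x00" * 1024                     # stand-in for raw sensor data
sig = sign_frame(frame)
print(verify_frame(frame, sig))            # True
print(verify_frame(frame + b"edit", sig))  # False: any modification breaks it
```

Any re-encode or edit would of course break the signature, which is exactly where the signed provenance entry per modification comes in.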


Oh wow, that's a great idea. Isn't this already happening maybe?

Recently someone said here that it's noticeable that videos from CCTV cameras are often filmed with a phone or camera pointed at a screen instead of using the original video, and people were speculating that it might be hard or impossible to get access to the original recording because of bureaucracy or something, but that recording a playback on a screen is often allowed. Maybe they also do this partly so that the original can't be easily messed with by other people.

But yeah, if you can verify that a certain video was filmed at a certain time by a certain camera, that is great. Of course, the companies providing these cameras should be trustworthy, the cameras should actually send what they really record, and the company itself shouldn't mess with the original recordings.


>Isn't this already happening maybe?

I recall an article posted 1-2 years ago about a camera company (Kodak? Can't remember) which was starting to offer something along these lines.

>the companies providing these cameras should be trustworthy, the cameras should actually send what they really record, and the company itself shouldn't mess with the original recordings.

I agree. We can't guarantee any of these things, but on the bright side, the incentives are pointing in the right direction to make self-interested companies choose to behave the right way.

It will complicate things and make the hardware more expensive, so I doubt it will sweep through all consumer camera tech unless the "Is this photo real?" question becomes a crisis. There's also the fact that it would be possible to give individual cameras different private keys, with certificates signed by the manufacturer: This would enable non-repudiation (you would not be able to plausibly deny that you had taken a particular photo/video), which has potentially big upsides but also privacy downsides. I think that could be solved by giving the user the option of signing with their unique camera private key (when the user wants to prove to others that they took the photo themselves) or with the manufacturer's key (when they want to remain anonymous).


It's sad that almost AS SOON as we acquired the ability to record real-life moments (with the promise of being able to share undeniable evidence of events with one another), we also acquired the ability to doctor it, negating that promise.


I'm not sure we should have been trusting images for the previous decades either. Photoshop has been a thing for a long time already. I mean, there's those famous photos that Stalin had people removed from.


Your mention of Stalin is I think stronger as an argument that there’s been a significant change. Those fakes took lots of time by skilled humans and were notoriously obvious - what made them effective was the crushing political power preventing them from receiving critical analysis in public.

Similarly, while Photoshop made it easier it happened at a time where technical advances made the problem harder because everyone’s standards for photos went up dramatically, and so producing a realistic fake was still a slow process for a skilled worker.

Now, it’s increasingly available to everyone and that means that we’re going to see a lot more scams and hoaxes as people without artistic talent or willingness to invest time can make realistic fakes even for minor things. That availability is transformative enough to merit the concern we’ve been seeing here.


The glass-half-full part of me feels that the advantage to this is that in a few years the average person will know better than to trust anything that could be faked like that, instead of the old situation where someone who was willing to put in the effort could actually trick a lot of people.


I think that's true, but it's kind of like the trade-offs during the pandemic, where we knew it would eventually settle into a stable state but still wanted to reduce the harm getting there. We basically need some large fraction of the global population to level up in media literacy all at once.


I don't think it goes out the window completely. You need just the owner of the CCTV to stand up in court and say "yes this is the CCTV footage I personally copied from storage and I did not manipulate it".


> compression mostly makes imperfections go away

The ultimate compression is to reduce the video clip to a latent space vector representation to be rendered on device. :)

Just give us a few more revs of Moore’s law for that to be reasonable.

edit: found a patent… https://patents.google.com/patent/US11388416B2/en
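Half-joking sketch of what that would look like: a pretrained encoder/decoder pair where only the latent vector crosses the wire. The modules below are untrained stand-ins just to show the shapes.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512                       # the whole clip squeezed into 512 floats

frames, h, w = 16, 64, 64
encoder = nn.Sequential(nn.Flatten(), nn.Linear(frames * 3 * h * w, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, frames * 3 * h * w),
                        nn.Unflatten(1, (frames, 3, h, w)))

clip = torch.rand(1, frames, 3, h, w)  # 16 frames of 64x64 RGB
latent = encoder(clip)                 # this vector is all that gets "transmitted"
reconstruction = decoder(latent)       # rendered on the receiving device
print(f"{latent.numel()} floats sent instead of {clip.numel()}")
```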


That sheen looks (to me) like some of the filters that are used by people who copy videos from TV and movies and post them on (for example) Facebook Reels.

There's an entire pattern of reels that are basically just ripped-off-content with enough added noise to (I presume) avoid content detection filters. Then the comments have links to scam sites (but are labelled as "the IMDB page for this content").


The idea that Meta’s effectively stolen content is tainted by a requirement to avoid collecting stolen content is laughably ironic.


Yes, but that's just a hypothesis. Have we seen any evidence that the cause of the "AI sheen" is bad training data, or is it more likely just a shortcoming of generating realistic photos from text at this early stage?


I thought the movements were off. The little girl on the beach moves like an adult, the painter looks like a puppet, and everything is in slow motion?


They look like some commercial promo video, which makes sense since that's probably what they were trained on.


To me they seem off, but off in the same sense real humans in ads always seem off. E.g. the fake smile of the smiling girl. That's what people look like in ads.


At least all the humans in these videos seem to have the correct number of fingers, so that's progress. And Moo Deng seems to have a natural sheen for some reason so can't hold that against them. But your point about the edges is still a major issue.


I wonder how much RLHF or other human tweaking of the models contributes to this sort of oversaturation / excess contrast in the first place. The average consumer seems to prefer such features when comparing images/video, and uses them as a heuristic for quality. And there have been some text-to-image comparisons of older-gen models to newer-gen ones, purporting that the older, more hands-off models didn't skew towards kitschy and overblown output the way newer ones do.


>All the vids have that instantly recognizable GenAI "sheen"

That's something that can be fixed in a future release or you can fix it right now with some filters in post in your pipeline.


I think the big blind spot people have with these models is that the release pages only show just the AI output. But anyone competently using these AI tools will be using them in step X of a hundred step creative process. And it's only going to get worse as both the AI tools improve and people find better ways to integrate them into their workflow.


Yeah exactly. Video pipelines that go into productions we only see the end results of have a lot of steps to them beyond just the raw video output/capture. Even Netflix/Hollywood productions without VFX have a lot of retouching and post processing to them.


Not even filters; every text2image model created thus far can be very easily nudged with a few keywords into generating outputs in a specific visual style (e.g. artwork matching the signature style of any artist it has seen some works from).

This isn't an intentional "feature" of these models; rather, it's kind of an inherent part of how such models work — they learn associations between tokens and structural details of images. Artists' names are tokens like any other, and artists' styles are structural details like any other.

So, unless the architecture and training of this model are very unusual, it's gonna at least be able to give you something that looks like e.g. a "pencil illustration."
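e.g. with an open model and the diffusers library, it's literally just appending style tokens to the prompt (the model choice and prompts below are arbitrary examples):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base = "a lighthouse on a cliff at dusk"
styles = ["", ", pencil illustration", ", 35mm film photo, grainy, muted colors"]
for i, style in enumerate(styles):
    # Same scene, steered into a different look purely by the trailing style tokens.
    image = pipe(base + style, num_inference_steps=30).images[0]
    image.save(f"lighthouse_{i}.png")
```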


> "That's something that can be easily fixed in a future release (...)"

This has been the default excuse for the last 5+ years. I won't hold my breath.


5 years ago there were no AI videos. A bit over a year ago the best AI videos were hilarious hallucinations of Will Smith eating spaghetti.

Today we have these realistic videos that are still in the uncanny valley. That's insane progress in the span of a year. Who knows what it will be like in another year.

Let'em cook.


Disco Diffusion was a (bad) thing in 2021 that led to the spaghetti-video / weird Burger King ads level of quality. But it ran on consumer GPUs / in a Jupyter notebook.

2 years ago we had decent video generation for clips

7 months ago we got Sora https://news.ycombinator.com/item?id=39393252 (still silence since then)

With these things, like DALL-E 1 and GPT-3, the original release of the game changer often comes ca. 2 years before people can actually use it. I think that's what we're looking at.

I.e. it's not as fast as you think.


What video generation was decent 2 years ago? Will smith eating spaghetti was barely coherent and clearly broken, and that was March 2023 (https://knowyourmeme.com/memes/ai-will-smith-eating-spaghett...).

And isn’t this model open source…? So we get access to it, like, momentarily? Or did I miss something?


So you're right to be excited, I agree. And I don't know; Meta, like OpenAI, seems to release conditionally, though yes, more often. I doubt it'll come before the election.

When the Will Smith one was released, it was kind of a parody though. Tech had already been able to produce that level of "quality" for about 2 years at the time of its publishing. The Will Smith one is honestly something you could have created with Disco Diffusion in early 2021; I used to do this back then...

2022 saw: https://makeavideo.studio/ (coherent, but low res - it was possible to upscale at extreme expense) https://sites.research.google/phenaki/ https://lumiere-video.github.io/

It was more like 18-20 months ago, sorry, so early 2023, but https://runwayml.com/research/gen-1 was getting there, as was https://pika.art/home - Sora obviously changed the game, but I would say these two were great.


The subtle "errors" are all low hanging fruit. It reminds me of going to SIGGRAPH years back and realizing most of the presentations were covering things which were almost imperceptible when looking at the slides in front. The math and the tech were impressive, but qualitatively it might not have even mattered.

The only interesting questions now have nothing to do with capability but with economics and raw resources.

In a few years, or less, clearly we'll be able to take our favorite books and watch unabridged, word-for-word copies. The quality, acting, and cinematography will rival the biggest budget Hollywood films. The "special effects" won't look remotely CG like all of the newest Disney/Marvel movies -- unless you want them to. If publishers put up some sort of legal firewall to prevent it, their authors, characters, and stories will all be forgotten.

And if we can spend $100 of compute and get something I described above, why wouldn't Disney et al throw $500m at something to get even more out of it, and charge everyone $50? Or maybe we'll all just be zoo animals soon (Or the zoo animals will have neuralink implants and human level intelligence, then what?)


> In a few years, or less, clearly we'll be able to take our favorite books and watch unabridged, word-for-word copies. The quality, acting, and cinematography will rival the biggest budget Hollywood films. The "special effects" won't look remotely CG like all of the newest Disney/Marvel movies -- unless you want them to. If publishers put up some sort of legal firewall to prevent it, their authors, characters, and stories will all be forgotten.

I'm also expecting, before 2030, that video game pipelines will be replaced entirely. No more polygons and textures, not as we understand the concepts now, just directly rendering any style you want, perfectly, on top of whatever the gameplay logic provided.

I might even get that photorealistic re-imagining of Marathon 2 that I've been wanting since 1997 or so.


> In a few years, or less, clearly we'll be able to take our favorite books and watch unabridged, word-for-word copies. The quality, acting, and cinematography will rival the biggest budget Hollywood films. The "special effects" won't look remotely CG like all of the newest Disney/Marvel movies -- unless you want them to. If publishers put up some sort of legal firewall to prevent it, their authors, characters, and stories will all be forgotten.

I don't think so at all. You're thinking a movie is just the end result that we watch in theaters. Good directing is not a text prompt, good editing is not a text prompt, good acting is not a text prompt. What you'll see in a few years is more ads. Lots of ads. People who make movies aren't salivating at this stuff but advertising agencies are because it's just bullshit content meant to distract and be replaced by more distractions.


Indeed, adverts come first.

But at the same time, while it is indeed true that the end result is far more than just making good images, LLMs are weird interns at everything — with all the negative that implies as well as the positive — so while they're not likely to produce genuinely award-winning content all by themselves (even though they do better when asked for something "award winning"), it's certainly conceivable that we'll see AI do all these things competently at some point.


> "In a few years, or less, clearly we'll be able to take our favorite books and watch unabridged, word-for-word copies."

That would be a boring movie.


You had AI videos 5 years ago?


AI in general.


…I mean, it was advancing slowly for linguistic tasks until late 2022, that’s fair. That’s why we’re in such a crazy unexpected rollercoaster of an era - we accidentally cracked intuitive computing while trying to build the best text autocomplete.

AI in general is from 1950, or more generally from whenever the abacus was invented. This very website runs on AI, and always has. I would implore us to speak more exactly if we're criticizing stuff; "LLMs" came around (in force) in late 2022/2023, both for coherent language use (ChatGPT 3.5) and image use (DALL-E 2). The predecessors were an order of magnitude less capable, and going back 5 years puts us back in the era of "chatbots", aka dumb toys that can barely string together a Reddit comment on /r/subredditsimulator.


AI so far has given us the ability to mass-produce shit content of no use to anybody, and the next iteration of customer support phone menu trees that sound more convincing yet remain just as useless. That and another round of IP theft and mass surveillance in the name of progress.


This is a consequence of a type of cognitive bias - bad examples of AI are more easily detectable than good examples of AI. Subsequently, when we recall examples of AI content, bad examples are more easily accessible. This leads to the faulty conclusion that:

> AI so far has given us the ability to mass-produce shit content of no use to anybody

Good AI goes largely undetected, for the simple reason that it closely matches the distribution of non-AI content.

Controversial aside: This is same bias that results in non-passing trans people being representative of the whole. Passing trans folk simply blend in.


This basic concept can be applied in many places. Do you ever wonder why social movements seem to never work out well and demands are never met? That’s because when they do work out, and demands are met, those people quickly become the “oppressor” or the powerful class from which others are fighting to receive more rights or money.

All criminals seem so incredibly stupid that you can’t understand why anyone would ever try since they all are caught? The smart ones don’t get caught and no one ever hears about them.


You're making an unverifiable claim. How are we supposed to know that the undetected good AI exists at all? Everything I've seen explicitly produced by any of these models is in uncanny valley territory still, even the "good" stuff.


Don't care. Every request for verification will eventually reach the Münchhausen trilemma


Okay. So you are a person who does not care if what they are saying is true. Got it!


Verificationism[1] is a failed epistemology because it breaks under the Münchhausen trilemma. It's pseudo-scientific like astrology, four humors, and palm reading. Use better epistemologies.

https://en.wikipedia.org/wiki/Verificationism


The core use case is as a small part of larger programs. It’s just computer vision but for words :)


We don't have AI in general today


I'm thankful to be able to recognize that sheen, though I think it will go away soon enough


I don't think that's a bug. I think that helps us separate truth from fiction as we navigate the transition to this new world.


Ever heard of post processing? Because no, you can't trust these signals to always exist with AI content.


It is maybe recognizable in most cases, but definitely not instantly nor easily. I could definitely see nobody noticing one of those clips used in an otherwise non-AI video production.


I did some image generation and found a LoRA for VHS footage. It's amazing what "taking away the sheen" can do to make an image look strikingly real.
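For anyone curious, applying that kind of style LoRA with diffusers is only a couple of lines (the LoRA path below is a placeholder, not a real repo):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/vhs-footage-lora")  # placeholder LoRA path

# The trigger words depend on how the LoRA was trained; these are illustrative.
image = pipe("a suburban street at night, vhs footage, timestamp overlay").images[0]
image.save("vhs_street.png")
```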


The ATV turning in mid-air was a giveaway as well. Physics seems to be a basic problem for these types of videos.


The bubble released into the air is also pretty good, until the end, where bubbles appear out of thin air.

But overall the physics are surprisingly good. In the videos from text we see a person moving while covered in a bedsheet, a mirror doing vaguely mirror-like things, a monkey moving in water and creating plausible waves, shadows moving over a 3D object with the sloth in the pool, and plausible fire. Those are all classic topics to tackle in computer-generated graphics, all casually handled by a model that isn't explicitly trained on physical simulation.

In a twist of irony it's the simplest of those (the mirror) that's the most obviously wrong.


Video autotune.


A lot look like CGI, but I wouldn't be able to tell that they weren't created by an actual animator.


I think that's because they're still using mean-squared error in their loss function.
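If that is the cause, the usual intuition is easy to show numerically: when the training target is ambiguous, the MSE-minimizing prediction is the average of the plausible answers, i.e. a smooth blur rather than any one sharp detail. Toy illustration (numpy, illustrative only, not anything about these particular models):

```python
import numpy as np

rng = np.random.default_rng(0)
# An ambiguous pixel that is equally often a bright edge (1.0) or dark background (0.0).
targets = rng.choice([0.0, 1.0], size=10_000)

candidates = np.linspace(0.0, 1.0, 101)
losses = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(losses))]
print(best)  # ~0.5: MSE rewards a grey smear over committing to either sharp value
```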


Yeah but... it's good enough?

There were movies with horrible VFX that still sold perfectly well at the time.


An important contrast is that early VFX offered strong control with weak fidelity, and these prompt-based AI systems offer high fidelity with weak control. Intent matters if you want to make something more than a tech demo or throwaway B-roll and you can't communicate much intent in a 30 word prompt, assuming the model even follows the prompt accurately.


This is such an important problem with the entire genAI idea. It's absurd that people keep focusing on quality instead of talking about it.

But then, a lot of people have financial reasons to ignore the problem. Which is too bad, because it's hindering the creation of stuff that is actually useful.


> AI systems offer high fidelity with weak control

You are spot on. I've been involved in creating technologies used by film and video creators for decades, so I have some understanding of what would be useful to them. The best video AIs I've seen only seem capable of replacing some stock video clip creation because, so far, I haven't seen any ability to maintain robust consistency from shot to shot and scene to scene. There's also no granular control other than the text prompt. At first glance, these demos are very impressive but when you try to map the capability shown to a real production workflow for a movie, TV show or commercial, they're not even close because they aren't even trying to solve the problem.

To be clear, I think it's probably possible to create a video AI that would be truly useful in a real production workflow, it's just that I haven't seen anything working in that direction yet.


> You are spot on. I've been involved in creating technologies used by film and video creators for decades, so I have some understanding of what would be useful to them. The best video AIs I've seen only seem capable of replacing some stock video clip creation because, so far, I haven't seen any ability to maintain robust consistency from shot to shot and scene to scene. There's also no granular control other than the text prompt. At first glance, these demos are very impressive but when you try to map the capability shown to a real production workflow for a movie, TV show or commercial, they're not even close because they aren't even trying to solve the problem.

Yeah, it's really hard to get across to a lot of folks who are really amped up about these tools that what they're focused on refining is not getting them any closer to their imagined goal in most professional workflows. This will be great right off the bat for what most developers would need images for -- making a hero image for a blog post, making a blurb of video for a background or a joke, or making assets for their video game that would never cut it for a non-cheapo commercial project but are better than what they'd have been able to cobble together themselves. But those workflows are fundamentally so different from the very first steps in the process. It's a larger-scale version of trying to explain to no-compromise FOSS zealots 20 years ago that Gimp was nowhere near able to replace Photoshop in a professional toolkit, because they were completely disinterested in taking feedback about professional use cases, and being able to write your own filters in Perl doesn't really help graphic designers -- well, 20 years later, the gap is as wide as it has ever been, and there are even more people, almost exclusively FOSS nerds with no professional visual work experience, who insist it's better.

That said, it's nearly as hard to get this across to ADs who are like "what do you mean this shot is going to take you 3 days? I just made these stills which are like 70% there in midjourney in 10 minutes."

> To be clear, I think it's probably possible to create a video AI that would be truly useful in a real production workflow, it's just that I haven't seen anything working in that direction yet.

I think that neural networks, generally, are already fantastically useful in tools like Nuke's Copycat node. Nobody misses masking frame-by-frame if they don't have to do it. But prompt-based tools? Nah. If even 200 words in a prompt was enough information to convey work that needed to be done, why do creative workflows need so many revisions and why are there so many meetings with sketches and mood boards and concept art among career professionals? Text prompts are great for people that are working with a medium they don't really know how to create in because the real artistic decisions are already made by the artists whose artwork was ingested into the models. If you don't understand that level of nuance, you don't see how unbelievably consequential it is to the final product, and not having granular control of it seems nearly inconsequential. Most professionals look at it and see a toy because they know it will never be capable of making what they want it to make.


> neural networks, generally, are already fantastically useful in tools

Yes, I agree. You've highlighted the distinction I should have included of "prompt-based".

There's a vast gulf between these AI-researcher-based concept demos on one side and the NN-based features slowly getting implemented in real production tools. Like you, I've found it challenging to have constructive conversations about AI tooling with anyone not versed in real production workflows. To anyone with real industry experience it's obvious that so far these demos don't represent a threat to real production workflows or the skilled career professionals making a good living. It's not that they're not threatening, they're just threatening to replace a different type of job entirely. If you're one of the poor souls in an off-shore locale doing remote low-end piece-work like manning a stock photo/video clip farm or doing <$100 per piece gigs on Fiverr - then, yeah, you should feel "threatened".

A meta-point I try to make in these conversations is that, at least so far, every actual paying creative job I've seen AI threaten are, IMHO, work I wouldn't wish on my worst enemy. These are low-paid entry-level sweatshop gigs and everyone doing them aspires to do something else as soon as they can. The two analogies I use are: 1) How the "threat" of robotics to jobs is actually playing out. So far, in industrial applications robots are replacing Amazon warehouse and manufacturing assembly line workers, literally today's equivalent of 1920s sweatshop work. Much like the heart-wrenching videos of children in Calcutta earning pennies sifting through junk piles for metal scraps, it'll be a better world when robots replace those jobs and humans have jobs designing, installing, programming and servicing the robots. Likewise in consumer robotics applications, so far, the robots in our house only vacuum the floors, change the cat litter box, and wash the dishes/clothes. Growing up my family spent a couple years living in Asia in the 1970s and we actually had a "wash ama" who came twice a week and washed our clothes manually with a washboard and a tub. Sounds quaint but in reality it was grueling labor. She was a lovely lady but I'm glad Maytag replaced that job.

The second analogy I often use is observing that self-driving cars are mainly a threat to Uber and Lyft drivers who often barely earn minimum wage and have no job security to start with. Career professionals actually working in real video and film production workflows feel as "threatened" by prompt-based AIs as Formula 1 drivers feel about self-driving cars. Why does current F1 champion Max Verstappen never get asked how he feels about AI self-driving cars coming for his job? :-) As you observed, anyone who understands the thousands of creative choices which comprise any shot in a quality film doesn't even see these prompt-based AI demos as relevant. Once you've heard a skilled cinematographer, colorist or director of photography spend over an hour deconstructing and debating the creative choices made in single shot or scene from a film, it's hard to even imagine these demos as a threat to that level of creative skill. But being able to crudely copy the traits of a composite of a thousand exemplars of the craft without understanding any of the interactions between those thousands of creative choices does make for impressive demos. Even though the fidelity of the crude copy is amazing, the fact is such shots are a random puree of a thousand different creative choices pulled from a thousand different great shots. That's the root of what unskilled people call the "AI-clip sheen". It won't be easy to eliminate from prompt-based clip generators because the nature of the NN is it doesn't understand the interactions of all those subtle creative choices it's aping. Mashing together one cinematographer's lens choice from one shot with another cinematographer's filter choice from another shot with a third cinematographer's film stock choice from another film and a colorist's palette from a fourth unrelated work and then training the output filter only against broad criteria like "looks good" or "like a high-quality art film" is not a strategy that, IMHO, will ever produce a true threat to skilled top-level production workflows.

At the same time, as you observed, NN's are already delivering tremendous value eliminating labor-intensive, repetitive manual production work like frame-by-frame rotoscoping and animation tweening, work no one actually in the industry is sorry to see humans being relieved of. While I think NN-based features in production tools will continue to expand the use cases they can assist, I'm not sure AI tools will ever completely replace high-skill production professionals. I've already mentioned the technical challenges based on how NNs work but even if these challenges are someday overcome, there's a more fundamental limitation which is economic. Although feature film, network-level television and high-end commercials have massive cultural reach and are huge industries, the overall economic value of the entire technical production workflow and related tooling isn't as large as most people imagine. From Panavision cameras, Zeiss film lenses and Adobe Premiere to Chapman camera cranes, Sachtler tripods and Kinoflow lights, it's a relatively small industry with no unicorn-level startups. Even assuming one could license all the necessary content and manually tag it, it's hard to imagine a viable business plan which justifies investing the hundreds of millions required to recruit top-level AI researchers, thousands of H100 GPUs, etc to create and train a tool that could really replace the top 1000 career production pros working in Hollywood. There are so many other markets AI can target which are potentially far more lucrative than high-end film and video production workflows. Even the handful of blockbuster Summer tent pole movies made each year that cost $200M to make only spend somewhere around $10M or $20M on production labor and tooling below the department head level. That's not enough money to fund AI replacement anytime in the foreseeable future. The total addressable market of high-end film and video production just isn't big enough to be an attractive target for investors to fund going after it.


I think the most vulnerable spots in the industry are in concept art and matte painting, though I also think companies are starting to realize it's not all it's cracked up to be. A colleague who also contracts for [big famous FX and animation house we all know and love] said they fired their entire concept art department last year and replaced them with prompt jockeys... for a few weeks. The prompters could bang out a million "great start" rough drafts in an hour, but then when their boss came around and inevitably said "oh, this one is the one to stick with. Just move this to the right and that to the left and make this bigger and that smaller and make this cloth purple," they were cooked. They didn't even have the comparatively basic Photoshop skills to do a hack job there, let alone make changes by hand, so they'd struggle with control nets and inpainting and more prompts, but the whole thing was one gigantic failure and they went begging for forgiveness to the centuries of concept art expertise they'd unceremoniously booted out the door. And those workflows don't require anywhere near the control that, say, compositing does.

My biggest hope for the professional use of these things is in post-render, pre-comp polishing for simulations and pyro. They're so good at understanding patterns and making smooth transitions that they can make a nonsensical, physically absurd combination of images blend together perfectly... one of my favorites: a background guy's nose in a sepia-toned video was neatly melded into a distant oncoming train. I think that could be really great for smoothing out volume textures and things like that. Granted, that probably has more to do with my specialty than anything.

My main problem is that I'm just starting out my career in this field after switching from a decade of python dev work, and then doing some visual design before going to art school where I graduated at the top of my program having mostly concentrated in making cool shit at the Houdini/UE confluence. Two years ago everyone was saying "holy crap you've got the golden skillset," and now everyone's like "oof... hang in there... I guess..." Even aside from the strike aftermath, nobody in the market has any idea what to do right now, especially with juniors, let alone a really weird mixture of junior + senior dev that I am with a few contracts under my belt and a ton of really solid coding experience, but nothing really impressive in the industry itself. Who fucking knows. I think a lot of people in charge of hiring are waiting for a moment where it's going to just be sort of obvious what they need to do, and don't want to hire people into FTEs that are going to be eliminated through ai efficiency gains in 6 months. I don't have a lot of insight into the hiring side of the business though.


Wow, your story about the "FX and animation house" is funny, sad and unsurprising - all at the same time. I'm just surprised they didn't actually test the full workflow before leaping. It reminds me of this tale from actual production people working with Sora https://www.fxguide.com/fxfeatured/actually-using-sora/ which I also found completely unsurprising. It still took a team of three experienced pros around two weeks to complete a very modest 90 second video, and they needed to reduce their expectations to "making something out of the clips the AI gave us" instead of what they actually wanted. And even that reduced goal required using their entire toolbox of traditional VFX tools to modify the clips the AI generated to match each other well enough. Sure, it's early days and Sora is still pre-alpha but, while some of these problems are solvable with fine-tuning, retraining and adding extensive features for more granular control, some other aspects of these workflow gaps are fundamental to the nature of how NNs work. I suspect the bottom line is that solving some key parts of real-world high-end film/video workflows with the current prompt-based NNs is a case of "you can't get there from here."


For sure. Tooling on top of the core model functionality will absolutely increase the utility of the existing prompt-based workflows, too, but my gut says the diminishing returns on model training is going to keep the "good enough" goalposts much much further into the future with video than with text and still images.


Just need to wait for someone to develop a version of ControlNet that works with this system.
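For stills, controlnet-style conditioning already exists in diffusers and looks roughly like the sketch below (canny-edge control as an example; a video equivalent for these new models is hypothetical):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edges = load_image("edges.png")  # placeholder: a precomputed canny-edge map
image = pipe(
    "a ceramic vase on a wooden table, studio lighting",
    image=edges, num_inference_steps=30,
).images[0]
image.save("vase_controlled.png")
```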


Yeah, controlnet-style conditioning doesn't solve for consistent assets, or lighting, framing, etc. Maybe it's early, but it seems hard to get around traditional 3D assets + rendering, at least for serious use-cases.

These models do seem like they could be great photorealism/stylization shaders. And they are also pretty good at stuff like realistic explosions, fluid renders etc. That stuff is really hard with CG.


Yeah, that's a fair point.


It's my understanding that the AI sheen is done on purpose to give people a "tell". It is totally possible right now to at least generate images with no discernible tell.


> It is totally possible right now to at least generate images with no discernible tell.

I have yet to find examples of this


There are numerous tricks and LORAs to make realistic images without the overpolish you get by default:

* https://www.reddit.com/gallery/1fvs0e1

* https://old.reddit.com/r/StableDiffusion/comments/1fak0jl/fi...


Haha, I think I can maybe tell on like one or two of those


In the linked webpage, the following videos would be good enough to trick me:

- The monkey in hotspring video, if not for its weird beard...

- The koala video I would have mistaken for hollywood-quality studio CGI (although I would know it's not real because koalas don't surf... do they?)

- The pumpkin video if played at 1/4 resolution and 2x speed

- The dog-at-Versailles video edit

If the videos are that good, I'm sure I already can't distinguish between real photos and the best AI images. For example, ThisPersonDoesNotExist isn't even very recent, but I wouldn't be able to tell whether most of its output is real or not, although it's limited to a certain style of close-up portrait photography.

https://this-person-does-not-exist.com/en


> limited to a certain style of close-up portrait photography

Not to take away from your point but it's more limited than one might think from this phrase. As an exercise, open that page and scroll so the full image is on your screen, then hover your mouse cursor within the iris of one of the eyes, refresh and scroll again. (Edit: I just noticed there's a delayed refresh button on the page, so one can click that and then move their mouse over the eye to skip a full page refresh.) I've yet to see a case where my mouse cursor is not still in line with the iris of the next not-person.



