> You are spot on. I've been involved in creating technologies used by film and video creators for decades, so I have some understanding of what would be useful to them. The best video AIs I've seen only seem capable of replacing some stock video clip creation because, so far, I haven't seen any ability to maintain robust consistency from shot to shot and scene to scene. There's also no granular control other than the text prompt. At first glance, these demos are very impressive but when you try to map the capability shown to a real production workflow for a movie, TV show or commercial, they're not even close because they aren't even trying to solve the problem.
Yeah, it's really hard to get across to a lot of folks who are really amped up about these tools that what they're focused on refining isn't getting them any closer to their imagined goal in most professional workflows. This will be great right off the bat for what most developers would need images for-- making a hero image for a blog post, making a blurb of video for a background, or a joke, or making assets for their video game that would never cut it for a non-cheapo commercial project but are better than what they'd have been able to cobble together themselves. But those workflows are fundamentally different from even the very first steps in the professional process. It's a larger-scale version of trying to explain to no-compromise FOSS zealots 20 years ago that Gimp was nowhere near able to replace Photoshop in a professional toolkit, because they were completely uninterested in taking feedback about professional use cases, and being able to write your own filters in Perl doesn't really help graphic designers-- well, 20 years later, the gap is as wide as it has ever been, and there are even more people, almost exclusively FOSS nerds with no professional visual work experience, who insist it's better.
That said, it's nearly as hard to get this across to ADs who are like "what do you mean this shot is going to take you 3 days? I just made these stills, which are like 70% there, in Midjourney in 10 minutes."
> To be clear, I think it's probably possible to create a video AI that would be truly useful in a real production workflow, it's just that I haven't seen anything working in that direction yet.
I think that neural networks, generally, are already fantastically useful in tools like Nuke's Copycat node. Nobody misses masking frame-by-frame if they don't have to do it. But prompt-based tools? Nah. If even 200 words in a prompt were enough information to convey the work that needed to be done, why would creative workflows need so many revisions, and why would there be so many meetings with sketches and mood boards and concept art among career professionals? Text prompts are great for people who are working in a medium they don't really know how to create in, because the real artistic decisions were already made by the artists whose artwork was ingested into the models. If you don't understand that level of nuance, you don't see how unbelievably consequential it is to the final product, and not having granular control of it seems nearly inconsequential. Most professionals look at it and see a toy, because they know it will never be capable of making what they want it to make.
> neural networks, generally, are already fantastically useful in tools
Yes, I agree. You've highlighted the distinction I should have included of "prompt-based".
There's a vast gulf between these AI-researcher-driven concept demos on one side and the NN-based features slowly getting implemented in real production tools on the other. Like you, I've found it challenging to have constructive conversations about AI tooling with anyone not versed in real production workflows. To anyone with real industry experience, it's obvious that, so far, these demos don't represent a threat to real production workflows or to the skilled career professionals making a good living in them. It's not that they're not threatening; they're just threatening to replace a different type of job entirely. If you're one of the poor souls in an off-shore locale doing remote low-end piece-work like manning a stock photo/video clip farm or doing sub-$100 per-piece gigs on Fiverr -- then, yeah, you should feel "threatened".
A meta-point I try to make in these conversations is that, at least so far, all the actual paying creative jobs I've seen AI threaten are, IMHO, work I wouldn't wish on my worst enemy. These are low-paid, entry-level sweatshop gigs, and everyone doing them aspires to do something else as soon as they can. The first analogy I use is how the "threat" of robotics to jobs is actually playing out. So far, in industrial applications, robots are replacing Amazon warehouse and manufacturing assembly-line workers -- literally today's equivalent of 1920s sweatshop work. Much like the heart-wrenching videos of children in Calcutta earning pennies sifting through junk piles for metal scraps, it'll be a better world when robots replace those jobs and humans have jobs designing, installing, programming and servicing the robots. Likewise, in consumer robotics applications, so far, the robots in our house only vacuum the floors, change the cat litter box, and wash the dishes and clothes. Growing up, my family spent a couple of years living in Asia in the 1970s, and we actually had a "wash amah" who came twice a week and washed our clothes manually with a washboard and a tub. Sounds quaint, but in reality it was grueling labor. She was a lovely lady, but I'm glad Maytag replaced that job.
The second analogy I often use is observing that self-driving cars are mainly a threat to Uber and Lyft drivers, who often barely earn minimum wage and have no job security to start with. Career professionals actually working in real video and film production workflows feel as "threatened" by prompt-based AIs as Formula 1 drivers feel about self-driving cars. Why does current F1 champion Max Verstappen never get asked how he feels about AI self-driving cars coming for his job? :-) As you observed, anyone who understands the thousands of creative choices that comprise any shot in a quality film doesn't even see these prompt-based AI demos as relevant. Once you've heard a skilled cinematographer, colorist or director of photography spend over an hour deconstructing and debating the creative choices made in a single shot or scene from a film, it's hard to even imagine these demos as a threat to that level of creative skill. But being able to crudely copy the traits of a composite of a thousand exemplars of the craft, without understanding any of the interactions between those thousands of creative choices, does make for impressive demos. Even though the fidelity of the crude copy is amazing, such shots are a random puree of a thousand different creative choices pulled from a thousand different great shots. That's the root of what unskilled people call the "AI-clip sheen". It won't be easy to eliminate from prompt-based clip generators, because by its nature the NN doesn't understand the interactions of all the subtle creative choices it's aping.
Mashing together one cinematographer's lens choice from one shot with another cinematographer's filter choice from another shot with a third cinematographer's film stock choice from another film and a colorist's palette from a fourth unrelated work and then training the output filter only against broad criteria like "looks good" or "like a high-quality art film" is not a strategy that, IMHO, will ever produce a true threat to skilled top-level production workflows.
At the same time, as you observed, NNs are already delivering tremendous value eliminating labor-intensive, repetitive manual production work like frame-by-frame rotoscoping and animation tweening -- work no one actually in the industry is sorry to see humans relieved of. While I think NN-based features in production tools will continue to expand the use cases they can assist, I'm not sure AI tools will ever completely replace high-skill production professionals. I've already mentioned the technical challenges rooted in how NNs work, but even if those challenges are someday overcome, there's a more fundamental limitation, which is economic. Although feature film, network-level television and high-end commercials have massive cultural reach, the overall economic value of the entire technical production workflow and related tooling isn't as large as most people imagine. From Panavision cameras, Zeiss film lenses and Adobe Premiere to Chapman camera cranes, Sachtler tripods and Kino Flo lights, it's a relatively small industry with no unicorn-level startups. Even assuming one could license all the necessary content and manually tag it, it's hard to imagine a viable business plan that justifies investing the hundreds of millions required to recruit top-level AI researchers, buy thousands of H100 GPUs, etc., to create and train a tool that could really replace the top 1,000 career production pros working in Hollywood. There are so many other markets AI can target that are potentially far more lucrative than high-end film and video production workflows. Even the handful of blockbuster summer tentpole movies made each year that cost $200M only spend somewhere around $10M or $20M on production labor and tooling below the department-head level. That's not enough money to fund AI replacement anytime in the foreseeable future.
The total addressable market of high-end film and video production just isn't big enough to be an attractive target for investors to fund going after it.
I think the most vulnerable spots in the industry are concept art and matte painting, though I also think companies are starting to realize it's not all it's cracked up to be. A colleague who also contracts for [big famous FX and animation house we all know and love] said they fired their entire concept art department last year and replaced them with prompt jockeys... for a few weeks. The prompters could bang out a million "great start" rough drafts in an hour, but when their boss came around and inevitably said "oh, this one is the one to stick with. Just move this to the right and that to the left, make this bigger and that smaller, and make this cloth purple," they were cooked. They didn't even have the comparatively basic Photoshop skills to do a hack job there, let alone make changes by hand-- so they'd struggle with control nets and inpainting and more prompts, but the whole thing was one gigantic failure, and the studio ended up begging for forgiveness from the centuries of concept-art expertise they'd unceremoniously booted out the door. And those workflows don't require anywhere near the control that, say, compositing does.
My biggest hope for the professional use of these things is in post-render, pre-comp polishing for simulations and pyro. They're so good at understanding patterns and smoothing transitions that they can make a nonsensical, physically absurd combination of images blend together perfectly... one of my favorites was a background guy's nose in a sepia-toned video being neatly melded into a distant oncoming train. I think that could be really great for smoothing out volume textures and things like that. Granted, that probably has more to do with my specialty than anything.
My main problem is that I'm just starting out my career in this field after switching from a decade of Python dev work, then doing some visual design before going to art school, where I graduated at the top of my program having mostly concentrated on making cool shit at the Houdini/UE confluence. Two years ago everyone was saying "holy crap, you've got the golden skillset," and now everyone's like "oof... hang in there... I guess..." Even aside from the strike aftermath, nobody in the market has any idea what to do right now, especially with juniors, let alone with the really weird mixture of junior artist + senior dev that I am: a few contracts under my belt and a ton of really solid coding experience, but nothing really impressive in the industry itself. Who fucking knows. I think a lot of people in charge of hiring are waiting for a moment where it's going to just be sort of obvious what they need to do, and don't want to hire people into FTEs that are going to be eliminated through AI efficiency gains in 6 months. I don't have a lot of insight into the hiring side of the business though.
Wow, your story about the "FX and animation house" is funny, sad and unsurprising, all at the same time. I'm just surprised they didn't actually test the full workflow before leaping. It reminds me of this tale from actual production people working with Sora, https://www.fxguide.com/fxfeatured/actually-using-sora/ , which I also found completely unsurprising. It still took a team of three experienced pros around two weeks to complete a very modest 90-second video, and they needed to reduce their expectations to "making something out of the clips the AI gave us" instead of what they actually wanted. Even that reduced goal required using their entire toolbox of traditional VFX tools to modify the AI-generated clips to match each other well enough. Sure, it's early days and Sora is still pre-alpha, and while some of these problems are solvable with fine-tuning, retraining and adding extensive features for more granular control, other aspects of these workflow gaps are fundamental to the nature of how NNs work. I suspect the bottom line is that solving some key parts of real-world high-end film/video workflows with current prompt-based NNs is a case of "you can't get there from here."
For sure. Tooling on top of the core model functionality will absolutely increase the utility of the existing prompt-based workflows, too, but my gut says the diminishing returns on model training are going to keep the "good enough" goalposts much, much further into the future with video than with text and still images.