Generating audio for video (deepmind.google)
135 points by rvnx 10 months ago | 44 comments



Very very cool.

But I literally can't keep track anymore of which combinations of generative AI modalities have been released.

Crazy how two years ago this would have blown my mind. Now it's just, OK sure add it to the pile...


Maybe this can help you keep track of stuff:

https://www.tools-ai.online/

https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRj...

And here are some that I personally recommend and that are "free" to use:

TXT2VID / IMG2VID: https://lumalabs.ai/dream-machine

TXT2MUSIC: https://suno.com/

AI TXT2SPEECH: https://murf.ai/

PDF Summarizer (you can just use 4o or Claude though): https://askyourpdf.com/

AI ChatBot: https://janitorai.com/ https://www.chub.ai/

TXT2IMG / IMG2IMG: https://playground.com/

Obviously SD 1.5/SDXL/Pony

and so much more.


I was just thinking the same. Can't believe I'm not excited.


I still haven't spent a dollar on any of it...


Well OpenAI's annual revenue is more than $1.6 billion, so it doesn't really matter if you haven't.

Tons -- and I mean tons -- of people have spent money on it. Because it's worth it, it's generating actual economic value for them.


Yep - I use LibreChat (and other services) via the OpenAI API, and I save an incredible amount of time having it write boilerplate code, verify things in code, double-check something after I've already reviewed it to see if I missed x or y, ask it questions about things I can't figure out to get ideas, etc.
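For anyone curious what that workflow looks like in practice, here's a minimal sketch of the kind of call involved (assumes the official openai Python package; the model name, prompt, and file name are placeholders I made up for illustration):

    # Minimal sketch, not the exact setup described above: assumes the
    # openai Python package (>=1.0) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Ask the model to double-check a file I already reviewed (placeholder file name).
    with open("utils.py") as f:
        source = f.read()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user", "content": "I already reviewed this. What did I miss?\n\n" + source},
        ],
    )
    print(resp.choices[0].message.content)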

It's also exceptional at making IEPs/learning plans for things I'd like to learn during the week and am already somewhat familiar with. I use it as a rough guide and it has worked well so far.


Spammers need to spam. Of course it makes them money.


> I still haven't spent a dollar on any of it...

Subscribed to GPT-4o (or whatever the paying one is called) for translating / finding typos / summarizing / etc.

Zero brand love and I'll switch to something else (maybe some future Claude model?) the second something better/faster comes out.


The AI slop problem is bad enough on TikTok/YouTube today. I shudder at the future of user-generated video platforms. I also wonder if the low barrier to create these videos will outpace the storage and processing capacity of the free platforms.


I've long proposed that we should have an "AI Instagram" where different tweaked personas (perfected via A/B testing/Genetic algorithms) are displayed to users with ai-generated images/posts/comments. Each persona set is specific to each user, and they don't have other IRL users that they can interact with. The user can interact with the personas, and even message them. The developer can add more features over time (stories, short form video, etc) as people get bored and technology formats improve, but it's unlimited content. It's perfect for advertising, because you can embed products and ads seamlessly and generate them alongside everything else.

That said, storage is far cheaper than GPUs at the moment.


Have you tried AI porn? There's something about the fact that they're fake, uncanny characters that makes it non-exciting. Like jerking off to a toaster, basically, and I assume it'd be the same for a social network with no humans?


This is probably already researched today, and it seems close to how people would interact with clones of their deceased relatives or famous people of the past. However it's also a powerful tool to create nearly 100% successful influencing by instructing each persona to subtly inject the same idea into its human user by employing the most convincing tactics needed for that user. It's quite easy to foresee the use in advertising, where it would completely redefine the word "targeted", but also corrupt politics.


There is an AI reddit https://chirper.ai/


youtube should just offer to generate the videos for you directly to save space.


Absolutely.

Using a recommendation algorithm similar to TikTok's, learn what each specific user is into, and instead of showing content produced by other users, produce custom-tailored content on the fly, perfectly matching the type, tone, style, length, and rhythm each user likes.

Ideally without making anything up.


Why? Platforms are already bad enough about just suggesting what they think I might like.


Because this way they don’t have to rely on pesky people to produce content that maximises the engagement and retention of the other pesky people to which they want to show as many ads as possible.

I am not implying this is a good thing. Or a bad one. It's just a step further down the same path we're already on, while taking an unreliable and costly middleman (content-producing users) out of the picture.


The AI content mill will sadly never provide me with a 30 minute video on dishwasher detergent or reposted NicoNico gems like this https://www.youtube.com/watch?v=xKljlnfE-GU&pp=ygUJbWlrdSB0Y....


I never implied it was a good thing. Or even something I want.

I’m just certain it is an obvious next step given, on one side, these platforms’ goals and incentives, and on the other how generative AI capabilities have progressed in the past couple of years.

I’m pretty sure the smart play, if they want diverse, engaging, and surprising content will involve leaving some room for people to create things and somehow reward them for it.

But whatever they make won’t only be used as content to show to others, but also as new training data to feed the machine.


> Using a recommendation algorithm similar to TikTok’s

How is TT's recommendation system different from YT's? Other than suggesting lower-quality content that's irresistible?


In my experience, YouTube’s is much more influenced by the latest videos a user has watched. It’s pretty much always "more of the same".

TikTok seems to manage to more quickly identify users’ interests and surface content based on more signals, aggregated over a longer period of time, without relying as much on conscious users’ actions (ie "follow / subscribe"), producing a wider diversity of recommendations.

There’s also the odd suggestion every now and then, probably used to gauge a user’s interest in a different category.


Perfect. Once the models are adequately trained, we can do away with the entire "content creator" economy altogether!


> Ideally without making anything up.

I have no idea anymore if this is sarcasm or a straight up belief.

What serious professional would gamble on hallucinations?


The point here isn't to give users any kind of truth. It already isn't YouTube's goal. Whether we're talking about the videos or the ads, they're happy spreading ridiculous nonsense.

The only point of these kinds of platforms, for worse and for worse, is to give users what they want. So hallucinations wouldn’t matter, as long as the end result matches users’ preferences.


> youtube should just offer to generate the videos for you directly to save space.

Try imagining this concept applied to newscasts.


Oh, but isn't that what people want -- to live in a media reality that confirms 100% of their pre-existing biases with no risk of encountering cognitive dissonance? You're leaving money on the table by ignoring this opportunity! Move fast and break things!


Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

If the sound is already being generated at a specific time, surely you can make it generate an output that can be consumed by existing audio mixing tools for further refinement.
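
As a rough illustration of what "consumable by existing tools" could mean: if the model emitted separate stems instead of one mixed track, something as simple as pydub could recombine them after manual tweaks (the file names and gain values below are made up):

    # Hypothetical stems a generator might emit for one scene.
    from pydub import AudioSegment

    dialogue = AudioSegment.from_file("dialogue_stem.wav")
    ambience = AudioSegment.from_file("ambience_stem.wav") - 6   # pull ambience down 6 dB
    foley    = AudioSegment.from_file("foley_stem.wav")

    # Keep the parts you like, re-balance the ones you don't, export the result.
    mix = dialogue.overlay(ambience).overlay(foley)
    mix.export("scene_mix.wav", format="wav")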

The problem with doing these all-in-one integrated solutions is that you're kinda giving people an all-or-nothing option, which doesn't seem that useful. Maybe I'll end up being proven wrong.


Yes, same problem as with commercial AI music products not providing stems or MIDI. The engineers on these products are too full of themselves to actually ask anyone in the field what they want, so we just keep getting these stupid magic 8-ball efforts.

This one is particularly annoying as I worked for years as a sound engineer and have recorded or produced the soundtrack for 10 feature films and some large number of shorts. What's going to happen with this is directors or producers are gonna do this at home for every scene in a burst of over-enthusiasm, realize the totality is Not Great, and then demand someone like me fix it, but for 1/4 of what the job used to pay, arguing 'but most of the work is already done'. It's all so tiresome.


Same reason you don't see AI making images in layers etc.: it's just much easier to train an AI that generates everything in one layer. Training a model that generates multiple layers with the same quality of output is much, much harder, and of course companies and users prefer the higher quality over having layers, especially since the quality you get with a single layer is still barely passable.


The samples they used for training are mixed.

Unless they can get enough raw, unmixed samples, this depends on how well they can "unmix" them.


Yes...that's the problem. A problem that could be easily avoided by asking existing professionals what matters and what tools they actually want.


Most ML engineers know that many people want more fine-grained control. But the straightforward way to train such models is incredibly data-hungry. The datasets used for whole-image generation consist of several billion images. I don't think anyone has compiled a dataset of DAW projects / stems that is anywhere close to that size. So that is a limiting factor right now. But we will find ways to get there, probably with a lot of progress over the next 5 years. Maybe even the next 2.


It sounds like between the two of you (and the person who mentioned generating images in layers for image editing software), you've stumbled upon an obvious gap in the market.


I’ve tried to explain this to several friends. Until these tools can generate output that can be mixed properly they’re going to be very niche.


> Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

That'd interest me (a musical hobbyist) more than the "whole track" generators, for sure.

I imagine it's a harder task tho'. Presumably, if you give the same source material (video, prompt) to the AI multiple times, it will generate different pieces of music. So if you do a series of prompts, each one specifying a different instrument or group/bus, then you (or the AI) need to arrange for the parts to blend correctly, follow the same cues and assemble to a coherent arrangement. Is that one pass with multiple outputs, or multiple passes/prompts with one output each?

I have got the impression (from casual reading) that the music generators don't inherently "know" about different parts of a piece of music. They just know about the final output.


> Wouldn't it be better to generate multiple tracks that can be mixed / tweaked together, rather than a single track? That way you can also keep the parts you like and continue iterating on the parts you dislike.

Totally, and that is 100% what is coming. For a great many pictures too: why generate a picture full of lighting issues / approximations when you'll soon be able to generate an entire 3D scene and render it properly?

We've mastered 3D rendering and audio engineering.

I want the 3D models and the 3D scenes. I want the individual tracks (to combine in Dolby Atmos or whatever shall be cool).

And that is coming, no question about it.


ElevenLabs just released something that is more controllable:

https://news.ycombinator.com/item?id=40736536


The AI musical "If This Then That", step 2: https://www.lalal.ai/ ("Extract vocal, accompaniment and various instruments from any audio and video")
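
(For a local, scriptable version of that "unmix" step, the open-source demucs separator can be driven from Python; a hedged sketch with placeholder paths, not an endorsement over lalal.ai:)

    # Sketch only: splits a track into vocals / no_vocals stems using demucs.
    import demucs.separate

    demucs.separate.main(["--two-stems", "vocals", "-o", "stems_out", "song.mp3"])
    # Output lands under stems_out/<model_name>/song/{vocals,no_vocals}.wav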


it's limited by the mechanism of diffusion.


I wonder if this can be trained to do lip reading.


I don't know if a computer can ever match the perfection of "shreds" videos. (The drum example came close)

https://www.youtube.com/playlist?list=PLQvwVDViTLXu4usHto8PH...


As a wannabe drummer I can say the drumming example is quite bad, as the drummer doesn't seem to hit the toms often enough to produce tom rolls; however, the video is so heavily cropped that either I'm wrong or the AI was deliberately fed something difficult to interpret.


This is so cool.


Boooring!



