
Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/

The video latency is definitely the biggest hurdle. With dedicated A100s I can get it below 2s, but it's pricey.



This looks awesome. Didn’t seem to hear me, but the video looks great. Can you share what models you are using? You say these are all open models.


The model doing the heavy lifting is https://github.com/Rudrabha/Wav2Lip

Mic permissions on mobile are tricky; that might have been your issue. Note that in this prototype you also need to hold the blue button down to speak.


Interesting. I didn’t think you could get anything close to realtime with Wav2Lip.


With a dedicated GPU and some cleverness it can be relatively quick. I split the response on punctuation and generate smaller clips in a pipeline. I haven't taken the model apart to try streaming the frames coming out of ffmpeg yet, but that would probably help a lot.
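The splitting-and-pipelining idea can be sketched roughly like this. This is a minimal illustration, not the commenter's actual code: `synthesize_clip` is a hypothetical stand-in for the TTS + Wav2Lip step, and the punctuation regex is one plausible way to break a response into clause-sized chunks.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def split_on_punctuation(text):
    """Break a response into chunks at sentence-ending punctuation."""
    # Split on whitespace that follows . ! or ? so each chunk keeps
    # its punctuation; drop any empty fragments.
    chunks = re.split(r'(?<=[.!?])\s+', text.strip())
    return [c for c in chunks if c]

def synthesize_clip(chunk):
    # Hypothetical placeholder for running TTS + Wav2Lip on one chunk.
    return f"clip({chunk})"

def pipeline(text, workers=2):
    """Generate small clips concurrently, yielding them in order
    so playback can start as soon as the first clip is ready."""
    chunks = split_on_punctuation(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order even though clips render
        # in parallel, so the client can play them back-to-back.
        yield from pool.map(synthesize_clip, chunks)
```

The win is that the user hears (and sees) the first short clip while later clips are still rendering, instead of waiting for one long clip of the whole response.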



