Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/
The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.
With a dedicated GPU and some cleverness it can be relatively quick. I split the response on punctuation and generate smaller clips in a pipeline. I haven't taken the model apart to try streaming the frames coming out of ffmpeg yet, but that would probably help a lot.
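A minimal sketch of that chunking approach (the `generate_clip` call is a hypothetical stand-in for the actual TTS + talking-head model; the punctuation regex is just one reasonable choice):

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_on_punctuation(text):
    # Break the response into clause-sized chunks at sentence
    # punctuation, keeping the punctuation attached to each chunk.
    chunks = re.split(r'(?<=[.!?,;])\s+', text.strip())
    return [c for c in chunks if c]


def generate_clip(chunk):
    # Hypothetical placeholder for the real video-generation call.
    return f"clip({chunk})"


def pipeline(text, workers=4):
    # Generate clips concurrently; playback can start as soon as the
    # first clip is ready instead of waiting on the full response.
    chunks = split_on_punctuation(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(generate_clip, chunks)
```

`pool.map` preserves input order, so clips come out ready to play back-to-back even though they're generated in parallel.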