Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/
The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.
With a dedicated GPU and some cleverness it can be relatively quick. I split the response on punctuation and generate smaller clips in a pipeline. I haven't taken the model apart to try streaming the frames coming out of ffmpeg yet, but that would probably help a lot.
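A minimal sketch of that chunking approach (the `generate_clip` call is a hypothetical stand-in for the actual TTS + talking-head model; the punctuation regex is just one reasonable choice):

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_on_punctuation(text):
    # Break the response into clause-sized chunks at sentence
    # punctuation, keeping the punctuation attached to each chunk.
    chunks = re.split(r'(?<=[.!?,;])\s+', text.strip())
    return [c for c in chunks if c]


def generate_clip(chunk):
    # Hypothetical placeholder for the real video-generation call.
    return f"clip({chunk})"


def pipeline(text, workers=4):
    # Generate clips concurrently; playback can start as soon as the
    # first clip is ready instead of waiting on the full response.
    chunks = split_on_punctuation(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(generate_clip, chunks)
```

`pool.map` preserves input order, so clips come out ready to play back-to-back even though they're generated in parallel.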