This is the major issue with most of this technology at the moment. There's a plethora of options available, and more soon to be unveiled by several startups who are talking up their tech... but they are almost all for "editing"/"after the recording" work. You have to have a complete recorded track you can pass into their software (usually by uploading it to their service), and then it will crunch away at the file and work its magic.
The current real-time options I've found are... lacking. They are mostly fakes/toys (not actually using voice cloning, just old-school pitch shifting) or tech demo videos, with a scattering of research papers that vary widely in terms of "how easily can I reproduce this", ranging from "sure, if I want to waste money on a Google Colab instance" to "only works with a specific model of video card due to reasons".
If you know of any real-time (audio stream in -> audio stream out) voice cloning/transform/replacement tools, feel free to post about them in a reply. This is an area of tech I'm trying to keep on top of, and I'm only human, so I have no idea what new company or research I might miss.
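For anyone unsure what "audio stream in -> audio stream out" means in practice, here is a rough sketch of the shape such a tool needs: audio arrives in small chunks, and each chunk is transformed and emitted before the next one arrives. Everything here is hypothetical and for illustration only. `identity_transform` is a placeholder standing in for a real voice model, and the 480-sample chunk at 48 kHz is just an assumed setting, not taken from any product.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]  # one block of mono audio samples


def chunk_latency_ms(chunk_samples: int, sample_rate: int) -> float:
    """Minimum delay added just by waiting for one full chunk to arrive."""
    return 1000.0 * chunk_samples / sample_rate


def split_into_chunks(samples: List[float], chunk_samples: int) -> Iterator[Chunk]:
    """Simulate a live input by handing out fixed-size blocks of samples."""
    for start in range(0, len(samples), chunk_samples):
        yield samples[start:start + chunk_samples]


def identity_transform(chunk: Chunk) -> Chunk:
    """Placeholder for a real voice transform (hypothetical; does nothing)."""
    return list(chunk)


def stream_transform(
    chunks: Iterable[Chunk], transform: Callable[[Chunk], Chunk]
) -> Iterator[Chunk]:
    """Emit each transformed chunk as soon as it is ready, never the whole file."""
    for chunk in chunks:
        yield transform(chunk)


if __name__ == "__main__":
    fake_input = [0.0] * 1000  # stand-in for microphone samples
    for out in stream_transform(split_into_chunks(fake_input, 480), identity_transform):
        print(len(out), "samples out")
    print(chunk_latency_ms(480, 48000), "ms of buffering per chunk")
```

The point of the sketch is the contrast with the "upload a finished track" tools: total latency is roughly the chunk buffering time plus however long the transform takes per chunk, so a real-time tool has to finish each chunk faster than the next one arrives.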
Hey - ElevenLabs dev here. The quality above works with <1s latency, which is already sufficient for some real-time apps. On smaller chunks of text it can be as quick as ~500ms.