If their multimodal model works similarly to any of the existing voice models, then they would only need a fairly small number of samples. Current voice models only require you to prefix a voice sample in order to mimic it. They also still have the other voices available, so I really doubt the ScarJo kerfuffle has anything to do with the delay. The spicier take is that they're having trouble preventing users from convincing the model to roleplay NSFW interactions. I have no doubt there will be another round of pearl-clutching after the new voice mode goes live, when someone posts their phone sex session.