If their multi-modal model works similarly to any of the existing voice models, then they would only need a fairly small number of samples. Current voice models only require you to prefix the input with a voice sample in order to mimic it. They also still have the other voices available, so I really doubt the ScarJo kerfuffle has anything to do with the delay. The spicier take is that they're having trouble preventing users from convincing the model to roleplay NSFW interactions. I have no doubt there will be another round of pearl-clutching after the new voice mode goes live, when someone posts their phone sex session.
"We need time to record with a different voice actress."