It's also hard to do privacy-wise unless every text message she sends pre-generates the audio message using her on-device voice and then attaches it. That would make every message use 10x as much bandwidth, storage, and battery power (10x is a random number, but you get the point). Seems cute but really impractical.
I would think that your phone could request audio messages from the sender only when necessary. Phones already sync things like your DND status to show to others, so this would just be another flag. Messages could then also alert the sender that their message may be read aloud in their Personal Voice. Or maybe allow turning this on per conversation.
10x compared to what, though? FaceTime (and similar) is already full-duplex video and audio, which I have to imagine is at least another 10x on top of what you’re describing. Are we really budgeting our computing resources so strictly that this would even show up as more than a rounding error?