This has always been a struggle. Rhasspy can gather lists of songs, artists, etc. but it will have to guess many of their pronunciations. And it seems artist/band names often purposely thwart conventional pronunciation rules :P
Rhasspy is a more powerful general-purpose application and GUI; voice2json is more like a library or micro-service that does exactly one thing: convert a speech waveform to JSON. They share some DNA though (same syntax for defining vocabulary).
Rhasspy author here, thanks for posting! Just wanted to mention that I've joined Nabu Casa (creators of Home Assistant) this month, so Rhasspy will be receiving updates again and be a major part of Home Assistant's "Year of Voice" in 2023 :)
Thank you for your work!
I was in a panic when Snips was bought up. After some research I landed on Rhasspy as my new local-first digital assistant, and it's been fantastic. Been using it for a few years now with satellites around the house and the 'brain' running on a VM. I even have a Siri shortcut which transcribes my speech input and then makes an HTTP request to the 'brain' instance, so that I can use Rhasspy even when I'm not near a satellite. This even works over my VPN!
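For anyone wanting to try the same trick, here is a minimal sketch of the HTTP side, assuming the 'brain' exposes Rhasspy's web API on its default port 12101 and that the text has already been transcribed on the phone. The hostname and the helper names `build_intent_request` and `send_text` are made up for illustration; this is not the actual shortcut described above.

```python
import json
import urllib.request

# Assumption: hypothetical address of the 'brain' instance.
# 12101 is Rhasspy's default web port; adjust to your setup.
RHASSPY_URL = "http://rhasspy-brain.local:12101"


def build_intent_request(text, base_url=RHASSPY_URL):
    """Build a POST request carrying already-transcribed text.

    Rhasspy's /api/text-to-intent endpoint takes the sentence as the
    plain-text request body and responds with intent JSON.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/text-to-intent",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def send_text(text, base_url=RHASSPY_URL, timeout=5.0):
    """Send the text to Rhasspy and return the parsed intent JSON."""
    request = build_intent_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a reachable Rhasspy instance):
# intent = send_text("turn on the living room lamp")
```

Because this is an ordinary HTTP call, it works the same over a VPN as on the home network, as long as the address resolves.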
Hi all, author here. Besides the tech of Mimic 3 itself, I'm interested in training voices in as many (human) languages as possible. All it takes is one person willing to donate a dataset for everyone to benefit!
...well, that and a bunch of stuff with phonemes. But I'll do that part :)
The Mozilla Common Voice dataset is awesome - however it's useful for the opposite purpose - speech-to-text. This is because it consists of a lot of different people using a range of hardware, speaking similar phrases.
For good text-to-speech you need 1 person speaking different phrases but very consistently. Here's an example dataset from Thorsten, a German open voice enthusiast: https://openslr.org/95/
What does it take to add Chinese and Japanese to this? Surely it's a lot more than just training sets, right? I have an Android phone without access to Google TTS, so this might actually be a nice alternative.
They want you to make good quality audio recordings of you speaking about 20 000 phrases. It could take 40 to 80 hours of speaking and recording, maximum 4 hours per day.
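To put those figures in perspective, here is a quick back-of-the-envelope calculation using only the numbers quoted above. It assumes the 40 to 80 hours are total session time (retakes and pauses included) and that every day is used up to the 4-hour cap; the function names are just for illustration.

```python
PHRASES = 20_000           # phrases to record, as quoted above
HOURS_LOW, HOURS_HIGH = 40, 80
MAX_HOURS_PER_DAY = 4      # daily cap, as quoted above


def recording_days(total_hours, hours_per_day=MAX_HOURS_PER_DAY):
    """Minimum number of days needed, rounding partial days up."""
    return -(-total_hours // hours_per_day)  # ceiling division


def seconds_per_phrase(total_hours, phrases=PHRASES):
    """Average session time spent per phrase, in seconds."""
    return total_hours * 3600 / phrases


for hours in (HOURS_LOW, HOURS_HIGH):
    print(
        f"{hours} h: at least {recording_days(hours)} days, "
        f"about {seconds_per_phrase(hours):.1f} s per phrase"
    )
```

So at the cap that is at least 10 to 20 recording days, with roughly 7 to 14 seconds of session time per phrase.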
The amount of data depends on whether there's a voice for the language already. If so, about 2 hours of data is usually good enough. Otherwise, 10-20 hours usually does it.
Python is only really the glue here. The models are trained in PyTorch and exported to ONNX, then run with Microsoft's ONNX Runtime (C++). So the bulk of the inference CPU cycles are outside Python.