Voice Recognition and Text to Speech in Python (ggulati.wordpress.com)
192 points by ggulati on Feb 25, 2016 | 50 comments



FWIW, IBM has a wonderful speech-to-text API...I've put together a repo of examples and Python code:

https://github.com/dannguyen/watson-word-watcher

One of the great things about it is the word-level timestamp and confidence data it returns...here are a few supercuts I've made from the presidential primary debates:

https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...

It's not perfect by any means, but the granular results give you a place to start from...here's a supercut of cuss words from a well-known episode of The Wire...only 59 such words were heard by Watson even though one scene contains 30+ F-bombs alone:

https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be

The service is free for the first 1000 minutes each month.
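
If you want to poke at it, here's a minimal sketch of a request with word timestamps and confidences turned on (the endpoint and parameter names are per the v1 REST API as of this writing; USERNAME/PASSWORD stand in for your Bluemix service credentials):

    import json
    import requests

    URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

    # Send a WAV file and ask for per-word timestamps and confidences.
    with open("clip.wav", "rb") as f:
        resp = requests.post(
            URL,
            auth=("USERNAME", "PASSWORD"),
            headers={"Content-Type": "audio/wav"},
            params={"timestamps": "true", "word_confidence": "true"},
            data=f,
        )

    print(json.dumps(resp.json(), indent=2))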


It took me a while to understand what you did here. I was waiting for some kind of subtitles showing the recognition ability.

But you are saying you performed speech recognition on the full video, then edited it according to where the words you targeted were found. I liked the bomb/terrorist one; the others didn't seem to be "saying" anything.


Yeah, I was a bit lazy...I could have used moviepy (which I currently use, but merely as a wrapper around ffmpeg) to add subtitles showing which word was identified...I'm hoping to make this into a command-line tool for myself to quickly transcribe things...though making supercuts is just a fun way to demonstrate the concepts.

The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they're optimized for user commands) and tokenizes it...what you get is a timestamp for when each recognized word appears, as well as a confidence level and alternatives if you so specify. Other speech-transcription options don't always provide this...I don't think PocketSphinx does, for example, nor does sending your audio to an mTurk-based transcription service.

Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:

https://github.com/dannguyen/watson-word-watcher/tree/master...
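
The flattening itself is straightforward; a rough sketch, assuming the response shape shown in the repo (each result's top alternative carries parallel "timestamps" entries of [word, start, end] and "word_confidence" entries of [word, confidence]):

    import csv
    import json

    with open("transcript.json") as f:
        data = json.load(f)

    with open("words.csv", "w") as out:
        writer = csv.writer(out)
        writer.writerow(["word", "start", "end", "confidence"])
        for result in data["results"]:
            best = result["alternatives"][0]
            # The two lists run in parallel, one entry per recognized word.
            for (word, start, end), (_, conf) in zip(
                best["timestamps"], best["word_confidence"]
            ):
                writer.writerow([word, start, end, conf])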


YouTube speech recognition is getting quite good, at least for talking heads in English. Are there other top-tier APIs besides IBM's?


AT&T has its own "Watson"...but it requires signing up for a premium account, which I think involves an upfront cost:

http://developer.att.com/apis/speech

Twilio has one that also requires payment:

https://www.twilio.com/docs/api/rest/transcription

It limits input audio to 2 minutes. And I would have to guess that its model is specifically tuned to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.


Kids, it's called "speech recognition". Voice recognition also exists, but it's the task of identifying a user based on his/her voice, not the task of transcribing spoken input as text.


Dad, I told you not to use my hacker news account. Log out please.


Are there any decent open source projects out there (preferably with Python APIs) that do speaker or "voice recognition" reasonably well? I know this is an area of active research in academia.


Kids?


He jests.


It really would be amazing to get voice recognition software that recognizes at least a small enough fraction of our language to be useful, without having to reach the cloud. It's definitely a dream I hope we one day achieve. Thanks for the article; I'll test it on my day off and play with it a bit.


PocketSphinx/Sphinx with a small, use-case-specific dictionary showed much better accuracy for my accent and speech defects than any of these cloud-based recognition systems. I used a standard acoustic model, but it probably would have been even more accurate had I trained a custom acoustic model.

For simple use cases like home automation or desktop automation, I think it's a more practical approach than depending on a cloud API.
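
A minimal sketch of that setup, assuming the pocketsphinx package's LiveSpeech helper, where commands.lm and commands.dic are hypothetical files generated from a short command list (via the CMU lmtool I mention downthread):

    from pocketsphinx import LiveSpeech

    # Point the decoder at a tiny language model and dictionary built
    # from just the command words, instead of the full en-us models.
    speech = LiveSpeech(
        lm="commands.lm",
        dic="commands.dic",
    )

    # LiveSpeech yields one hypothesis per detected utterance.
    for phrase in speech:
        print("Heard:", phrase)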


I have an easy wrapper for pocketsphinx here [1] - it has come in handy for me in the past. Another option is a GStreamer server with Kaldi [2].

[1] https://github.com/kastnerkyle/ez-phones

[2] https://www.reddit.com/r/MachineLearning/comments/3pr4v4/are...


I haven't tried out PocketSphinx myself...could you describe the training process, e.g. how long did it take, how much audio did you have to record, and how easy was it to iterate to improve accuracy?


PocketSphinx/Sphinx use three models - an acoustic model, a language model, and a phonetic dictionary. I'm no expert, but as I understand them: the acoustic model converts audio samples into phonemes(?), the language model contains the probabilities of sequences of words, and the phonetic dictionary is a mapping of words to phonemes.

Initially, I just used the standard en-us acoustic model, the US English generic language model, and its associated phonetic dictionary. This was the baseline for judging accuracy. It was OK, but neither fast nor very accurate (likely due to my accent and speech defects). I'd say it was about 70% accurate.

Simply reducing the size of the vocabulary boosts accuracy, because there is that much less chance of a mistake. It also improves recognition speed. For each of my use cases (home and desktop automation), I created a plain-text file with the relevant command words, then used their online tool [1] to generate a language model and phonetic dictionary from it.

For the acoustic model, there are two approaches - "adapting" and "training". Training is from scratch, while adapting adjusts a standard acoustic model to better match a personal accent, dialect, or speech defects.

I found training as described in [2] rather intimidating, and never tried it out. It's likely to take a lot of time (a couple of days at least, I think, based on my adaptation experience).

Instead I "adapted" the en-us acoustic model [3]. About an hour to come up with some grammatically correct text that included all the command words and phrases I wanted. Then reading it aloud while recording using Audacity. I attempted this multiple times, fiddling around with microphone volume and gain, trying to block ambient noise (I live in a rather noisy env), redoing it, final take. Took around 8 hours altogether with breaks. Finally generating the adapted acoustic model. About an hour.

About 95% of the time it understands what I say; about 5% of the time I have to repeat myself, especially with phrases.

I did this on both a desktop and a Raspberry Pi. The Pi is the one managing home automation. I'm happy with it :)

[1]: http://www.speech.cs.cmu.edu/tools/lmtool-new.html

[2]: http://cmusphinx.sourceforge.net/wiki/tutorialam

[3]: http://cmusphinx.sourceforge.net/wiki/tutorialadapt

PS: Reading their documentation and searching for downloads took more time than the actual task. They really need to improve both.


If it's not confidential, can you describe what kinds of automation you used this for, particularly the desktop automation?

I was interested in automating the transcription of my own spoken reminders and other such audio files, taken on the PC or on a portable voice recorder, hence my earlier trials. But at the time nothing worked out well enough, IIRC.


Nothing confidential at all :). I was playing with them because I personally don't like using a keyboard and mouse, and I also have some ideas for making computing easier for handicapped people.

My current desktop automation does command recognition - commands like "open editor / email / browser", "shutdown", "suspend"...about 20 commands in all. 'pocketsphinx_continuous' is started as a daemon at startup and keeps listening in the background (I'm on Ubuntu).

I think that from a speech-recognition-internals point of view, transcription is more complex than recognizing these short command phrases. The training or adaptation corpus would have to be much larger than what I used.
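
To give an idea of the dispatch side (an illustrative sketch, not the actual setup - the commands shown are placeholders), it's essentially a lookup table from recognized phrases to shell actions:

    import subprocess

    # Map each recognized command phrase to a shell action.
    ACTIONS = {
        "open browser": ["firefox"],
        "open editor": ["gedit"],
        "suspend": ["systemctl", "suspend"],
    }

    def dispatch(phrase):
        """Run the action mapped to a recognized phrase, if any."""
        action = ACTIONS.get(phrase.strip().lower())
        if action:
            subprocess.Popen(action)
        else:
            print("Unrecognized command:", phrase)

    dispatch("open browser")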


Thanks. Good uses.

Heh, the voice "shutdown" command you mention reminds me of a small assembly-language routine I used to use to reboot MS-DOS PCs; it was just a single instruction to jump to the BIOS (cold?) boot entry point, IIRC (JMP F000:FFF0 or something like that). I used to enter it into DOS's DEBUG.COM utility with the A command (for Assemble) and then write it out to disk as a tiny .COM file. (IOW, you did not even need an assembler to create it.)

Then you could reboot the PC just by typing:

REBOOT

at the DOS prompt.

I did all kinds of tricks of the trade (not just like that - many other kinds) in the earlier DOS and (even more) UNIX days ... Good fun, and useful to customers many a time too, including saving their bacon (aka data) multiple times (with, of course, no backups on their part).


My impression is that the super-accurate stuff like Google's voice recognition and Siri is all fueled by massive amounts of data. So you build up these recognition networks from a bunch of data sources and they get better over time, but the recognition is based more on the data than on the code.

It's the whole "memory is a process, not a hard drive" thing: voice recognition as it is today is a slowly evolving graph built from input data. You could in theory compress the graph and have it available offline, but it would be hard to chop it up in a way that doesn't completely bust the recognition.


There's actually some research on compressing ANNs to a size where they could be embedded in all sorts of devices. I think I saw something about it on HN a few months back?


> It really would be amazing to be able to get voice recognition software that covers at least recognizing a small enough fraction of our language to be useful without having to reach the cloud.

Well, I guess at some point this functionality will become part of the OS. When OS X and Windows offer this, Linux cannot stay behind, and we will see open source speech recognition libraries.


I'm hoping some open source libraries will come out of this - it's state of the art:

https://github.com/baidu-research/warp-ctc


There are plenty of those. Voice recognition is nothing new—I remember playing around with "speakable items" back in Mac OS 7. It did well enough to recognize certain key words and phrases.


No kidding. I had a toy "robot" that responded to 4 or 5 voice commands when I was a child in the late 80s...


Every Mac with OS X 10.9 and later comes with speech recognition software that works without internet access and is really, really good. You can dictate entire documents and emails, and even have it type commands in the terminal, etc. In OS X 10.11 you can even drive the UI by speech alone.


> It really would be amazing to be able to get voice recognition software that covers at least recognizing a small enough fraction of our language to be useful without having to reach the cloud.

Are there any academic groups working on this topic, and do they have prototype implementations?


Julius [1] is a pretty good offline speech recognition engine. In my tests it seems to have about 95% accuracy with grammar-based models, and it supports continuous dictation. There is also a decent Python module, which supports Python 2, and Python 3 with a few tweaks.

HOWEVER:

The only continuous-dictation models available for Julius are Japanese, as it is a Japanese project. This is mainly an issue of training data. The VoxForge project is working towards releasing an English model once they collect 140 hours of training data (last time I checked they were around 130); but even so, the quality is likely to be far below that of commercial speech recognition products, which generally have thousands of hours of training data.

[1] http://julius.osdn.jp/en_index.php


Julius is my preferred speech recognition engine. I've built an application[0] which enables users to control their Linux desktops with their voices, and uses Julius to do the heavy lifting.

[0]: https://github.com/SacredData/COMPUTER


After a quick look, it seems Julius doesn't use the new deep-learning stuff?

In terms of data, http://www.openslr.org/12/ says it has 300+ hours of speech+text from LibriVox audiobooks. Using LibriVox recordings seems like a great idea for building a freely available large dataset.


Don't expect this to be anything like modern "good" speech recognition. Sphinx is definitely from the '00s, when it seemed like speech recognition would never be solved.

Apparently Kaldi is a lot better, but good luck setting it up!


Another project along similar lines is the Jasper Project[0], which has received some HN coverage in the past several years[1]. It interfaces with many of the same speech recognition and text-to-speech libraries.

[0] https://jasperproject.github.io/

[1] https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...


Very cool! I just started playing with speech recognition in Python for home automation this week. I'm controlling some WeMo switches and my PC with an Android tablet using AutoVoice, and it works well as a proof of concept, but AutoVoice doesn't always register commands, and the "Okay, Google" speech-to-text can be slow at times. I'd like it to take less than 5 seconds between saying "TV off" and the TV actually turning off; with AutoVoice it's anywhere from 3s to 25s depending on the lag. I also figure that with real code, I can get commands that are more flexible than AutoVoice's regex.
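
The rough direction I'm headed, sketched with the pywemo library for switch control and the SpeechRecognition package's offline Sphinx backend to cut out the cloud round-trip (the device name and phrases are just placeholders):

    import pywemo
    import speech_recognition as sr

    # Find WeMo switches on the LAN and index them by name.
    switches = {d.name.lower(): d for d in pywemo.discover_devices()}

    r = sr.Recognizer()
    with sr.Microphone() as source:
        audio = r.listen(source)

    # Offline recognition, so no "Okay, Google" round-trip lag.
    command = r.recognize_sphinx(audio).lower()
    if "off" in command:
        switches["tv"].off()
    elif "on" in command:
        switches["tv"].on()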

Aside from circumventing lag, I can also give it some personality. I want to name it Marvin, after the robot from H2G2, so that I can say:

"Marvin, turn the TV off"

"Here I am, brain the size of a planet, and you ask me to turn off the tv. Call that job satisfaction, 'cause I don't."


They should move from Sphinx to Kaldi and from GMM to DNN acoustic models. Instant 30% improvement.



Does Kaldi need Windows? I only saw installation instructions for Windows. Also... I just tried PocketSphinx; it says it works on Windows and Linux. So... no non-Apple, cross-platform speech rec for us Mac devs?


On the contrary: AFAIK they officially support only Linux, and the community provides Windows support. I don't know about Mac support. Also AFAIK, Kaldi mainly targets server and desktop applications.


They use GitHub now.


For folks who want to try this at home on Mac OS X, you'll need to change 'sapi5' to 'nsss' in the line 'speech_engine = pyttsx.init('sapi5')'.

I also had to 'brew install portaudio flac swig' and a bunch of other Python libs. By the time it ran, 'pip freeze' returned:

    altgraph==0.12
    macholib==1.7
    modulegraph==0.12.1
    py2app==0.9
    PyAudio==0.2.9
    pyobjc==3.0.4
    pyttsx==1.1
    SpeechRecognition==3.3.0
    pocketsphinx==0.0.9
My fork of the gist is here: https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90
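
For a script that runs everywhere, a platform-aware version of that init line (a sketch; pyttsx will also pick a sensible default driver if you call init() with no argument):

    import sys
    import pyttsx

    if sys.platform == "darwin":
        speech_engine = pyttsx.init("nsss")    # OS X NSSpeechSynthesizer
    elif sys.platform == "win32":
        speech_engine = pyttsx.init("sapi5")   # Windows SAPI 5
    else:
        speech_engine = pyttsx.init("espeak")  # eSpeak on Linux/*nix

    speech_engine.say("Hello from pyttsx")
    speech_engine.runAndWait()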


Nice work, ggulati. I had done some roughly similar but more basic stuff using the same or similar libraries (though you researched more libs) a while ago:

Recognizing speech (speech-to-text) with the Python speech module

https://code.activestate.com/recipes/579115-recognizing-spee...

and

Python text-to-speech with pyttsx

https://code.activestate.com/recipes/578839-python-text-to-s...

Good stuff. I like this area.


Microsoft's translation API has a free tier of 1 million characters/month for text-to-speech, with male and female voices.

The quality is good enough, and it's a good start for those who cannot afford to pay for Google's API.


Just checked - it's 2 million characters/month for free.


Excellent post. Very interesting. I see how it works, but I'm using Python 2.7, so based on your headline I suppose it won't work for me. This is the first real lead I've seen for integrating this easily. The pricing isn't terrible if it goes to production. Too bad there's no way to test it first during development. But we're lucky to have this at all.

The link to the VLC library is pretty handy.


Most of the stuff I found was for Python 2.7! I'll edit that into the post. My focus was on finding libraries that work with new Python code, e.g. Python 3.5.

All of those libraries have Python 2.7 versions. Actually for all of them you pip install the same library; for pyttsx, `pip install pyttsx` and ignore jpercent's update.

I'm not sure what you mean about pricing and testing for development. Are you referring to Google's services? They offer 50 reqs/day for voice recognition on a free developer API key (https://www.chromium.org/developers/how-tos/api-keys). Google Translate can also be used via gTTS; it will rate-limit or block you if you send too many reqs/min or per day without an appropriately registered API key, but you could certainly play around with it.
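
For example, a quick way to play with the Google Translate route via the gTTS package (subject to the same rate limits):

    from gtts import gTTS

    # Fetch synthesized speech from Google Translate and save it as MP3.
    tts = gTTS(text="Hello, this is a test of text to speech", lang="en")
    tts.save("hello.mp3")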

If voice recognition is important, it might be worth investigating Sphinx further and putting in the time to tweak their English language model files. Synthesis is more difficult, though I think Windows SAPI, OS X NSSS, and eSpeak on *nix are all "good enough." There is also a range of commercial libraries.


I too thought it was Python 3 only before I read it. Maybe a better title would be "Coding Jarvis in Python in 2016", with the first paragraph explaining that it's Python 2 and 3 compatible, with your personal focus on 3?


Thanks for the feedback; I updated the blog post.


I've had a problem with the speech_recognition library in that it doesn't stop listening when silence occurs.

After trying to tweak the threshold parameters without success, I just figured I'd add a custom key command to break the listening loop in my project.
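
For reference, these are the knobs I was fiddling with (a sketch using the SpeechRecognition package; the values are illustrative, not recommendations):

    import speech_recognition as sr

    r = sr.Recognizer()
    r.pause_threshold = 0.8             # seconds of silence that end a phrase
    r.dynamic_energy_threshold = False  # stop auto-adjusting mid-stream
    r.energy_threshold = 400            # raise this in noisy environments

    with sr.Microphone() as source:
        # listen() should return once pause_threshold of silence is heard
        audio = r.listen(source)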


Does this work without an internet connection (once downloaded)? If yes, how big is the download footprint? I still haven't gone through the webpage carefully.


There is Project Sirius, which does this - take a look:

http://sirius.clarity-lab.org/category/watch/


If you use Sphinx for speech recognition and pyttsx for text-to-speech (Windows Speech API, OS X NSSS, or eSpeak on Linux), it all works offline - see the "Jarvis's Brain" section.


No - except for the STT part using Sphinx, which is tricky to set up to be accurate enough (it seems the author of the OP didn't go that far).



