Getting started with speech recognition and Python
Question:
I would like to know where one could get started with speech recognition, not with a library or anything that is fairly black-boxed. Instead, I want to know how I can actually make a simple speech recognition script. I have done some searching and found not much, but what I have seen is that there are dictionaries of 'sounds' or syllables that can be pieced together to form text. So basically, my question is: where can I get started with this?
Also, since this is a little optimistic, I would also be fine with a library (for now) to use in my program. I saw that some speech-to-text libraries and APIs spit out only one result. This is OK, but a single result would be unreliable. My current program already checks the grammar of any text entered, so if I had, say, the top ten results from the speech-to-text software, it could check each one and rule out any that don't make sense.
Answers:
Dragonfly provides a clean framework for speech recognition on Windows. Check their documentation for example usage. Since you aren't looking for the full scale of features Dragonfly provides, you might want to take a look at the no-longer-maintained PySpeech library.
Its source code looks easy to understand, and maybe that's what you want to look at first.
If you really want to understand speech recognition from the ground up, look for a good signal processing package for Python and then read up on speech recognition independently of the software.
But speech recognition is an extremely complex problem (basically because sounds interact in all sorts of ways when we talk). Even if you start with the best speech recognition library you can get your hands on, you’ll by no means find yourself with nothing more to do.
UPDATE: this no longer works, because Google closed its platform.
—
You can use pygsr: https://pypi.python.org/pypi/pygsr
$> pip install pygsr
example usage:
from pygsr import Pygsr
speech = Pygsr()
# duration in seconds
speech.record(3)
# select the language
phrase, complete_response = speech.speech_to_text('en_US')
print(phrase)
Pocketsphinx is also a good alternative. There are Python bindings provided through SWIG that make it easy to integrate into a script.
For example:
from os import path
from itertools import islice
from pocketsphinx import Decoder

MODELDIR = "../../../model"
DATADIR = "../../../test/data"

# Create a decoder with a certain model
config = Decoder.default_config()
config.set_string('-hmm', path.join(MODELDIR, 'hmm/en_US/hub4wsj_sc_8k'))
config.set_string('-lm', path.join(MODELDIR, 'lm/en_US/hub4.5000.DMP'))
config.set_string('-dict', path.join(MODELDIR, 'lm/en_US/hub4.5000.dic'))
decoder = Decoder(config)

# Decode a static file.
decoder.decode_raw(open(path.join(DATADIR, 'goforward.raw'), 'rb'))

# Retrieve the best hypothesis.
hypothesis = decoder.hyp()
print('Best hypothesis:', hypothesis.best_score, hypothesis.hypstr)
print('Best hypothesis segments:', [seg.word for seg in decoder.seg()])

# Access the N best decodings.
print('Best 10 hypotheses:')
for best in islice(decoder.nbest(), 10):
    print(best.hyp().best_score, best.hyp().hypstr)

# Decode streaming data.
decoder = Decoder(config)
decoder.start_utt('goforward')
stream = open(path.join(DATADIR, 'goforward.raw'), 'rb')
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
decoder.end_utt()
print('Stream decoding result:', decoder.hyp().hypstr)
For those who want to get deeper into the subject of speech recognition in Python, here is a useful link:
- http://www.slideshare.net/mchua/sigproc-selfstudy-17323823 – signal processing in Python, with audio signals among the most interesting to play with.
I know the question is old, but just for people in the future: I use the speech_recognition module and I love it. The only thing is, it requires an Internet connection because it uses Google to recognize the speech. But that shouldn't be a problem in most cases. The recognition works almost perfectly.
EDIT:
The speech_recognition package can use more than just Google to recognize speech, including CMU Sphinx (which allows offline recognition), among others. The only difference is a subtle change in the recognize call:
https://pypi.python.org/pypi/SpeechRecognition/
Here is a small code example:
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:  # use the default microphone as the audio source
    audio = r.listen(source)     # listen for the first phrase and extract it into audio data
try:
    print("You said " + r.recognize_google(audio))  # recognize speech using Google Speech Recognition - ONLINE
    print("You said " + r.recognize_sphinx(audio))  # recognize speech using CMU Sphinx - OFFLINE
except LookupError:  # speech is unintelligible
    print("Could not understand audio")
There is just one thing that doesn't work well for me: listening in an infinite loop. After some minutes it hangs. (It's not crashing, it's just not responding.)
EDIT:
If you want to use the microphone without the infinite loop, you should specify a recording length.
Example code (the original passed a "time_to_record" placeholder string; listen takes a numeric phrase_time_limit in seconds):
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Speak:")
    audio = r.listen(source, timeout=None, phrase_time_limit=10)  # stop after at most 10 seconds
This may be the most important thing to learn: the elementary concepts
of Signal Processing, in particular, Digital Signal Processing (DSP).
A little understanding of the abstract concepts will prepare you for
the bewildering cornucopia of tools in, say, scipy.signal.
First is analog-to-digital conversion (ADC). This is really in the domain of audio engineering and is, nowadays, part of the recording process, even if all you are doing is hooking a microphone to your computer.
If you are starting with analog recordings, this may be a question of converting old tapes or vinyl LPs to digital form, or extracting the audio from old video tapes. The easiest approach is to play the source into the audio input jack of your computer and use the built-in hardware and software to capture a raw Linear Pulse Code Modulation (LPCM) digital signal to a file. Audacity, which you mentioned, is a great tool for this, and more.
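To make the LPCM idea concrete, here is a minimal sketch using Python's standard-library wave module to write and read back a 16-bit mono LPCM file. The file name, sample rate, and the synthetic 440 Hz tone are made-up example values, not anything from the answer above:

```python
import math
import struct
import wave

rate, seconds, freq = 16000, 1, 440  # assumed sample rate and test tone

# Synthesize one second of a sine tone as 16-bit signed samples,
# the same raw LPCM representation a sound card would capture.
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate)))
    for i in range(rate * seconds)
)

with wave.open("tone.wav", "wb") as w:  # write 16-bit mono LPCM
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(frames)

with wave.open("tone.wav", "rb") as w:  # read the raw signal back
    print(w.getframerate(), w.getnframes())  # 16000 16000
```

The bytes written to the file are exactly the digitized signal; everything downstream (features, recognition) starts from an array like this.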
The Fourier Transform is your friend. In Data Science terms it is
great for feature extraction, and feature-space dimension reduction,
particularly if you’re looking for features that span changes in sound
over the course of the entire sample. No space to explain here, but
raw data in the time domain is much harder for machine learning
algorithms to deal with than raw data in the frequency domain.
In particular you will be using the Fast Fourier Transform (FFT), a
very efficient form of the Discrete Fourier Transform (DFT). Nowadays
the FFT is usually done in the DSP hardware.
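To get a feel for why the frequency domain is friendlier, here is a small sketch using NumPy's FFT to pull the dominant frequency out of a synthetic tone. The 440 Hz tone and 16 kHz sample rate are assumptions for the example; real speech features would use windowed short-time FFTs rather than one transform over the whole signal:

```python
import numpy as np

sample_rate = 16000                        # assumed sample rate
t = np.arange(0, 1.0, 1.0 / sample_rate)   # 1 second of sample times
signal = np.sin(2 * np.pi * 440 * t)       # pure 440 Hz tone

spectrum = np.abs(np.fft.rfft(signal))     # magnitude spectrum via FFT
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

peak_hz = freqs[np.argmax(spectrum)]       # dominant frequency
print(peak_hz)                             # 440.0
```

In the time domain the tone is 16000 oscillating numbers; in the frequency domain it collapses to a single sharp peak, which is the kind of compact feature a recognizer can actually work with.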
import speech_recognition as SRG
import time

store = SRG.Recognizer()
with SRG.Microphone() as s:
    print("Speak...")
    audio_input = store.record(s, duration=7)
    print("Recording time:", time.strftime("%I:%M:%S"))

try:
    text_output = store.recognize_google(audio_input)
    print("Text converted from audio:\n")
    print(text_output)
    print("Finished!")
    print("Execution time:", time.strftime("%I:%M:%S"))
except Exception:
    print("Couldn't process the audio input.")
This should work. The audio input from your default microphone will be saved in text form in the text_output variable.
You can check out this link for more info: https://www.journaldev.com/37873/python-speech-to-text-speechrecognition
What we basically do is first record the audio from the microphone and then use that audio as input for the speech recognizer. The only requirements are an active Internet connection and the two Python libraries, speech_recognition and pyaudio.
Here is a simple way to get started with speech recognition in Python, using the online SpeechRecognition library and the Google API call recognize_google().
Requirements:
1. Python 3.9
2. Anaconda (launch Jupyter)
Step 1:
pip install pyaudio
pip install speechrecognition
Step 2:
import speech_recognition as sr
r = sr.Recognizer()
Step 3:
# use the default microphone as the audio source
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=5)
    audio = r.listen(source)
Step 4:
print("Speak Anything:")
try:
    text = r.recognize_google(audio)
    print("You said: {}".format(text))
except:
    print("Sorry, could not recognize what you said")
Step 5:
If the above code runs but responds slowly or sometimes gives a blank result, also add this line:
r.pause_threshold = 1
Complete code:
pip install pyaudio
pip install speechrecognition

Here is the code:
import speech_recognition as sr

def recog():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Say Something")
        r.pause_threshold = 1
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    try:
        print("Recognizing..")
        text = r.recognize_google(audio, language='en-in')  # specify the language code
        print("You said {}".format(text))
    except Exception as e:
        print(e)
        print("Sorry")

recog()