Save microphone audio input when using Azure Speech to Text

Question:

I’m currently using Azure Speech to Text in my project. It recognizes speech input directly from the microphone (which is what I want) and saves the text output, but I’m also interested in saving that audio input so that I can listen to it later on. Before moving to Azure I was using the Python SpeechRecognition library with recognize_google, which allowed me to use get_wav_data() to save the input as a .wav file. Is there something similar I can use with Azure? I read the documentation but could only find ways to save audio files for text to speech. My temporary solution is to save the audio input myself first and then run Azure STT on that audio file rather than using the microphone directly, but I’m worried this will slow down the process. Any ideas?
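A minimal sketch of that file-based workaround (the filename is a placeholder, and SPEECH_KEY/SPEECH_REGION are assumed to be set in the environment):

import os
import azure.cognitiveservices.speech as speechsdk

# recognize from a previously saved WAV file instead of the live microphone
speech_config = speechsdk.SpeechConfig(subscription=os.environ['SPEECH_KEY'],
                                       region=os.environ['SPEECH_REGION'])
audio_config = speechsdk.audio.AudioConfig(filename='saved_input.wav')
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                               audio_config=audio_config)
print(speech_recognizer.recognize_once_async().get().text)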
Thank you in advance!

Asked By: macper


Answers:

This is Darren from the Microsoft Speech SDK Team. Unfortunately, at the moment there is no built-in support for simultaneously doing live recognition from a microphone and writing the audio to a WAV file. We have heard this customer request before and we will consider adding this feature in a future version of the Speech SDK.

What I think you can do at the moment (it will require a bit of programming on your part) is use the Speech SDK with a push stream. You can write code that reads audio buffers from the microphone and writes them to a WAV file, and at the same time push those same buffers into the Speech SDK for recognition. We have a Python sample showing how to use the Speech SDK with a push stream; see the function "speech_recognition_with_push_stream" in this file: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/python/console/speech_sample.py. However, I’m not familiar with the Python options for reading real-time audio buffers from a microphone and writing them to a WAV file.
Darren

Answered By: Darren Cohen
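
For illustration, here is a minimal sketch of the push-stream approach Darren describes (not an official sample; it assumes pyaudio for microphone capture, and the chunk size and fixed ten-second pump are arbitrary choices):

#!/usr/bin/env python3
# Sketch: pump microphone buffers into a Speech SDK push stream while keeping
# a copy of every buffer for a WAV file. SPEECH_KEY and SPEECH_REGION are
# assumed to be set in the environment; mic_input.wav is a placeholder path.
import os
import wave

import pyaudio
import azure.cognitiveservices.speech as speechsdk

RATE, CHUNK = 16000, 1024  # the push stream's default format is 16 kHz, 16-bit, mono PCM

speech_config = speechsdk.SpeechConfig(subscription=os.environ['SPEECH_KEY'],
                                       region=os.environ['SPEECH_REGION'])
push_stream = speechsdk.audio.PushAudioInputStream()
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=speechsdk.audio.AudioConfig(stream=push_stream))

pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
              input=True, frames_per_buffer=CHUNK)

result_future = speech_recognizer.recognize_once_async()  # recognition runs in the background
frames = []
for _ in range(int(RATE / CHUNK * 10)):  # pump roughly ten seconds of audio
    data = mic.read(CHUNK)
    frames.append(data)          # keep a copy for the WAV file ...
    push_stream.write(data)      # ... and feed the same buffer to the SDK
push_stream.close()
mic.close()
pa.terminate()

print(result_future.get().text)  # blocks until the first utterance is recognized

with wave.open('mic_input.wav', 'wb') as wave_file:
    wave_file.setnchannels(1)
    wave_file.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    wave_file.setframerate(RATE)
    wave_file.writeframes(b''.join(frames))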

If you use Azure’s speech_recognizer.recognize_once_async(), you can capture the microphone audio with pyaudio at the same time. Below is the code I use:

#!/usr/bin/env python3

# enter your output path here:
output_file='/Users/username/micaudio.wav'

import os
import signal
import sys
import wave

import pyaudio
import requests
import azure.cognitiveservices.speech as speechsdk

pa = pyaudio.PyAudio()

def vocrec_callback(in_data, frame_count, time_info, status):
    # pyaudio calls this from a background thread for every captured buffer,
    # so recording continues while recognize_once_async() blocks
    global voc_data
    voc_data['frames'].append(in_data)
    return (in_data, pyaudio.paContinue)

def vocrec_start():
    global voc_stream
    global voc_data
    voc_data = {
        'channels':1 if sys.platform == 'darwin' else 2,
        'rate':44100,
        'width':pa.get_sample_size(pyaudio.paInt16),
        'format':pyaudio.paInt16,
        'frames':[]
    }
    voc_stream = pa.open(format=voc_data['format'],
                    channels=voc_data['channels'],
                    rate=voc_data['rate'],
                    input=True,
                    output=False,
                    stream_callback=vocrec_callback)
    
def vocrec_stop():
    voc_stream.stop_stream()
    voc_stream.close()

def vocrec_write():
    with wave.open(output_file, 'wb') as wave_file:
        wave_file.setnchannels(voc_data['channels'])
        wave_file.setsampwidth(voc_data['width'])
        wave_file.setframerate(voc_data['rate'])
        wave_file.writeframes(b''.join(voc_data['frames']))

class SIGINT_handler():
    def __init__(self):
        self.SIGINT = False
    def signal_handler(self, sig, frame):
        self.SIGINT = True
        print('You pressed Ctrl+C!')
        vocrec_stop()
        quit()

def init_azure():
    global speech_recognizer
    #  ——— check azure keys
    my_speech_key = os.getenv('SPEECH_KEY')
    if my_speech_key is None:
        error_and_quit("Error: No Azure Key.")
    my_speech_region = os.getenv('SPEECH_REGION')
    if my_speech_region is None:
        error_and_quit("Error: No Azure Region.")
    _headers = {
        'Ocp-Apim-Subscription-Key': my_speech_key,
        'Content-type': 'application/x-www-form-urlencoded',
        # 'Content-Length': '0',
    }
    _URL = f"https://{my_speech_region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    _response = requests.post(_URL, headers=_headers)
    if _response.status_code != 200:
        error_and_quit("Error: Wrong Azure Key Or Region.")
    #  ——— keys correct. continue
    speech_config = speechsdk.SpeechConfig(subscription=my_speech_key,
                                           region=my_speech_region)
    audio_config_stt = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, 'true')
    #  ——— disable profanity filter ("2" = raw, i.e. no masking):
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceResponse_ProfanityOption, "2")
    speech_config.speech_recognition_language="en-US"
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config_stt)

def error_and_quit(_error):
    print(_error)
    quit()

def recognize_speech():
    vocrec_start()
    print("Say something: ")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    print("Recording done.")
    print(speech_recognition_result.text)  # recognized text from Azure
    vocrec_stop()
    vocrec_write()
    quit()

handler = SIGINT_handler()
signal.signal(signal.SIGINT, handler.signal_handler)

init_azure()
recognize_speech()
Answered By: bozolino