Custom audio input bytes to azure cognitive speech translation service in Python

Question:

I need to be able to take custom audio bytes, which I can get from any source, and translate the speech into the language I need (currently Hindi). I have been trying to pass custom audio bytes using the following code in Python:

import azure.cognitiveservices.speech as speechsdk
from azure.cognitiveservices.speech.audio import AudioStreamFormat, PullAudioInputStream, PullAudioInputStreamCallback, AudioConfig, PushAudioInputStream


speech_key, service_region = "key", "region"

channels = 1
bitsPerSample = 16
samplesPerSecond = 16000
audioFormat = AudioStreamFormat(samplesPerSecond, bitsPerSample, channels)

class CustomPullAudioInputStreamCallback(PullAudioInputStreamCallback):

    def __init__(self):
        return super(CustomPullAudioInputStreamCallback, self).__init__()

    def read(self, file_bytes):
        print (len(file_bytes))
        return len(file_bytes)

    def close(self):
        return super(CustomPullAudioInputStreamCallback, self).close()

class CustomPushAudioInputStream(PushAudioInputStream):

    def write(self, file_bytes):
        print (type(file_bytes))
        return super(CustomPushAudioInputStream, self).write(file_bytes)

    def close():
        return super(CustomPushAudioInputStream, self).close()

translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=speech_key, region=service_region)

fromLanguage = 'en-US'
toLanguage = 'hi'
translation_config.speech_recognition_language = fromLanguage
translation_config.add_target_language(toLanguage)

translation_config.voice_name = "hi-IN-Kalpana-Apollo"


pull_audio_input_stream_callback = CustomPullAudioInputStreamCallback()
# pull_audio_input_stream = PullAudioInputStream(pull_audio_input_stream_callback, audioFormat)
# custom_pull_audio_input_stream = CustomPushAudioInputStream(audioFormat)

audio_config = AudioConfig(use_default_microphone=False, stream=pull_audio_input_stream_callback)
recognizer = speechsdk.translation.TranslationRecognizer(translation_config=translation_config,
                                                         audio_config=audio_config)


def synthesis_callback(evt):
        size = len(evt.result.audio)
        print('AUDIO SYNTHESIZED: {} byte(s) {}'.format(size, '(COMPLETED)' if size == 0 else ''))
        if size > 0:
            t_sound_file = open("translated_output.wav", "wb+")
            t_sound_file.write(evt.result.audio)
            t_sound_file.close()
        recognizer.stop_continuous_recognition_async()

def recognized_complete(evt):
    if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print("RECOGNIZED '{}': {}".format(fromLanguage, result.text))
        print("TRANSLATED into {}: {}".format(toLanguage, result.translations['hi']))
    elif evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("RECOGNIZED: {} (text could not be translated)".format(result.text))
    elif evt.result.reason == speechsdk.ResultReason.NoMatch:
        print("NOMATCH: Speech could not be recognized: {}".format(result.no_match_details))
    elif evt.reason == speechsdk.ResultReason.Canceled:
        print("CANCELED: Reason={}".format(result.cancellation_details.reason))
        if result.cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("CANCELED: ErrorDetails={}".format(result.cancellation_details.error_details))

def receiving_bytes(audio_bytes):
    # audio_bytes contain bytes of audio to be translated
    recognizer.synthesizing.connect(synthesis_callback)
    recognizer.recognized.connect(recognized_complete)

    pull_audio_input_stream_callback.read(audio_bytes)
    recognizer.start_continuous_recognition_async()


receiving_bytes(audio_bytes)

Output:
Error: AttributeError: 'PullAudioInputStreamCallback' object has no attribute '_impl'

Packages and their versions:

Python 3.6.3
azure-cognitiveservices-speech 1.11.0

File translation can be performed successfully, but I do not want to save a file for each chunk of bytes I receive.

Can I pass custom audio bytes to the Azure Speech Translation Service and get the result in Python? If yes, then how?

Asked By: Tanmay Virkar


Answers:

The example code provided uses the callback itself as the stream parameter to AudioConfig, which is not supported: the stream parameter expects an audio stream object (such as a PullAudioInputStream wrapping the callback), which is why the _impl attribute is missing.

This code should work without throwing an error:

pull_audio_input_stream_callback = CustomPullAudioInputStreamCallback()
pull_audio_input_stream = PullAudioInputStream(pull_stream_callback=pull_audio_input_stream_callback, stream_format=audioFormat)

audio_config = AudioConfig(use_default_microphone=False, stream=pull_audio_input_stream)
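One more note: PullAudioInputStreamCallback.read receives a pre-allocated buffer that the callback must fill with audio data, returning the number of bytes written; the read in the question only returns a length and never supplies any audio. A minimal sketch of a callback that serves bytes from memory (the in-memory source here is illustrative):

import azure.cognitiveservices.speech as speechsdk

class InMemoryPullCallback(speechsdk.audio.PullAudioInputStreamCallback):

    def __init__(self, audio_bytes):
        super().__init__()
        self._data = audio_bytes
        self._pos = 0

    def read(self, buffer: memoryview) -> int:
        # Copy up to len(buffer) bytes into the SDK-provided buffer;
        # returning 0 signals end of stream.
        chunk = self._data[self._pos:self._pos + len(buffer)]
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)

    def close(self):
        pass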
Answered By: glenn

I got the solution to the problem myself. I think it would work with PullAudioInputStream too, but it worked for me using PushAudioInputStream. You don't need to create custom classes; it works like the following:

import azure.cognitiveservices.speech as speechsdk
from azure.cognitiveservices.speech.audio import AudioStreamFormat, AudioConfig

from threading import Thread, Event


speech_key, service_region = "key", "region"

channels = 1
bitsPerSample = 16
samplesPerSecond = 16000
audioFormat = AudioStreamFormat(samplesPerSecond, bitsPerSample, channels)

translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=speech_key, region=service_region)

fromLanguage = 'en-US'
toLanguage = 'hi'
translation_config.speech_recognition_language = fromLanguage
translation_config.add_target_language(toLanguage)

translation_config.voice_name = "hi-IN-Kalpana-Apollo"

# Remove Custom classes as they are not needed.

custom_push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audioFormat)

audio_config = AudioConfig(stream=custom_push_stream)

recognizer = speechsdk.translation.TranslationRecognizer(translation_config=translation_config, audio_config=audio_config)

# Create an event
synthesis_done = Event()

def synthesis_callback(evt):
    size = len(evt.result.audio)
    print('AUDIO SYNTHESIZED: {} byte(s) {}'.format(size, '(COMPLETED)' if size == 0 else ''))
    if size > 0:
        # Append each chunk; reopening with "wb+" would overwrite earlier chunks.
        with open("translated_output.wav", "ab") as t_sound_file:
            t_sound_file.write(evt.result.audio)
    else:
        # An empty chunk means synthesis has completed, so set the event.
        synthesis_done.set()

def recognized_complete(evt):
    if evt.result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print("RECOGNIZED '{}': {}".format(fromLanguage, evt.result.text))
        print("TRANSLATED into {}: {}".format(toLanguage, evt.result.translations['hi']))
    elif evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("RECOGNIZED: {} (text could not be translated)".format(evt.result.text))
    elif evt.result.reason == speechsdk.ResultReason.NoMatch:
        print("NOMATCH: Speech could not be recognized: {}".format(evt.result.no_match_details))
    elif evt.result.reason == speechsdk.ResultReason.Canceled:
        print("CANCELED: Reason={}".format(evt.result.cancellation_details.reason))
        if evt.result.cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("CANCELED: ErrorDetails={}".format(evt.result.cancellation_details.error_details))


recognizer.synthesizing.connect(synthesis_callback)
recognizer.recognized.connect(recognized_complete)

# Read audio data from a wav file
with open("speech_wav_audio.wav", 'rb') as open_audio_file:
    file_bytes = open_audio_file.read()

# Write the bytes to the stream; strictly only the raw PCM samples should
# be pushed, but the small wav header is generally harmless here.
custom_push_stream.write(file_bytes)
custom_push_stream.close()

# Start the recognition
recognizer.start_continuous_recognition()

# Wait until the synthesis-complete event is set
synthesis_done.wait()

# Only then stop the recognition
recognizer.stop_continuous_recognition()

I have used an Event from threading since start_continuous_recognition runs recognition on a separate thread, so without synchronization the main thread would stop recognition before the callbacks deliver any data. synthesis_done.wait() blocks until the event is set, and only then is stop_continuous_recognition called. Once you obtain the audio bytes, you can do whatever you wish with them in synthesis_callback. I have simplified the example and taken the bytes from a wav file.
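If the audio arrives in chunks rather than as a single file, the same push stream accepts repeated writes; a minimal sketch, where chunk_source is a hypothetical iterable yielding raw PCM byte chunks (e.g. read from a socket or a queue):

def feed_chunks(push_stream, chunk_source):
    # chunk_source is a hypothetical iterable of raw PCM chunks that
    # match the AudioStreamFormat declared above (16 kHz, 16-bit, mono).
    for chunk in chunk_source:
        push_stream.write(chunk)
    # Closing the stream tells the recognizer that no more audio is coming.
    push_stream.close()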

Answered By: Tanmay Virkar

I attempted to use the custom push stream to pass incoming stream data to the recognizer. My Flask app shows the proper logs when the incoming stream is received.

POST TwiML
[2023-03-20 11:20:41,040] INFO in wssserver: Connection accepted
[2023-03-20 11:20:41,182] INFO in wssserver: Connected Message received: {"event":"connected","protocol":"Call","version":"0.2.0"}
[2023-03-20 11:20:41,184] INFO in wssserver: Start Message received: {"event":"start","sequenceNumber":"1","start":{"accountSid":"AC6bdfe2517ccb244ad1b8866afa2740d5","streamSid":"MZ9fc4e3b4cb0f724ab2979c210f4151d5","callSid":"CA0ea96b4bc3ce6f438b4034603629985e","tracks":["inbound"],"mediaFormat":{"encoding":"audio/x-mulaw","sampleRate":8000,"channels":1}},"streamSid":"MZ9fc4e3b4cb0f724ab2979c210f4151d5"}
[2023-03-20 11:20:41,190] INFO in wssserver: Media message: {"event":"media","sequenceNumber":"2","media":{"track":"inbound","chunk":"1","timestamp":"160","payload":"fuTZ0s/P1NzvbF1WVFZYYXTt3dra3ef+ZlxVU1RYYXzi2NHP0Nff+GpcWVdaXWRv9OLb19nb5/ZtZmFhZ2189Ozq6Ons9nptYl5cXWJvfH56bWxsfPzt8u708uzj3Nva4+pyaV9eXl9iZWptevLk3NfW193n+nBqZ2pyfu/w8HhrX1xbWVtbWlpbXWZ06t3V09DT1NfZ2t3f6/JwY1tWUg=="},"streamSid":"MZ9fc4e3b4cb0f724ab2979c210f4151d5"}
[2023-03-20 11:20:41,193] INFO in wssserver: Payload is: fuTZ0s/P1NzvbF1WVFZYYXTt3dra3ef+ZlxVU1RYYXzi2NHP0Nff+GpcWVdaXWRv9OLb19nb5/ZtZmFhZ2189Ozq6Ons9nptYl5cXWJvfH56bWxsfPzt8u708uzj3Nva4+pyaV9eXl9iZWptevLk3NfW193n+nBqZ2pyfu/w8HhrX1xbWVtbWlpbXWZ06t3V09DT1NfZ2t3f6/JwY1tWUg==
[2023-03-20 11:20:41,194] INFO in wssserver: That's 160 bytes
SESSION STARTED: SessionEventArgs(session_id=5321c5eca95b48e4a7c1288b0c7b7798)

However, after the media from the payload is pushed to the custom push stream (at line 201 of my code), it breaks.
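One thing worth checking, based on the start message in the logs: Twilio media streams send base64-encoded audio/x-mulaw at 8 kHz, while the push stream earlier in this thread is configured for 16 kHz, 16-bit PCM. The payload has to be base64-decoded before writing, and the stream format must match the wire format. A sketch under those assumptions (AudioStreamWaveFormat.MULAW exists only in newer SDK versions, so verify it against the version in use; on_media_message is a hypothetical websocket handler):

import base64
import azure.cognitiveservices.speech as speechsdk

# Declare the stream format Twilio actually sends: 8 kHz, 8-bit mulaw.
mulaw_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=8000,
    bits_per_sample=8,
    channels=1,
    wave_stream_format=speechsdk.AudioStreamWaveFormat.MULAW)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=mulaw_format)

def on_media_message(message):
    # Hypothetical handler for a parsed Twilio "media" event:
    # the payload is base64-encoded and must be decoded before pushing.
    push_stream.write(base64.b64decode(message["media"]["payload"]))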

My complete code

My Error Log

Answered By: Rajesh Rajamani