How can I do real-time voice activity detection in Python?

Question:

I am performing voice activity detection on a recorded audio file to detect speech vs. non-speech portions in the waveform.

The output of the classifier looks like (highlighted green regions indicate speech):

[classifier output: waveform plot with detected speech regions highlighted in green]

The only issue I face here is making it work for a stream of audio input (e.g. from a microphone) and doing real-time analysis over a stipulated time frame.

I know PyAudio can be used to record speech from the microphone dynamically, and there are a couple of real-time visualization examples of a waveform, spectrum, spectrogram, etc., but I could not find anything relevant to carrying out feature extraction in a near real-time manner.

Asked By: Nickil Maveli


Answers:

Audio usually has a low bitrate, so I don’t see any problem with writing your code completely in numpy and Python. And if you need low-level array access, consider numba. Also profile your code, e.g. with line_profiler. Also note there is scipy.signal for more advanced signal processing.
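For instance, a small preprocessing sketch using scipy.signal: band-pass filtering to the typical speech band before any thresholding (the filter order and cutoff frequencies here are assumptions you would tune):

import numpy as np
from scipy import signal

def bandpass_speech(audio, sr=16000, low=300.0, high=3400.0):
    # 4th-order Butterworth band-pass over the typical (telephone) speech band
    sos = signal.butter(4, [low, high], btype='bandpass', fs=sr, output='sos')
    return signal.sosfilt(sos, audio)

# Example: filter one second of dummy audio
x = np.random.randn(16000).astype(np.float32)
y = bandpass_speech(x)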

Usually audio processing works in samples. So you define a sample size for your process, and then run a method to decide if that sample contains speech or not.

import numpy as np

def main_loop():
    stream = <create stream with your audio library>
    while True:
        sample = stream.readframes(<define number of samples / time to read>)
        print(is_speech(sample))

def is_speech(sample):
    audio = np.array(sample)

    < do your processing >

    # e.g. simple loudness test
    return np.any(audio > 0.8)

That should get you pretty far.
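As a more concrete sketch of that loop, here is a version assuming PyAudio capture of 16-bit mono audio at 16 kHz and a simple RMS loudness test. The frame size and threshold are placeholder values you would tune:

import numpy as np
import pyaudio

RATE = 16000
FRAME_SAMPLES = 1600      # 100 ms per frame (assumption; tune to taste)
THRESHOLD = 0.02          # RMS threshold on [-1, 1] audio (assumption; calibrate for your mic)

def is_speech(frame_bytes):
    # Convert int16 PCM bytes to float in [-1, 1] and apply a simple loudness (RMS) test
    audio = np.frombuffer(frame_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    return float(np.sqrt(np.mean(audio ** 2))) > THRESHOLD

def main_loop():
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=FRAME_SAMPLES)
    try:
        while True:
            frame = stream.read(FRAME_SAMPLES)
            print(is_speech(frame))
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()

if __name__ == '__main__':
    main_loop()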

Answered By: Chris

I think there are two approaches here:

  1. Threshold approach
  2. Small, deployable neural net approach

The first one is fast and feasible, and can be implemented and tested very quickly, while the second one is a bit more difficult to implement. I think you are already somewhat familiar with the 2nd option.

In the case of the 2nd approach, you will need a dataset of speech that is labeled frame by frame as a binary sequence, e.g. 00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices like mobile.
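As a purely illustrative sketch of how such a per-frame 0/1 label string could be aligned with features for training (the use of librosa MFCCs and the frame/hop sizes below are my own assumptions, not part of any particular toolkit):

import numpy as np
import librosa

def frames_and_labels(wav_path, label_string, sr=16000, hop=160):
    # label_string has one '0'/'1' character per hop-sized (10 ms) frame, e.g. "000011110000"
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop).T  # shape (frames, 13)
    labels = np.array([int(c) for c in label_string], dtype=np.int64)
    n = min(len(mfcc), len(labels))  # trim so features and labels line up
    return mfcc[:n], labels[:n]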

You can check this out from TensorFlow

This is a voice activity detector; I think it fits your purpose.

Also, check these out.

https://github.com/eesungkim/Voice_Activity_Detector

https://github.com/pyannote/pyannote-audio

Of course, you should compare the performance of the mentioned toolkits and models, and the feasibility of implementing them on mobile devices.

Answered By: Amin Taheri

You should try using Python bindings to webRTC VAD from Google. It’s lightweight, fast and provides very reasonable results, based on GMM modelling. As the decision is provided per frame, the latency is minimal.

# Run the VAD on 10 ms of silence. The result should be False.
import webrtcvad
vad = webrtcvad.Vad(2)

sample_rate = 16000
frame_duration = 10  # ms
frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)
print('Contains speech: %s' % (vad.is_speech(frame, sample_rate)))
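To run the same check on a live microphone stream, here is a minimal sketch assuming PyAudio capture of 16-bit mono audio at 16 kHz in 30 ms frames (webrtcvad only accepts 10, 20 or 30 ms frames at 8/16/32/48 kHz):

import pyaudio
import webrtcvad

RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = RATE * FRAME_MS // 1000   # 480 samples = 960 bytes of 16-bit PCM

vad = webrtcvad.Vad(2)                    # aggressiveness 0 (least) .. 3 (most)
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=FRAME_SAMPLES)
try:
    while True:
        frame = stream.read(FRAME_SAMPLES)
        print('speech' if vad.is_speech(frame, RATE) else 'silence')
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()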

Also, this article might be useful for you.

UPDATE December 2022

As the topic still draws attention, I’d like to update my answer. SileroVAD is a very fast and accurate VAD that was released recently under the MIT license.
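A minimal offline usage sketch, following the project’s torch.hub example (the helper names come from the silero-vad README and may differ between releases):

import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('audio.wav', sampling_rate=16000)                 # mono, 16 kHz
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} sample offsets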

Answered By: igrinis

I found out that LibROSA could be one of the solutions to your problem. There’s a simple tutorial on Medium on using microphone streaming to realise real-time prediction.

Let’s use the Short-Time Fourier Transform (STFT) as the feature extractor; the author explains:

To calculate the STFT, a Fast Fourier transform window size (n_fft) of 512 is used. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins (n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 to have a better overlapping of the windows when calculating the STFT.

stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))  # 'trimmed' is the trimmed audio signal from the tutorial
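A quick sanity check of those numbers on a dummy signal (the random array below only serves to show the output shape):

import numpy as np
import librosa

y = np.random.randn(22050).astype(np.float32)  # one second of dummy audio at 22050 Hz
S = np.abs(librosa.stft(y, n_fft=512, hop_length=256, win_length=512))
print(S.shape)  # (257, n_frames): n_fft/2 + 1 = 257 frequency bins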

import numpy as np
import pyaudio
import librosa
import matplotlib.pyplot as plt
import noisereduce as nr  # noisereduce 1.x API (audio_clip/noise_clip keyword arguments)

# Plot audio with zoomed in y axis
def plotAudio(output):
    fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(20,10))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    ax.margins(2, -0.1)
    plt.show()

# Plot audio
def plotAudio2(output):
    fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(20,4))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    plt.show()

def minMaxNormalize(arr):
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr-mn)/(mx-mn)

def predictSound(X):
    clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64) # Empirically select top_db for every sample
    stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts, axis=1)
    stfts = minMaxNormalize(stfts)
    # 'model' and 'lb' are the trained classifier and its label encoder from the full tutorial linked below
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]
    print(lb.inverse_transform([predictions[0]])[0])
    plotAudio2(clip)

CHUNKSIZE = 22050 # fixed chunk size
RATE = 22050

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1,
                rate=RATE, input=True, frames_per_buffer=CHUNKSIZE)

#preprocessing the noise around
#noise window
data = stream.read(10000)
noise_sample = np.frombuffer(data, dtype=np.float32)
print("Noise Sample")
plotAudio2(noise_sample)
loud_threshold = np.mean(np.abs(noise_sample)) * 10
print("Loud threshold", loud_threshold)
audio_buffer = []
near = 0

while(True):
    # Read chunk and load it into numpy array.
    data = stream.read(CHUNKSIZE)
    current_window = np.frombuffer(data, dtype=np.float32)
    
    #Reduce noise real-time
    current_window = nr.reduce_noise(audio_clip=current_window, noise_clip=noise_sample, verbose=False)
    
    if len(audio_buffer) == 0:
        audio_buffer = current_window
    else:
        if(np.mean(np.abs(current_window))<loud_threshold):
            print("Inside silence region")
            if(near<10):
                audio_buffer = np.concatenate((audio_buffer,current_window))
                near += 1
            else:
                predictSound(np.array(audio_buffer))
                audio_buffer = []
                near = 0
        else:
            print("Inside loud region")
            near = 0
            audio_buffer = np.concatenate((audio_buffer,current_window))

# close stream
stream.stop_stream()
stream.close()
p.terminate()

Code credit to: Chathuranga Siriwardhana

Full code can be found here.

Answered By: Angus

I recently found this question while looking for an answer to the same problem, so thanks for all the suggestions. I found 3 more detectors: picovoice is by far better than webrtc, while speechbrain and nvidia don’t support real-time, which is a shame. A minimal Cobra streaming sketch follows the list.

- picovoice cobra: https://picovoice.ai/docs/cobra/
- speechbrain: https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html#speechbrain.pretrained.interfaces.VAD
- nvidia: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html
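For completeness, a minimal Cobra streaming sketch based on the pvcobra Python quickstart; the AccessKey placeholder and the PyAudio capture are assumptions, so check the linked docs for the exact current API:

import struct
import pyaudio
import pvcobra

# The AccessKey comes from the Picovoice Console; the string below is a placeholder
cobra = pvcobra.create(access_key='YOUR_ACCESS_KEY')

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=cobra.sample_rate,
                input=True, frames_per_buffer=cobra.frame_length)
try:
    while True:
        pcm = struct.unpack_from('h' * cobra.frame_length,
                                 stream.read(cobra.frame_length))
        print('voice probability: %.2f' % cobra.process(pcm))
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()
    cobra.delete()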

Answered By: guestbusters