Find large number of consecutive values fulfilling condition in a numpy array

Question:

I have some audio data loaded in a numpy array and I wish to segment the data by finding silent parts, i.e. parts where the audio amplitude is below a certain threshold over a period of time.

An extremely simple way to do this is something like this:

import re

values = ''.join("1" if abs(x) < SILENCE_THRESHOLD else "0" for x in samples)
pattern = re.compile('1{%d,}' % int(MIN_SILENCE))
for match in pattern.finditer(values):
    # code goes here

The code above finds parts where there are at least MIN_SILENCE consecutive elements smaller than SILENCE_THRESHOLD.

Now, obviously, the above code is horribly inefficient and a terrible abuse of regular expressions. Is there some other method that is more efficient, but still results in equally simple and short code?

Asked By: pafcu


Answers:

I haven’t tested this, but it should be close to what you are looking for. It’s slightly more lines of code, but it should be more efficient and readable, and it doesn’t abuse regular expressions 🙂

def find_silent(samples):
    num_silent = 0
    start = 0
    for index in range(len(samples)):
        if abs(samples[index]) < SILENCE_THRESHOLD:
            # Remember where the silent run started
            if num_silent == 0:
                start = index
            num_silent += 1
        else:
            # The run just ended; yield it if it had at least MIN_SILENCE samples
            if num_silent >= MIN_SILENCE:
                yield samples[start:index]
            num_silent = 0
    # Handle a silent run that extends to the end of the data
    if num_silent >= MIN_SILENCE:
        yield samples[start:]

for match in find_silent(samples):
    # code goes here
Answered By: Andrew Clark

This should return a list of (start,length) pairs:

def silent_segs(samples, threshold, min_dur):
    start = -1
    silent_segments = []
    for idx, x in enumerate(samples):
        if start < 0 and abs(x) < threshold:
            start = idx
        elif start >= 0 and abs(x) >= threshold:
            dur = idx - start
            if dur >= min_dur:
                silent_segments.append((start, dur))
            start = -1
    # Don't drop a silent segment that runs to the very end of the data
    if start >= 0 and len(samples) - start >= min_dur:
        silent_segments.append((start, len(samples) - start))
    return silent_segments

And a simple test:

>>> s = [-1,0,0,0,-1,10,-10,1,2,1,0,0,0,-1,-10]
>>> silent_segs(s,2,2)
[(0, 5), (9, 5)]
Answered By: job

Slightly sloppy, but simple and fast-ish, if you don’t mind using scipy:

from scipy.ndimage import gaussian_filter
sigma = 3
threshold = 1
above_threshold = gaussian_filter(data, sigma=sigma) > threshold

The idea is that quiet portions of the data will smooth down to low amplitude, and loud regions won’t. Tune ‘sigma’ to affect how long a ‘quiet’ region must be; tune ‘threshold’ to affect how quiet it must be. This slows down for large sigma, at which point using FFT-based smoothing might be faster.

This has the added benefit that single ‘hot pixels’ won’t disrupt your silence-finding, so you’re a little less sensitive to certain types of noise.
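For large sigma, the FFT-based variant mentioned above keeps the smoothing cost roughly independent of sigma. A minimal sketch, assuming 1-D data and (an assumption beyond the original answer) smoothing the rectified signal np.abs(data) so quiet regions stay near zero:

import numpy as np
from scipy import ndimage

# Hypothetical example data and parameters (not from the original answer)
data = np.random.randn(100_000)
sigma, threshold = 50, 0.1

# fourier_gaussian multiplies the FFT of the signal by the Fourier
# transform of a Gaussian kernel, at a cost independent of sigma
spectrum = np.fft.fft(np.abs(data))
smoothed = np.fft.ifft(ndimage.fourier_gaussian(spectrum, sigma=sigma)).real
quiet = smoothed < threshold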

Answered By: Andrew

Here’s a numpy-based solution.

I think (?) it should be faster than the other options. Hopefully it’s fairly clear.

However, it does require twice as much memory as the various generator-based solutions. As long as you can hold a single temporary copy of your data in memory (for the diff), plus a boolean array of the same length as your data (one byte per element in numpy), it should be pretty efficient…

import numpy as np

def main():
    # Generate some random data
    x = np.cumsum(np.random.random(1000) - 0.5)
    condition = np.abs(x) < 1
    
    # Print the start and stop indices of each region where the absolute 
    # values of x are below 1, and the min and max of each of these regions
    for start, stop in contiguous_regions(condition):
        segment = x[start:stop]
        print(start, stop)
        print(segment.min(), segment.max())

def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

main()
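To connect this back to the question, you can then keep only sufficiently long regions; a minimal sketch, assuming samples, SILENCE_THRESHOLD, and MIN_SILENCE from the question:

condition = np.abs(samples) < SILENCE_THRESHOLD
regions = contiguous_regions(condition)
# Keep only the rows whose region spans at least MIN_SILENCE samples
silent = regions[regions[:, 1] - regions[:, 0] >= MIN_SILENCE]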
Answered By: Joe Kington

Another way to do this quickly and concisely:

import numpy as np

v = [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
vd = np.diff(v)
# vd[i] == 1 for a 0->1 crossing; vd[i] == -1 for a 1->0 crossing
# need to add +1 to the indices as np.diff shifts left by 1

i1 = np.where(vd == 1)[0] + 1
i2 = np.where(vd == -1)[0] + 1

# corner cases for the first and the last element
if v[0] == 1:
    i1 = np.hstack((0, i1))
if v[-1] == 1:
    i2 = np.hstack((i2, len(v)))

Now i1 contains the beginning indices and i2 the end indices of the runs of 1s.
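To then pick out only sufficiently long runs, pair the two index arrays up; a minimal sketch with a hypothetical min_len:

min_len = 2  # hypothetical minimum run length
segments = [(a, b) for a, b in zip(i1, i2) if b - a >= min_len]
# for the example v above: [(2, 4), (6, 11), (14, 16)]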

Answered By: Brano

There is a very convenient solution to this using scipy.ndimage. For an array:

import numpy as np
import scipy.ndimage

a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])

which can be the result of a condition applied to another array, finding the contiguous regions is as simple as:

regions = scipy.ndimage.find_objects(scipy.ndimage.label(a)[0])

Then, applying any function to those regions can be done e.g. like:

[np.sum(a[r]) for r in regions]
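The minimum-duration requirement from the question maps onto the slices directly; a minimal sketch with a hypothetical min_dur:

min_dur = 3  # hypothetical minimum region length
# find_objects returns one tuple of slices per region; [0] is the only axis here
long_regions = [r for r in regions if r[0].stop - r[0].start >= min_dur]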
Answered By: Andrzej Pronobis

@joe-kington I’ve got about a 20%-25% speed improvement over the np.diff / np.nonzero solution by using argmax instead (see the code below; condition is a boolean array):

def contiguous_regions(condition):
    idx = []
    i = 0
    while i < len(condition):
        # Start of the next True run (argmax returns the first True)
        x1 = i + condition[i:].argmax()
        # End of that run (argmin returns the first False at or after x1)
        x2 = x1 + condition[x1:].argmin()
        if x1 == x2:
            if condition[x1]:
                # The run extends to the end of the array
                x2 = len(condition)
            else:
                # No True values remain
                break
        idx.append([x1, x2])
        i = x2
    return idx

Of course, your mileage may vary depending on your data.

Besides, I’m not entirely sure, but I guess numpy may optimize argmin/argmax over boolean arrays to stop searching at the first True/False occurrence. That might explain it.
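If you want to check the comparison on your own data, timeit makes that quick; a minimal sketch, assuming a hypothetical random boolean condition and the function above:

import timeit
import numpy as np

condition = np.random.random(100_000) < 0.5
print(timeit.timeit(lambda: contiguous_regions(condition), number=10))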

Answered By: user2154321

I know I’m late to the party, but another way to do this is with 1d convolutions:

np.convolve(sig > threshold, np.ones(cons_samples), 'same') == cons_samples

Here cons_samples is the number of consecutive samples you require above the threshold.
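Adapted to the silence-finding in the question (an assumption beyond the original answer, which looks for loud samples), a minimal sketch:

import numpy as np

quiet = np.abs(samples) < SILENCE_THRESHOLD
# A position is True when a MIN_SILENCE-wide window centered on it is all quiet;
# note that zero padding means runs touching the array edges can be missed
in_silence = np.convolve(quiet, np.ones(MIN_SILENCE), 'same') == MIN_SILENCE
silent_positions = np.flatnonzero(in_silence)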

Answered By: DankMasterDan