find length of sequences of identical values in a numpy array (run length encoding)

Question

In a pylab program (which could probably be a matlab program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t (and the timespan of my data is len(d) time units).

The events I’m interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It’s easy to get an array of booleans with b = d<threshold, and the problem comes down to computing the sequence of the lengths of the True-only words in b. But I do not know how to do that efficiently (i.e. using numpy primitives), and I resorted to walk the array and to do manual change detection (i.e. initialize counter when value goes from False to True, increase counter as long as value is True, and output the counter to the sequence when value goes back to False). But this is tremendously slow.

How to efficienly detect that sort of sequences in numpy arrays ?

Below is some python code that illustrates my problem : the fourth dot takes a very long time to appear (if not, increase the size of the array)

from pylab import *

threshold = 7

print '.'
d = 10*rand(10000000)

print '.'

b = d<threshold

print '.'

durations=[]
for i in xrange(len(b)):
    if b[i] and (i==0 or not b[i-1]):
        counter=1
    if  i>0 and b[i-1] and b[i]:
        counter+=1
    if (b[i-1] and not b[i]) or i==len(b)-1:
        durations.append(counter)

print '.'

Asked By: Gyom

||

Source

Answer 1

durations = []
counter   = 0

for bool in b:
    if bool:
        counter += 1
    elif counter > 0:
        durations.append(counter)
        counter = 0

if counter > 0:
    durations.append(counter)

Answered By: John Kugelman

Answer 2

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for various solutions including this one, of course):

def runs_of_ones(bits):
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, just can use list(runs_of_ones(bits)), of course; but maybe a list comprehension might be marginally faster still:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]

Moving to “numpy-native” possibilities, what about:

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts

Again: be sure to benchmark solutions against each others in realistic-for-you examples!

Answered By: Alex Martelli

Answer 3

Just in case anyone is curious (and since you mentioned MATLAB in passing), here’s one way to solve it in MATLAB:

threshold = 7;
d = 10*rand(1,100000);  % Sample data
b = diff([false (d < threshold) false]);
durations = find(b == -1)-find(b == 1);

I’m not too familiar with Python, but maybe this could help give you some ideas. =)

Answered By: gnovice

Answer 4

Here is a solution using only arrays: it takes an array containing a sequence of bools and counts the length of the transitions.

>>> from numpy import array, arange
>>> b = array([0,0,0,1,1,1,0,0,0,1,1,1,1,0,0], dtype=bool)
>>> sw = (b[:-1] ^ b[1:]); print sw
[False False  True False False  True False False  True False False False
  True False]
>>> isw = arange(len(sw))[sw]; print isw
[ 2  5  8 12]
>>> lens = isw[1::2] - isw[::2]; print lens
[3 4]

sw contains a true where there is a switch, isw converts them in indexes. The items of isw are then subtracted pairwise in lens.

Notice that if the sequence started with an 1 it would count the length of the 0s sequences: this can be fixed in the indexing to compute lens. Also, I have not tested corner cases such sequences of length 1.

Full function that returns start positions and lengths of all True-subarrays.

import numpy as np

def count_adjacent_true(arr):
    assert len(arr.shape) == 1
    assert arr.dtype == np.bool
    if arr.size == 0:
        return np.empty(0, dtype=int), np.empty(0, dtype=int)
    sw = np.insert(arr[1:] ^ arr[:-1], [0, arr.shape[0]-1], values=True)
    swi = np.arange(sw.shape[0])[sw]
    offset = 0 if arr[0] else 1
    lengths = swi[offset+1::2] - swi[offset:-1:2]
    return swi[offset:-1:2], lengths

Tested for different bool 1D-arrays (empty array; single/multiple elements; even/odd lengths; started with True/False; with only True/False elements).

Answered By: piro

Answer 5

Fully numpy vectorized and generic RLE for any array (works with strings, booleans etc too).

Outputs tuple of run lengths, start positions, and values.

import numpy as np

def rle(inarray):
        """ run length encoding. Partial credit to R rle function. 
            Multi datatype arrays catered for including non Numpy
            returns: tuple (runlengths, startpositions, values) """
        ia = np.asarray(inarray)                # force numpy
        n = len(ia)
        if n == 0: 
            return (None, None, None)
        else:
            y = ia[1:] != ia[:-1]               # pairwise unequal (string safe)
            i = np.append(np.where(y), n - 1)   # must include last element posi
            z = np.diff(np.append(-1, i))       # run lengths
            p = np.cumsum(np.append(0, z))[:-1] # positions
            return(z, p, ia[i])

Pretty fast (i7):

xx = np.random.randint(0, 5, 1000000)
%timeit yy = rle(xx)
100 loops, best of 3: 18.6 ms per loop

Multiple data types:

rle([True, True, True, False, True, False, False])
Out[8]: 
(array([3, 1, 1, 2]),
 array([0, 3, 4, 5]),
 array([ True, False,  True, False], dtype=bool))

rle(np.array([5, 4, 4, 4, 4, 0, 0]))
Out[9]: (array([1, 4, 2]), array([0, 1, 5]), array([5, 4, 0]))

rle(["hello", "hello", "my", "friend", "okay", "okay", "bye"])
Out[10]: 
(array([2, 1, 1, 2, 1]),
 array([0, 2, 3, 4, 6]),
 array(['hello', 'my', 'friend', 'okay', 'bye'], 
       dtype='|S6'))

Same results as Alex Martelli above:

xx = np.random.randint(0, 2, 20)

xx
Out[60]: array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])

am = runs_of_ones_array(xx)

tb = rle(xx)

am
Out[63]: array([4, 5, 2, 5])

tb[0][tb[2] == 1]
Out[64]: array([4, 5, 2, 5])

%timeit runs_of_ones_array(xx)
10000 loops, best of 3: 28.5 µs per loop

%timeit rle(xx)
10000 loops, best of 3: 38.2 µs per loop

Slightly slower than Alex (but still very fast), and much more flexible.

Answered By: Thomas Browne

Answer 6

Perhaps late to the party, but a Numba-based approach is going to be fastest by far and large.

import numpy as np
import numba as nb


@nb.njit
def count_steps_nb(arr):
    result = 1
    last_x = arr[0]
    for x in arr[1:]:
        if last_x != x:
            result += 1
            last_x = x
    return result


@nb.njit
def rle_nb(arr):
    n = count_steps_nb(arr)
    values = np.empty(n, dtype=arr.dtype)
    lengths = np.empty(n, dtype=np.int_)
    last_x = arr[0]
    length = 1
    i = 0
    for x in arr[1:]:
        if last_x != x:
            values[i] = last_x
            lengths[i] = length
            i += 1
            length = 1
            last_x = x
        else:
            length += 1
    values[i] = last_x
    lengths[i] = length
    return values, lengths

Numpy-based approaches (inspired by @ThomasBrowne answer but faster because the use of the expensive numpy.concatenate() is reduced to a minimum) are the runner-up (here two similar approaches are presented, one using not-equality to find the positions of the steps, and the other one using differences):

import numpy as np


def rle_np_neq(arr):
    n = len(arr)
    if n == 0:
        values = np.empty(0, dtype=arr.dtype)
        lengths = np.empty(0, dtype=np.int_)
    else:
        positions = np.concatenate(
            [[-1], np.nonzero(arr[1:] != arr[:-1])[0], [n - 1]])
        lengths = positions[1:] - positions[:-1]
        values = arr[positions[1:]]
    return values, lengths

import numpy as np


def rle_np_diff(arr):
    n = len(arr)
    if n == 0:
        values = np.empty(0, dtype=arr.dtype)
        lengths = np.empty(0, dtype=np.int_)
    else:
        positions = np.concatenate(
            [[-1], np.nonzero(arr[1:] - arr[:-1])[0], [n - 1]])
        lengths = positions[1:] - positions[:-1]
        values = arr[positions[1:]]
        return values, lengths

These both outperform the naïve and simple single loop approach:

import numpy as np


def rle_loop(arr):
    values = []
    lengths = []
    last_x = arr[0]
    length = 1
    for x in arr[1:]:
        if last_x != x:
            values.append(last_x)
            lengths.append(length)
            length = 1
            last_x = x
        else:
            length += 1
    values.append(last_x)
    lengths.append(length)
    return np.array(values), np.array(lengths)

On the contrary, using itertools.groupby() is not going to be any faster than the simple loop (unless on very special cases like in @AlexMartelli answer or someone will implement __len__ on the group object) because in general there is no simple way of extracting the group size information other than looping through the group itself, which is not exactly fast:

import itertools
import numpy as np


def rle_groupby(arr):
    values = []
    lengths = []
    for x, group in itertools.groupby(arr):
        values.append(x)
        lengths.append(sum(1 for _ in group))
    return np.array(values), np.array(lengths)

The results of some benchmarks on random integer arrays of varying size are reported:

(Full analysis here).

Answered By: norok2

Answer 7

import numpy as np
a = np.array([3, 3, 3, 1, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3])
is_not_in_run = (a != 3)
is_not_in_run = np.concatenate(([True], is_not_in_run, [True]))
gap_indices = np.where(is_not_in_run)[0]
run_lengths = np.diff(gap_indices) - 1
run_lengths = np.delete(run_lengths, np.where(run_lengths == 0)[0])
print(run_lengths)

Answered By: Jack

find length of sequences of identical values in a numpy array (run length encoding)

Question:

Answers: