How to find the groups of consecutive elements in a NumPy array
Question:
I have to cluster the consecutive elements from a NumPy array. Considering the following example
a = [ 0, 47, 48, 49, 50, 97, 98, 99]
The output should be a list of tuples as follows
[(0), (47, 48, 49, 50), (97, 98, 99)]
Here the difference is just one between the elements. It will be great if the difference can also be specified as a limit or a hardcoded number.
Answers:
This sounds a little like homework, so if you dont mind I will suggest an approach
You can iterate over a list using
for i in range(len(a)):
print a[i]
You could test the next element in the list meets some criteria like follows
if a[i] == a[i] + 1:
print "it must be a consecutive run"
And you can store results seperately in
results = []
Beware – there is an index out of range error hidden in the above you will need to deal with
(a[1:]-a[:-1])==1
will produce a boolean array where False
indicates breaks in the runs. You can also use the built-in numpy.grad.
Here’s a lil func that might help:
def group_consecutives(vals, step=1):
"""Return list of consecutive lists of numbers from vals (number list)."""
run = []
result = [run]
expect = None
for v in vals:
if (v == expect) or (expect is None):
run.append(v)
else:
run = [v]
result.append(run)
expect = v + step
return result
>>> group_consecutives(a)
[[0], [47, 48, 49, 50], [97, 98, 99]]
>>> group_consecutives(a, step=47)
[[0, 47], [48], [49], [50, 97], [98], [99]]
P.S. This is pure Python. For a NumPy solution, see unutbu’s answer.
this is what I came up so far: not sure is 100% correct
import numpy as np
a = np.array([ 0, 47, 48, 49, 50, 97, 98, 99])
print np.split(a, np.cumsum( np.where(a[1:] - a[:-1] > 1) )+1)
returns:
>>>[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]
def consecutive(data, stepsize=1):
return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)
a = np.array([0, 47, 48, 49, 50, 97, 98, 99])
consecutive(a)
yields
[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]
Tested for one dimensional arrays
Get where diff isn’t one
diffs = numpy.diff(array) != 1
Get the indexes of diffs, grab the first dimension and add one to all because diff compares with the previous index
indexes = numpy.nonzero(diffs)[0] + 1
Split with the given indexes
groups = numpy.split(array, indexes)
It turns out that instead of np.split
, list comprehension is more performative. So the below function (almost like @unutbu’s consecutive
function except it uses a list comprehension to split the array) is much faster:
def consecutive_w_list_comprehension(arr, stepsize=1):
idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
return [arr[i:j] for i,j in zip(idx, idx[1:])]
For example, for an array of length 100_000, consecutive_w_list_comprehension
is over 4x faster:
arr = np.sort(np.random.choice(range(150000), size=100000, replace=False))
%timeit -n 100 consecutive(arr)
96.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 consecutive_w_list_comprehension(arr)
23.2 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In fact, this relationship holds up no matter the size of the array. The plot below shows the runtime difference between the answers on here.
Code used to produce the plot above:
import perfplot
import numpy as np
def consecutive(data, stepsize=1):
return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)
def consecutive_w_list_comprehension(arr, stepsize=1):
idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
return [arr[i:j] for i,j in zip(idx, idx[1:])]
def group_consecutives(vals, step=1):
run = []
result = [run]
expect = None
for v in vals:
if (v == expect) or (expect is None):
run.append(v)
else:
run = [v]
result.append(run)
expect = v + step
return result
def JozeWs(array):
diffs = np.diff(array) != 1
indexes = np.nonzero(diffs)[0] + 1
groups = np.split(array, indexes)
return groups
perfplot.show(
setup = lambda n: np.sort(np.random.choice(range(2*n), size=n, replace=False)),
kernels = [consecutive, consecutive_w_list_comprehension, group_consecutives, JozeWs],
labels = ['consecutive', 'consecutive_w_list_comprehension', 'group_consecutives', 'JozeWs'],
n_range = [2 ** k for k in range(5, 22)],
equality_check = lambda *lst: all((x==y).all() for x,y in zip(*lst)),
xlabel = '~len(arr)'
)
I have to cluster the consecutive elements from a NumPy array. Considering the following example
a = [ 0, 47, 48, 49, 50, 97, 98, 99]
The output should be a list of tuples as follows
[(0), (47, 48, 49, 50), (97, 98, 99)]
Here the difference is just one between the elements. It will be great if the difference can also be specified as a limit or a hardcoded number.
This sounds a little like homework, so if you dont mind I will suggest an approach
You can iterate over a list using
for i in range(len(a)):
print a[i]
You could test the next element in the list meets some criteria like follows
if a[i] == a[i] + 1:
print "it must be a consecutive run"
And you can store results seperately in
results = []
Beware – there is an index out of range error hidden in the above you will need to deal with
(a[1:]-a[:-1])==1
will produce a boolean array where False
indicates breaks in the runs. You can also use the built-in numpy.grad.
Here’s a lil func that might help:
def group_consecutives(vals, step=1):
"""Return list of consecutive lists of numbers from vals (number list)."""
run = []
result = [run]
expect = None
for v in vals:
if (v == expect) or (expect is None):
run.append(v)
else:
run = [v]
result.append(run)
expect = v + step
return result
>>> group_consecutives(a)
[[0], [47, 48, 49, 50], [97, 98, 99]]
>>> group_consecutives(a, step=47)
[[0, 47], [48], [49], [50, 97], [98], [99]]
P.S. This is pure Python. For a NumPy solution, see unutbu’s answer.
this is what I came up so far: not sure is 100% correct
import numpy as np
a = np.array([ 0, 47, 48, 49, 50, 97, 98, 99])
print np.split(a, np.cumsum( np.where(a[1:] - a[:-1] > 1) )+1)
returns:
>>>[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]
def consecutive(data, stepsize=1):
return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)
a = np.array([0, 47, 48, 49, 50, 97, 98, 99])
consecutive(a)
yields
[array([0]), array([47, 48, 49, 50]), array([97, 98, 99])]
Tested for one dimensional arrays
Get where diff isn’t one
diffs = numpy.diff(array) != 1
Get the indexes of diffs, grab the first dimension and add one to all because diff compares with the previous index
indexes = numpy.nonzero(diffs)[0] + 1
Split with the given indexes
groups = numpy.split(array, indexes)
It turns out that instead of np.split
, list comprehension is more performative. So the below function (almost like @unutbu’s consecutive
function except it uses a list comprehension to split the array) is much faster:
def consecutive_w_list_comprehension(arr, stepsize=1):
idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
return [arr[i:j] for i,j in zip(idx, idx[1:])]
For example, for an array of length 100_000, consecutive_w_list_comprehension
is over 4x faster:
arr = np.sort(np.random.choice(range(150000), size=100000, replace=False))
%timeit -n 100 consecutive(arr)
96.1 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 consecutive_w_list_comprehension(arr)
23.2 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In fact, this relationship holds up no matter the size of the array. The plot below shows the runtime difference between the answers on here.
Code used to produce the plot above:
import perfplot
import numpy as np
def consecutive(data, stepsize=1):
return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)
def consecutive_w_list_comprehension(arr, stepsize=1):
idx = np.r_[0, np.where(np.diff(arr) != stepsize)[0]+1, len(arr)]
return [arr[i:j] for i,j in zip(idx, idx[1:])]
def group_consecutives(vals, step=1):
run = []
result = [run]
expect = None
for v in vals:
if (v == expect) or (expect is None):
run.append(v)
else:
run = [v]
result.append(run)
expect = v + step
return result
def JozeWs(array):
diffs = np.diff(array) != 1
indexes = np.nonzero(diffs)[0] + 1
groups = np.split(array, indexes)
return groups
perfplot.show(
setup = lambda n: np.sort(np.random.choice(range(2*n), size=n, replace=False)),
kernels = [consecutive, consecutive_w_list_comprehension, group_consecutives, JozeWs],
labels = ['consecutive', 'consecutive_w_list_comprehension', 'group_consecutives', 'JozeWs'],
n_range = [2 ** k for k in range(5, 22)],
equality_check = lambda *lst: all((x==y).all() for x,y in zip(*lst)),
xlabel = '~len(arr)'
)