Split list based on when a pattern of consecutive numbering stops
Question:
I have an existing list. I want to break it up into separate lists whenever the following number is not equal to its preceding value.
x = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]
The desired output should look like this:
a = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100]
b = [2,3,3,4,4,5,5,8,8,9]
c = [20,21,21,22]
d = [23]
Answers:
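def group(l, skip=0):
    prevind = 0           # start index of the section being built
    currind = skip + 1    # index of the 2nd element of the current pair
    for val in l[currind::2]:
        if val != l[currind-1]:
            # pair mismatch: emit everything before this pair
            if currind-prevind-1 > 1: yield l[prevind:currind-1]
            prevind = currind-1
        currind += 2
    # emit whatever is left after the last mismatch
    if prevind != currind:
        yield l[prevind:currind]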
For the list you defined, calling it with skip=1 returns:
[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955]
[13, 955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53]
[411, 53, 1009, 1009]
[1884, 1009]
[878, 923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]
And for a simpler example list, [1,1,3,3,2,5]:
l2 = [1, 1, 3, 3, 2, 5]
for g in group(l2):
    print(g)
[1, 1, 3, 3]
[2, 5]
The reason skip is an optional parameter to the function is that in your example 38 was included despite it not being equal to 1200. If this was an error, then simply remove skip and set currind to equal 1 initially.
Explanation:
In a list [a,b,c,d,e,...] we want to compare elements pairwise in succession, i.e. a == b, c == d, and so on. When a comparison doesn’t return True, we capture all previous elements (excluding those already captured). To do this we need to keep track of where the last capture took place, which initially is 0 (i.e. no captures). We then go over each of the pairs by visiting every 2nd element in the list starting at currind, which by default (when not skipping an element) is 1, and compare each value we get from l[currind::2] to the value before it, l[currind-1]. currind is the index of each 2nd element, counting from currind’s initial value (1 by default). If the values don’t match, we need to perform a capture, but only if the resulting capture would be large enough. Hence the check currind-prevind-1 > 1: the slice l[prevind:currind-1] has exactly that length, so the capture is only made when it would hold at least 2 elements. l[prevind:currind-1] does this capture, going from the index of the last comparison which didn’t match (or 0, the default) up to the element before the first value of the current comparison pair (a,b or c,d etc.). Then prevind is set to currind-1, i.e. the index of the first element not yet captured. We then increment currind by 2 to go to the index of the next val. Finally, if there are elements left over after the loop, we extract them.
So for [1,1,3,3,2,5]:
val is 1, at index 1. comparing to value at 0 i.e 1
make currind the index of last element of the next pair
val is 3, at index 3. comparing to value at 2 i.e 3
make currind the index of last element of the next pair
val is 5, at index 5. comparing to value at 4 i.e 2
not equal so get slice between 0,4
[1, 1, 3, 3]
make currind the index of last element of the next pair #happens after the for loop
[2, 5]
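The walkthrough above can be reproduced with a short sketch that records each pair comparison. The helper name trace_pairs is hypothetical and is not part of the answer's code; it only mirrors how currind steps through every 2nd element.

```python
def trace_pairs(values, skip=0):
    """Return (index, value, previous_value, equal) for each pair comparison.

    Hypothetical helper for illustration: it visits the same indices that
    group() visits via l[currind::2], without doing any slicing.
    """
    steps = []
    for currind in range(skip + 1, len(values), 2):
        val = values[currind]
        prev = values[currind - 1]
        steps.append((currind, val, prev, val == prev))
    return steps


for step in trace_pairs([1, 1, 3, 3, 2, 5]):
    print(step)
# (1, 1, 1, True)
# (3, 3, 3, True)
# (5, 5, 2, False)
```

The only False entry is the pair at indices 4 and 5, which is where the first section [1, 1, 3, 3] ends.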
In order to answer your question:
I have […] a list. I want to break it up into separate lists whenever the following number is not equal to its preceding value.
Have a look at itertools.groupby.
Example:
import itertools
l = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
for x, v in itertools.groupby(l):
    # `v` is an iterator that yields all subsequent elements
    # that have the same value
    # `x` is that value
    print(list(v))
The output is:
$ python test.py
[38]
[1200, 1200]
[306, 306]
[391, 391]
[82, 82]
[35, 35]
[902, 902]
[955, 955]
[13]
Which is apparently what you are asking for?
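As a small sketch of what groupby exposes beyond the grouped values, the length of each run can be inspected as well. The helper name run_lengths and the shortened sample list are made up for illustration.

```python
import itertools


def run_lengths(values):
    """Return (value, run_length) for each run of equal consecutive values."""
    return [(value, len(list(run))) for value, run in itertools.groupby(values)]


print(run_lengths([1, 4, 4, 5, 5, 100, 2, 3, 3]))
# [(1, 1), (4, 2), (5, 2), (100, 1), (2, 1), (3, 2)]
```

Runs of length 2 are the repeated pairs, while runs of length 1 are the values that start or end a section in the pattern from the question.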
As for your pattern, here’s a generator function that, at the very least, produces the output you expect for the given input:
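import itertools
def split_sublists(input_list):
    sublist = []
    for val, l in itertools.groupby(input_list):
        l = list(l)
        if not sublist or len(l) == 2:
            # first value of a section, or a repeated pair: keep collecting
            sublist += l
        else:
            # a non-pair closes the current section
            sublist += l
            yield sublist
            sublist = []
    # emit the trailing section, unless the input ended exactly on a boundary
    if sublist:
        yield sublist
input_list = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]
for sublist in split_sublists(input_list):
    print(sublist)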
The output:
$ python test.py
[1, 4, 4, 5, 5, 8, 8, 10, 10, 25, 25, 70, 70, 90, 90, 100]
[2, 3, 3, 4, 4, 5, 5, 8, 8, 9]
[20, 21, 21, 22]
[23]
Here’s my ugly-ish solution for this:
x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist
for sublist in weird_split(x):
    print(sublist)
And the output:
[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
[955, 847, 847, 835]
[83, 5698]
[698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
Firstly, you haven’t defined behaviour for [1, 0, 0, 1, 0, 0, 1], so this splits it into [1, 0, 0, 1], [0, 0] and [1].
Secondly, there are a lot of corner cases that need to be handled correctly, so it’s longer than you might expect. This would also be shorter if it directly put things into lists, but generators are a good thing so I made sure not to do that.
To start, I use the full iterator interface instead of the yield shortcut, because it allows better sharing of state between the outer and inner iterables without making a new subsection generator each iteration. A nested def with yields might be able to do this in less space, but in this case the wordiness is acceptable, I think.
So, set-up:
class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)
        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True
We need to define the sub-iterator that yields until it finds a pair that doesn’t match. Because the end would be removed from the iterator we need to yield it on the next call to _subsection, so store it in _cache.
    def _subsection(self):
        yield self._cache
        try:
            while True:
                item1 = next(self.iter)
                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise
                if item1 == item2:
                    yield item1
                    yield item2
                else:
                    yield item1
                    self._cache = item2
                    return
        except StopIteration:
            self.finished = True
__iter__ should return self for iterators:
    def __iter__(self):
        return self
__next__ returns a subsection unless finished. Note that the caller must exhaust each section before asking for the next one if behaviour is to be reliable.
    def __next__(self):
        if self.finished:
            raise StopIteration
        subsection = self._subsection()
        return subsection
Some tests:
for item in repeating_sections(x):
    print(list(item))
#>>> [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
#>>> [955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
#>>> [53, 1009, 1009, 1884]
#>>> [1009, 878]
#>>> [923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]
for item in repeating_sections([1, 0, 0, 1, 0, 0, 1]):
    print(list(item))
#>>> [1, 0, 0, 1]
#>>> [0, 0]
#>>> [1]
Some timings to show this wasn’t totally pointless:
SETUP="
x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
x *= 5000
class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)
        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True
    def _subsection(self):
        yield self._cache
        try:
            while True:
                item1 = next(self.iter)
                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise
                if item1 == item2:
                    yield item1
                    yield item2
                else:
                    yield item1
                    self._cache = item2
                    return
        except StopIteration:
            self.finished = True
    def __iter__(self):
        return self
    def __next__(self):
        if self.finished:
            raise StopIteration
        subsection = self._subsection()
        return subsection
def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist
"
python -m timeit -s "$SETUP" "for section in repeating_sections(x):" " for item in section: pass"
python -m timeit -s "$SETUP" "for section in weird_split(x):" " for item in section: pass"
Result:
10 loops, best of 3: 150 msec per loop
10 loops, best of 3: 207 msec per loop
Not a massive difference, but it’s faster nonetheless.
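The nested def with yields mentioned earlier could look roughly like the sketch below. This is an assumption about what such a variant might be, not the answer's code: the names sections, subsection and carry are made up, and it relies on Python 3's nonlocal. Unlike the class, it drains whatever the caller left unread before starting the next section.

```python
def sections(iterable):
    """Yield sub-generators, splitting where a pair (after the first item) differs."""
    it = iter(iterable)
    try:
        carry = next(it)  # first item of the next section
    except StopIteration:
        return  # empty input: no sections at all
    finished = False

    def subsection():
        nonlocal carry, finished
        yield carry
        while True:
            try:
                item1 = next(it)
            except StopIteration:
                finished = True
                return
            try:
                item2 = next(it)
            except StopIteration:
                yield item1
                finished = True
                return
            yield item1
            if item1 == item2:
                yield item2
            else:
                carry = item2  # mismatched pair: item2 opens the next section
                return

    while not finished:
        current = subsection()
        yield current
        for _ in current:  # drain anything the caller did not consume
            pass


print([list(s) for s in sections([1, 0, 0, 1, 0, 0, 1])])
# [[1, 0, 0, 1], [0, 0], [1]]
```

Whether this is actually shorter or faster than the class is not measured here; it is only meant to show how the shared state could live in a closure instead of on an instance.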
The numpy version:
>>> import numpy as np
>>> inds = np.where(np.diff(x))[0]
>>> out = np.split(x, inds[:-1][np.diff(inds) == 1][0::2] + 2)
>>> for n in out:
...     print(n)
[ 38 1200 1200 306 306 391 391 82 82 35 35 902 902 955 955
13]
[955 847 847 835 835 698 698 777 777 896 896 923 923 940 940 569 569 53
53 411]
[ 53 1009 1009 1884]
[1009 878]
[ 923 886 886 511 511 942 942 1067 1067 1888 1888 243 243 1556]
Your new case is the same:
>>> inds = np.where(np.diff(x))[0]
>>> out = np.split(x, inds[:-1][np.diff(inds) == 1][0::2] + 2)
>>> for n in out:
...     print(n)
...
[ 1 4 4 5 5 8 8 10 10 25 25 70 70 90 90 100]
[2 3 3 4 4 5 5 8 8 9]
[20 21 21 22]
[23]
Starting with x as a list:
%timeit inds = np.where(np.diff(x))[0]; out = np.split(x, inds[:-1][np.diff(inds) == 1][0::2] + 2)
10000 loops, best of 3: 169 µs per loop
If x is a numpy array:
%timeit inds = np.where(np.diff(arr_x))[0]; out = np.split(arr_x, inds[:-1][np.diff(inds) == 1][0::2] + 2)
10000 loops, best of 3: 135 µs per loop
For larger systems you can likely expect numpy to have better performance vs pure python.
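To make the one-liner easier to follow, here is a step-by-step sketch on a shortened list. The sample data and the intermediate names consecutive and split_points are made up for illustration; the inds[:-1] slice is there so the boolean mask from np.diff(inds) has the same length as the array it indexes.

```python
import numpy as np

x = np.array([1, 4, 4, 5, 5, 100, 2, 3, 3, 9, 20])

# Positions i where x[i] != x[i + 1] (nonzero differences).
inds = np.where(np.diff(x))[0]

# Keep positions whose successor in inds is adjacent, i.e. two value
# changes in a row, which only happens around a section boundary.
consecutive = inds[:-1][np.diff(inds) == 1]

# Take every other such position as the one just before a section's
# last element; +2 turns it into the index where the next section starts.
split_points = consecutive[0::2] + 2

sections = [part.tolist() for part in np.split(x, split_points)]
print(inds.tolist())          # [0, 2, 4, 5, 6, 8, 9]
print(split_points.tolist())  # [6, 10]
print(sections)               # [[1, 4, 4, 5, 5, 100], [2, 3, 3, 9], [20]]
```

The [0::2] step assumes the boundaries follow the pattern from the question, where adjacent changes come in twos around each section break; inputs that break that pattern may split differently.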