Split list based on when a pattern of consecutive numbering stops

Question:

I have an existing list. I want to break it up into separate lists whenever the following number is not equal to its preceding value.

x = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]

The desired output should look like this:

 a = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100]

 b = [2,3,3,4,4,5,5,8,8,9]

 c = [20,21,21,22]

 d = [23]
Asked By: West1234


Answers:

def group(l,skip=0):
    prevind = 0
    currind = skip+1
    for val in l[currind::2]:
        if val != l[currind-1]:
            if currind-prevind-1 > 1: yield l[prevind:currind-1]
            prevind = currind-1
        currind += 2
    if prevind != currind:
        yield l[prevind:currind]

For the list you defined, calling it with skip=1 returns:

[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955]
[13, 955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53]
[411, 53, 1009, 1009]
[1884, 1009]
[878, 923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]

And for a simpler example list, [1,1,3,3,2,5]:

l2 = [1, 1, 3, 3, 2, 5]
for g in group(l2):
    print(g)

[1, 1, 3, 3]
[2, 5]

The reason skip is an optional parameter to the function is that in your example 38 was included despite it not being equal to 1200. If this was an error, then simply remove skip and set currind to equal 1 initially.


Explanation:

In a list [a,b,c,d,e,...] we want to compare elements in successive pairs, i.e. a == b, c == d, and so on. When a comparison fails, we capture all previous elements (excluding those already captured). To do this we need to keep track of where the last capture took place, which is initially 0 (i.e. no captures).

We then go over the pairs by visiting every 2nd element in the list, starting at currind, which by default (when not skipping an element) is 1, and compare each value we get from l[currind::2] to the value before it, l[currind-1]. currind is the index of each 2nd element, counted from currind’s initial value (1 by default).

If the values don’t match, we need to perform a capture, but only if the resulting capture would be long enough. Hence the check currind-prevind-1 > 1: that expression is the length of the slice, so a capture only happens when it holds at least 2 elements. l[prevind:currind-1] does this capture, going from the index where the previous capture stopped (or 0, the default) up to, but not including, the first value of the mismatched pair. Then prevind is set to currind-1, i.e. the index of the first element not yet captured. We then increment currind by 2 to go to the index of the next val. Finally, if there are elements left over, we extract them.
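The pairwise comparison described above can be sketched on its own. This is only an illustration, not part of the answer's function: pair_matches is a hypothetical helper name, and it only shows which pairs are compared and whether they match, without doing any capturing.

```python
def pair_matches(l, skip=0):
    """Hypothetical helper: compare l[skip] with l[skip+1], l[skip+2] with l[skip+3], ..."""
    firsts = l[skip::2]        # first element of each pair
    seconds = l[skip + 1::2]   # second element of each pair (the `val` in group)
    # zip stops at the shorter slice, so a trailing unpaired element is ignored
    return [(a, b, a == b) for a, b in zip(firsts, seconds)]


print(pair_matches([1, 1, 3, 3, 2, 5]))
# [(1, 1, True), (3, 3, True), (2, 5, False)]
```

The first False entry marks the pair at which group would cut the list.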

So for [1,1,3,3,2,5]:

val is 1, at index 1. comparing to value at 0 i.e 1
make currind the index of last element of the next pair
val is 3, at index 3. comparing to value at 2 i.e 3
make currind the index of last element of the next pair
val is 5, at index 5. comparing to value at 4 i.e 2
not equal so get slice between 0,4
[1, 1, 3, 3]
make currind the index of last element of the next pair  #happens after the for loop
[2, 5]
Answered By: HennyH

In order to answer your question:

I have […] a list. I want to break it up into separate lists whenever the following number is not equal to its preceding value.

Have a look at itertools.groupby.

Example:

import itertools
l = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
for x, v in itertools.groupby(l):
    # `v` is an iterator that yields a run of consecutive elements
    # that have the same value
    # `x` is that value
    print(list(v))

The output is:

$ python test.py
[38]
[1200, 1200]
[306, 306]
[391, 391]
[82, 82]
[35, 35]
[902, 902]
[955, 955]
[13]

Which is apparently what you are asking for?
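As a small sketch of what groupby exposes, the run lengths can be collected alongside the values. The shortened list and the name runs are illustrative assumptions, not taken from the question.

```python
import itertools

l = [38, 1200, 1200, 306, 306, 13]

# Each group is a run of consecutive equal elements; record the value
# and how many times it repeats in that run.
runs = [(value, len(list(group))) for value, group in itertools.groupby(l)]

print(runs)
# [(38, 1), (1200, 2), (306, 2), (13, 1)]
```

A run of length 1 is a value that is not part of a pair, which is the information needed to decide where a sublist should end.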


As for your pattern, here’s a generator function that, at the very least, produces the output you expect for the given input:

import itertools

def split_sublists(input_list):
    sublist = []
    for val, l in itertools.groupby(input_list):
        l = list(l)
        if not sublist or len(l) == 2:
            sublist += l
        else:
            sublist += l
            yield sublist
            sublist = []
    if sublist:  # don't yield an empty list when the input ends on a split
        yield sublist

input_list = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]
for sublist in split_sublists(input_list):
    print(sublist)

The output:

$ python test.py
[1, 4, 4, 5, 5, 8, 8, 10, 10, 25, 25, 70, 70, 90, 90, 100]
[2, 3, 3, 4, 4, 5, 5, 8, 8, 9]
[20, 21, 21, 22]
[23]
Answered By: moooeeeep

Here’s my ugly-ish solution for this:

x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]

def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist

for sublist in weird_split(x):
    print(sublist)

And the output:

[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
[955, 847, 847, 835]
[83, 5698]
[698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
Answered By: stranac

Firstly, you haven’t defined behaviour for [1, 0, 0, 1, 0, 0, 1], so this splits it into [1, 0, 0, 1], [0, 0] and [1].

Secondly, there are a lot of corner cases that need to be handled correctly, so it’s longer than you might expect. This would also be shorter if it directly put things into lists, but generators are a good thing so I made sure not to do that.

Thirdly, I use the full iterator interface instead of the yield shortcut, because it allows better sharing of state between the outer and inner iterables without making a new subsection generator each iteration. A nested def with yields might be able to do this in less space, but in this case the wordiness is acceptable, I think.

So, set-up:

class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)

        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True

We need to define the sub-iterator that yields until it finds a pair that doesn’t match. Because the element that ends a section has already been consumed from the iterator, it has to be yielded first on the next call to _subsection, so it is stored in _cache.

    def _subsection(self):
        yield self._cache

        try:
            while True:
                item1 = next(self.iter)

                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise

                if item1 == item2:
                    yield item1
                    yield item2

                else:
                    yield item1
                    self._cache = item2
                    return

        except StopIteration:
            self.finished = True

__iter__ should return self for iterators:

    def __iter__(self):
        return self

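__next__ returns a subsection unless finished. Note that the caller must exhaust each subsection before asking for the next one if behaviour is to be reliable, because the subsections share state with the outer iterator.

    def __next__(self):
        if self.finished:
            raise StopIteration

        # The subsection is lazy: the caller has to consume it fully
        # before calling __next__ again.
        return self._subsection()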

Some tests:

for item in repeating_sections(x):
    print(list(item))
#>>> [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
#>>> [955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
#>>> [53, 1009, 1009, 1884]
#>>> [1009, 878]
#>>> [923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]


for item in repeating_sections([1, 0, 0, 1, 0, 0, 1]):
    print(list(item))
#>>> [1, 0, 0, 1]
#>>> [0, 0]
#>>> [1]
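Why exhausting each subsection matters can be shown with a short sketch. It uses a condensed copy of the class above; materialized_sections is a hypothetical helper added for illustration, and the islice cap of 5 is an arbitrary choice to keep the demonstration finite.

```python
import itertools


class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)
        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True

    def _subsection(self):
        yield self._cache
        try:
            while True:
                item1 = next(self.iter)
                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise
                if item1 == item2:
                    yield item1
                    yield item2
                else:
                    yield item1
                    self._cache = item2
                    return
        except StopIteration:
            self.finished = True

    def __iter__(self):
        return self

    def __next__(self):
        if self.finished:
            raise StopIteration
        return self._subsection()


def materialized_sections(iterable):
    """Hypothetical helper: exhaust each lazy subsection before requesting the next."""
    return [list(section) for section in repeating_sections(iterable)]


print(materialized_sections([1, 0, 0, 1, 0, 0, 1]))
# [[1, 0, 0, 1], [0, 0], [1]]

# Without exhausting, `finished` is never set, so the outer iterator keeps
# handing out generators; islice caps the demonstration at 5.
lazy = repeating_sections([1, 0, 0, 1, 0, 0, 1])
unconsumed = list(itertools.islice(lazy, 5))
print(len(unconsumed))  # 5, although the input only has 3 sections
```

Consuming each section inside the loop body, as the tests above do with list(item), avoids the problem.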

Some timings to show this wasn’t totally pointless:

SETUP="
x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
x *= 5000

class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)

        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True

    def _subsection(self):
        yield self._cache

        try:
            while True:
                item1 = next(self.iter)

                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise

                if item1 == item2:
                    yield item1
                    yield item2

                else:
                    yield item1
                    self._cache = item2
                    return

        except StopIteration:
            self.finished = True

    def __iter__(self):
        return self

    def __next__(self):
        if self.finished:
            raise StopIteration

        subsection = self._subsection()
        return subsection

        for item in subsection:
            pass


def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist
"

python -m timeit -s "$SETUP" "for section in repeating_sections(x):" "    for item in section: pass"
python -m timeit -s "$SETUP" "for section in weird_split(x):"        "    for item in section: pass"

Result:

10 loops, best of 3: 150 msec per loop
10 loops, best of 3: 207 msec per loop

Not a massive difference, but it’s faster nonetheless.

Answered By: Veedrac

The numpy version:

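>>> import numpy as np
>>> inds = np.where(np.diff(x))[0]
>>> # index inds[:-1] so the boolean mask and the indexed array have equal length
>>> out = np.split(x,inds[:-1][np.diff(inds)==1][0::2]+2)
>>> for n in out:
...     print(n)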

[  38 1200 1200  306  306  391  391   82   82   35   35  902  902  955  955
   13]
[955 847 847 835 835 698 698 777 777 896 896 923 923 940 940 569 569  53
  53 411]
[  53 1009 1009 1884]
[1009  878]
[ 923  886  886  511  511  942  942 1067 1067 1888 1888  243  243 1556]

Your new case is the same:

>>> inds = np.where(np.diff(x))[0]
>>> out = np.split(x,inds[:-1][np.diff(inds)==1][0::2]+2)
>>> for n in out:
...     print(n)
...
[  1   4   4   5   5   8   8  10  10  25  25  70  70  90  90 100]
[2 3 3 4 4 5 5 8 8 9]
[20 21 21 22]
[23]

Starting with x as list:

%timeit inds = np.where(np.diff(x))[0];out = np.split(x,inds[np.diff(inds)==1][0::2]+2)
10000 loops, best of 3: 169 µs per loop

If x is a numpy array:

%timeit inds = np.where(np.diff(arr_x))[0];out = np.split(arr_x,inds[np.diff(inds)==1][0::2]+2)
10000 loops, best of 3: 135 µs per loop

For larger inputs, you can likely expect numpy to outperform pure Python.
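The one-liner is easier to follow when broken into named steps. The sketch below does that for the list from the question; the intermediate names (adjacent, candidates, split_points) are illustrative assumptions. The mask is applied to inds[:-1] so that its length matches the indexed array, which recent NumPy releases require.

```python
import numpy as np

x = np.array([1, 4, 4, 5, 5, 8, 8, 10, 10, 25, 25, 70, 70, 90, 90, 100,
              2, 3, 3, 4, 4, 5, 5, 8, 8, 9, 20, 21, 21, 22, 23])

# Indices i where x[i] != x[i + 1], i.e. where the value changes.
inds = np.where(np.diff(x))[0]

# True where two changes occur at neighbouring indices, which means
# the value between them is not part of a pair.
adjacent = np.diff(inds) == 1

# np.diff(inds) is one element shorter than inds, so index inds[:-1].
candidates = inds[:-1][adjacent]

# Keep every other candidate and shift by 2 to get the split positions,
# as in the one-liner.
split_points = candidates[0::2] + 2

out = [part.tolist() for part in np.split(x, split_points)]

print(split_points.tolist())  # [16, 26, 30]
for part in out:
    print(part)
```

Each split position is the index at which a new sublist starts.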

Answered By: Daniel