Best way to find unique sublists of a given length that are present in a list?

Question:

I have built a function that finds all of the unique sublists, of length i, present in a given list.

For example if you have list=[0,1,1,0,1] and i=1, you just get [1,0]. If i=2, you get [[0,1],[1,1],[1,0]], but not [0,0] because while it is a possible combination of 1 and 0, it is not present in the given list. The code is listed below.

While the code functions, I do not believe it is the most efficient. It relies on finding all possible sublists and testing for the presence of each one, which becomes impractical at i > 4 (for, say, a list of length 100). I was hoping I could get help in finding a more efficient method for computing this. I am fully aware that this is probably not a great way to do this, but with what little knowledge I have it's the first thing that I could come up with.

The code I have written:

import itertools

def present_sublists (l,of_length):
    """
    takes a given list of 1s and 0s and returns all the unique sublists of that
    list that are of a certain length
    """
    l_str=[str(int) for int in l]   #converts entries in input to strings
    l_joined="".join(l_str) #joins input into one string, i.e. "101010"
    sublist_sets=set(list(itertools.combinations(l_joined,of_length)))
    #uses itertools to get all possible combinations of substrings, and then set
    #properties to remove duplicates
    pos_sublists=list(sublist_sets) #returns the set to a list
    sublists1=[]
    for entry in pos_sublists:         #returns the entries to a list
        sublists1.append(list(entry))
    for entry in sublists1:            #returns the "1"s and "0" to 1s and 0s
        for entry2 in entry:
            entry[entry.index(entry2)]=int(entry2)
    present_sublists=[]
    for entry in sublists1:            #tests whether the possible sublist is
                                       #present in the input list
        for x in range(len(l) - len(entry) + 1):
            if entry not in present_sublists:
                if l[x: x + len(entry)] == entry:
                    present_sublists.append(entry)
    output=present_sublists
    return output
Asked By: YaGoi Root


Answers:

Let’s label the bits 0, 1, 2, 3, …

Let’s also define a function f(len, n), where f(len, n) is the set of all the strings of length len that occur in the first n bits.

So

f(0, n) = {''}  since you can always make the empty string
f(len, 0) = set() if len > 0

So what is the value of f(len, n) if len > 0 and n > 0? It contains everything in f(len, n - 1), plus everything in f(len - 1, n - 1) with l[n-1] appended to it.

You now have everything you need to find f(of_length, len(l)) reasonably efficiently.
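
The recurrence above can be sketched as a small table-filling loop. This is only an illustration: the name f_table and the choice to represent each result as a string of digits are assumptions, not part of the answer. One caveat worth checking against your needs: taken literally, the recurrence appends l[n-1] to every string in f(len - 1, n - 1), so it builds order-preserving subsequences that need not be contiguous. For [0, 1, 0] and len = 2 it therefore also produces '00'. If only contiguous runs are wanted, the second term would have to be restricted to the single string that ends at bit n - 1.

```python
def f_table(bits, length):
    """Evaluate f(length, len(bits)) with the recurrence, bottom-up.

    f_table is a hypothetical name; results are strings of digits.
    """
    n_bits = len(bits)
    # prev[n] holds f(k - 1, n). Start with k = 1, so prev[n] = f(0, n) = {''}.
    prev = [{""} for _ in range(n_bits + 1)]
    for k in range(1, length + 1):
        cur = [set()]  # f(k, 0) is empty whenever k > 0
        for n in range(1, n_bits + 1):
            # everything in f(k - 1, n - 1) with bit n - 1 appended
            extended = {s + str(bits[n - 1]) for s in prev[n - 1]}
            # f(k, n) = f(k, n - 1) plus the extended strings
            cur.append(cur[n - 1] | extended)
        prev = cur
    return prev[n_bits]


print(sorted(f_table([0, 1, 0], 2)))  # ['00', '01', '10']
```

The '00' in the output is the non-contiguous case mentioned above: it comes from bits 0 and 2.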

Answered By: Frank Yellin

Given your code and sample, it looks like you want all the unique contiguous sub-sequences of the given input. If so, you don’t need to compute all combinations, convert back and forth between strings, lists and sets, or loop over the data multiple times. Slice notation is more than enough to get the desired result:

>>> [0,1,2,3,4][0:2]
[0, 1]
>>> [0,1,2,3,4][1:3]
[1, 2]
>>> [0,1,2,3,4][2:4]
[2, 3]
>>> [0,1,2,3,4][3:5]
[3, 4]
>>> 

An appropriate choice of slice indexes gets us all the contiguous sub-sequences of any given size (2 in the example).

Now, to make this more automatic, we write an appropriate for loop:

>>> seq=[0,1,2,3,4]
>>> size=2
>>> for i in range(len(seq)-size+1):
        print(seq[i:i+size])

    
[0, 1]
[1, 2]
[2, 3]
[3, 4]
>>> 

Now that we know how to get all the sub-sequences we care about, we focus on keeping only the unique ones. A set is the natural tool for that, but a list can’t be stored in a set because it is not hashable. A tuple (basically an immutable list) can, so we use tuples instead. That is everything you need; let’s put it all together:

>>> def sub_sequences(seq,size):
        """return a set with all the unique contiguous sub-sequences of the given size of the given input"""
        seq = tuple(seq) #make it into a tuple so it can be used in a set
        if size>len(seq) or size<0: #base/trivial case
            return set() #or raise an exception like ValueError
        return {seq[i:i+size] for i in range(len(seq)-size+1)} #a set comprehension version of the previous mentioned loop

>>> sub_sequences([0,1,2,3,4],2)
{(0, 1), (1, 2), (2, 3), (3, 4)}
>>>
>>> #now lets use your sample
>>>
>>> sub_sequences([0,1,1,0,1],2)
{(0, 1), (1, 0), (1, 1)}
>>> sub_sequences([0,1,1,0,1],3)
{(1, 0, 1), (1, 1, 0), (0, 1, 1)}
>>> sub_sequences([0,1,1,0,1],4)
{(1, 1, 0, 1), (0, 1, 1, 0)}
>>> sub_sequences([0,1,1,0,1],5)
{(0, 1, 1, 0, 1)}
>>> 
Answered By: Copperfield

To stick to your function signature, I would suggest something like:

  1. Iterate through each sublist and put it into a set() to ensure uniqueness.
  2. The sublists need to be converted to tuples, since lists cannot be hashed and therefore cannot be put into sets as they are.
  3. Convert the resulting tuples in the set back to the required format.
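
Step 2 is easy to check directly. The following minimal sketch (variable names are illustrative only) shows that a set rejects a list with a TypeError, while the equivalent tuple is accepted and duplicates collapse into one entry:

```python
seen = set()

try:
    seen.add([0, 1])       # lists are mutable, so they are not hashable
except TypeError as exc:
    print("list rejected:", exc)

seen.add((0, 1))           # the tuple form can be hashed
seen.add(tuple([0, 1]))    # same contents, so the set keeps a single copy
print(seen)                # {(0, 1)}
```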

When creating new lists, a list comprehension is usually the most concise and Pythonic choice.

>>> def present_sublists(l,of_length):
...   sublists = set([tuple(l[i:i+of_length]) for i in range(0,len(l)+1-of_length)])
...   return [list(sublist) for sublist in sublists]
...
>>> present_sublists([0,1,1,0,1], 1)
[[0], [1]]
>>> present_sublists([0,1,1,0,1], 2)
[[0, 1], [1, 0], [1, 1]]
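
Because a set has no defined iteration order, the order of the returned sublists may differ from the order in which they occur in the input. If the order of first appearance matters, one possible variant is sketched below. It goes beyond the three steps above: the name present_sublists_ordered and the explicit length check are assumptions for illustration, and it relies on dict keys keeping insertion order (Python 3.7+).

```python
def present_sublists_ordered(l, of_length):
    """Unique contiguous sublists of the given length, in order of first appearance."""
    if of_length < 0 or of_length > len(l):
        return []
    # one tuple per window position, produced lazily
    windows = (tuple(l[i:i + of_length]) for i in range(len(l) - of_length + 1))
    # dict keys are unique and remember the order in which they were inserted
    return [list(w) for w in dict.fromkeys(windows)]


print(present_sublists_ordered([0, 1, 1, 0, 1], 2))  # [[0, 1], [1, 1], [1, 0]]
```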
Answered By: martoni