Finding Longest Common Substring

Question

I’m trying to find the longest substring common to all strands in a list of DNA strands; for testing, I’m using strings of digits.

Here is the function I wrote:

def lcsm(string_list):

    strands = sorted(string_list, key = len, reverse = False)
    longest_motif = ''

    seq1 = strands[0]
    seq2 = strands[1]

    motif = ''
    for ind1 in range(len(seq1)): # iterate over the shortest string
        i = 0
        for ind2 in range(len(seq2)):
            if ind1+i < len(seq1) and seq1[ind1+i] == seq2[ind2]:
                motif += seq1[ind1+i]
                i += 1
                if len(motif) >= len(longest_motif) and all(motif in x for x in strands):
                    longest_motif = motif                    
            else:
                motif = ''
                i = 0

    return longest_motif
    
print('right: ', lcsm(['123456789034357890', 
            '123456789034357890890357890', 
            '4612345678901234567890343578904654734357890', 
            '12356734121234567890343578903456789035789012345']))        

print('wrong: ', lcsm(['123456789034357890', 
            '123123456789034357890890357890', 
            '4612345678901234567890343578904654734357890', 
            '12356734121234567890343578903456789035789012345']))

My input is a list of strings and the output should be the longest common substring. In this case the result should be ‘123456789034357890’.

My problem is that when the sequence I am searching for is preceded by a cluster of digits that it also begins with (here, an extra ‘123’), the first digit of the right answer is skipped.

The first print call shows the right answer; the second shows the mistake described above.

Pay attention to the second string in the list (in the ‘wrong’ print statement).

As you see below, the first digit ‘1’ is missing.

right:  123456789034357890
wrong:  23456789034357890

Asked By: Kacper Kaszuba

Answer 1

The problem is not the length check but the way the inner loop recovers from a failed match. When seq1[ind1+i] != seq2[ind2], the function clears motif and moves on to the next character of seq2, so the character that caused the mismatch is never tried as the start of a new match. In the second test, ‘123’ matches first, then the second ‘1’ of ‘123123…’ is compared with ‘4’ and discarded. After that, nothing starting at seq1[0] can match, so the longest motif the function finds starts at ‘2’.
A simpler approach is to slice candidate substrings out of the shortest strand, considering only candidates longer than the current longest_motif, and to keep a candidate if it is present in all the other strands:

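def lcsm(string_list):
    # Only substrings of the shortest strand can be common to all strands.
    strands = sorted(string_list, key=len)
    shortest_strand = strands[0]
    longest_motif = ''

    for i in range(len(shortest_strand)):
        # Only try candidates longer than the best motif found so far.
        for j in range(i + len(longest_motif) + 1, len(shortest_strand) + 1):
            motif = shortest_strand[i:j]
            if all(motif in strand for strand in strands[1:]):
                longest_motif = motif
            else:
                # Extending a motif that is not common cannot make it common.
                break

    return longest_motif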

    
print('right: ', lcsm(['123456789034357890', 
            '123456789034357890890357890', 
            '4612345678901234567890343578904654734357890', 
            '12356734121234567890343578903456789035789012345']))        

print('right: ', lcsm(['123456789034357890', 
            '123123456789034357890890357890', 
            '4612345678901234567890343578904654734357890', 
            '12356734121234567890343578903456789035789012345']))
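
For long inputs such as real DNA strands, trying every start and end position can get slow. One possible alternative, sketched here as an illustration and not as part of the answer's code, is a binary search on the motif length: if a common substring of length k exists, one of length k - 1 exists too, so the longest length can be bisected. The names lcsm_bisect and common_substrings_of_length are made up for this sketch, and when several motifs share the maximum length it returns the lexicographically smallest one, which is an arbitrary choice.

```python
def common_substrings_of_length(strands, k):
    """Return the set of length-k substrings that occur in every strand."""
    first, *rest = strands
    # Every common substring must appear in the first (shortest) strand.
    candidates = {first[i:i + k] for i in range(len(first) - k + 1)}
    for strand in rest:
        candidates = {c for c in candidates if c in strand}
        if not candidates:
            break  # nothing left to check
    return candidates


def lcsm_bisect(string_list):
    """Longest common substring via binary search on its length (sketch)."""
    if not string_list:
        return ''
    strands = sorted(string_list, key=len)
    lo, hi = 0, len(strands[0])  # the answer's length lies in [lo, hi]
    best = ''
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_substrings_of_length(strands, mid)
        if found:
            best = min(found)  # arbitrary but deterministic tie-break
            lo = mid
        else:
            hi = mid - 1
    return best


print('bisect:', lcsm_bisect(['123456789034357890',
                              '123123456789034357890890357890',
                              '4612345678901234567890343578904654734357890',
                              '12356734121234567890343578903456789035789012345']))
```

Each bisection step still relies on Python's substring test (the in operator), so this only reduces the number of candidate lengths tried; it does not change how each candidate is checked.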

Answered By: Ake
