Find longest repetitive sequence in a string

Question:

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value “helloworld” to be returned.

I know of a few ways of accomplishing this but the problem I’m facing is that the actual string is absurdly large so I’m really looking for a method that can do it in a timely fashion.

Asked By: Snesticle


Answers:

This problem is a variant of the longest repeated substring problem, and there is an O(n)-time algorithm for it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate every node with the number of leaves in its subtree (time O(n) using a DFS), and then find the internal node of greatest string depth that has at least three leaves below it (time O(n) using a DFS). The string spelled out on the path from the root to that node is the answer, and the overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I’m not sure whether this is a good implementation.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn’t too hard to code up), and Kasai’s algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
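As a rough sketch of the suffix array plus LCP array approach, the following uses a deliberately naive suffix sort for brevity (an assumption made here to keep the example short; for a very large string you would swap in Manber-Myers or a library implementation) together with Kasai's algorithm. The function names are illustrative only. Note that overlapping occurrences count as separate repeats in this formulation.

```python
def suffix_array(s):
    """Naive construction: sort suffix start positions by the suffix text.

    Simple but slow and memory-hungry in the worst case; this is a
    placeholder for a real O(n log n) construction.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai's algorithm.

    lcp[r] is the length of the common prefix of the suffixes at sorted
    ranks r-1 and r; lcp[0] is 0 by convention.
    """
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def longest_repeated_three_times(s):
    """Longest substring occurring at least three times (overlaps allowed)."""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    best_len, best_start = 0, 0
    # lcp[r] and lcp[r + 1] together cover three consecutive sorted
    # suffixes (ranks r-1, r, r+1); their minimum is the prefix all
    # three have in common.
    for r in range(1, len(s) - 1):
        common = min(lcp[r], lcp[r + 1])
        if common > best_len:
            best_len, best_start = common, sa[r]
    return s[best_start:best_start + best_len]


text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_repeated_three_times(text))  # helloworld
```

Only the suffix sort would need replacing to scale this up; the LCP construction and the final scan are already linear in the length of the string.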

Hope this helps!

Answered By: templatetypedef

The first idea that came to mind is searching with progressively larger regular expressions:

import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1

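while True:
    # Look for i word characters that recur at least twice more later on.
    # Raw strings keep \w and the \1 backreference intact.
    m = re.search(r"(" + r"\w" * i + r").*\1.*\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1

print(largest)    # helloworld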

The code ran successfully. The time complexity appears to be at least O(n^2).

Answered By: Matt Coughlin

Let’s start with the longest candidate length and work downward: count how often each substring of that length occurs, and stop as soon as the most frequent one appears 3 or more times.

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
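# try the longest candidate length first and work downward
for n in range(len(a)//times,0,-1):
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    # stop at the first length whose most frequent substring occurs often enough
    if freqs.most_common(1)[0][1]>=times:
        seq=freqs.most_common(1)[0][0]
        break
print("sequence '%s' of length %s occurs %s or more times"%(seq,n,times))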

Result:

>>> sequence 'helloworld' of length 10 occurs 3 or more times

Edit: if you suspect that you’re dealing with random input and the common substring is likely to be short, you had better start (if you need the speed) with short substrings and stop when you can’t find any that appear at least 3 times:

from collections import Counter
a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times=3
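# try the shortest candidate length first and work upward
for n in range(1,len(a)//times+1):
    substrings=[a[i:i+n] for i in range(len(a)-n+1)]
    freqs=Counter(substrings)
    if freqs.most_common(1)[0][1]<times:
        # nothing of this length repeats often enough; the previous length wins
        n-=1
        break
    else:
        seq=freqs.most_common(1)[0][0]
print("sequence '%s' of length %s occurs %s or more times"%(seq,n,times))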

The same result as above.
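Both loops rely on the same observation: if some substring of length n occurs at least 3 times, then so does its prefix of length n-1. That makes the set of feasible lengths contiguous, so the length can also be found by binary search instead of a linear scan. A minimal sketch of that variation, assuming overlapping occurrences count (the helper names are hypothetical and not part of the code above):

```python
from collections import Counter


def most_common_of_length(text, n):
    """Return (substring, count) for the most frequent length-n substring."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(1)[0]


def longest_repeated(text, times=3):
    """Binary search on the length of a substring occurring >= times."""
    lo, hi = 0, len(text)  # the answer's length always lies in [lo, hi]
    best = ''
    while lo < hi:
        mid = (lo + hi + 1) // 2
        sub, count = most_common_of_length(text, mid)
        if count >= times:
            best, lo = sub, mid   # length mid works; try longer
        else:
            hi = mid - 1          # length mid fails; so does anything longer
    return best


a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
print(longest_repeated(a))  # helloworld
```

Each probe still builds every substring of one length, so this only reduces the number of lengths tried from linear to logarithmic; it does not make a single probe cheaper.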

Answered By: Max Li

Use defaultdict to tally each substring beginning at each position in the input string. The OP wasn’t clear whether overlapping matches should be included; this brute-force method includes them.

from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

Called on the example string from the question, it returns:

('helloworld', 3)

Answered By: PaulMcG

If you reverse the input string and feed it to a regex like (.+)(?:.*\1){2}, it should give you the longest string repeated 3 times. (Reverse capture group 1 to get the answer.)

Edit:
I have to retract this approach. It depends on the first match. Unless each match is tested against the maximum length found so far, in an iterative loop, a regex won’t work for this.

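Answered By: user557597

from collections import Counter

def Longest(string):
    # Note: this only considers runs of a single repeated character
    # (e.g. 'aaa'), not arbitrary repeated substrings.
    counts = Counter(string)   # count each character once, up front
    b = []
    le = []

    # build every candidate run: 'a', 'aa', ... up to count+1 copies
    for i in set(string):
        for j in range(counts[i]+1):
            b.append(i* (j+1))

    # keep the runs that actually occur in the string
    for i in b:
        if i in string:
            le.append(i)

    # return all runs of maximal length
    longest = len(max(le, key=len))
    return [s for s in le if len(s)==longest]

Answered By: FellerRock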

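In Python you can use the string count method.
We also use a helper generator that yields candidate substrings of a given length for our example string. Note that, as written, the code below accepts substrings that occur at least twice; to require three occurrences as in the question, change the test to count_occurences > 2.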

The code is straightforward:

test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

def generate_substrings_of_length(this_string, length):
    ''' Generates unique substrings of a given length for a given string'''
    for i in range(len(this_string)-2*length+1):
        yield this_string[i:i+length]

def longest_substring(this_string):
    '''Returns the string with at least two repetitions which has maximum length'''
    max_substring = ''
    for subs_length in range(2, len(this_string) // 2 + 1):
        for substring in generate_substrings_of_length(this_string, subs_length):
            count_occurences = this_string.count(substring)
            if count_occurences > 1 :
                if len(substring) > len(max_substring) :
                    max_substring = substring
    return max_substring

I must note here (and this is important) that the generate_substrings_of_length generator does not yield every substring of the given length. It yields only the substrings needed for the comparison; otherwise we would get some artificial duplicates. For example:

test_string = "banana"

GS = generate_substrings_of_length(test_string , 2)
for i in GS: print(i)

prints:

ba
an
na

and this is enough for what we need.
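One detail of the count-based approach is worth checking before relying on it: str.count counts non-overlapping occurrences only, whereas the brute-force tally in an earlier answer counts overlapping ones. A small sketch of the difference (count_overlapping is a hypothetical helper written for this illustration, not part of the code above):

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Step forward by one character so overlapping matches are found.
        start = text.find(sub, start + 1)
    return count


print("aaaa".count("aa"))               # 2 (non-overlapping)
print(count_overlapping("aaaa", "aa"))  # 3 (overlapping)
```

Which of the two counts is wanted depends on how "repeated" is meant in the question, so the choice should be made deliberately.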

Answered By: youth4ever