Longest Repeating Subsequence: Edge Cases

Question:

Problem

While solving the Longest Repeating Subsequence problem using bottom-up dynamic programming, I started running into an edge case whenever a letter was repeated an odd number of times.

The goal is to find the longest subsequence that occurs twice in the string using elements at different indices. The ranges can overlap, but the indices should be disjoint (e.g., str[1], str[4] and str[2], str[6] can be a solution, but not str[1], str[2] and str[2], str[3]).

Minimal Reproducible Example

s = 'AXXXA'

n = len(s)

dp = [['' for i in range(n + 1)] for j in range(n + 1)]

for i in range(1, n + 1):
  for j in range(1, n + 1):
    if (i != j and s[i - 1] == s[j - 1]):
      dp[i][j] = dp[i - 1][j - 1] + s[i - 1]
    else:
      dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

print(dp[n][n])

Question

Any pointers on how to avoid this?

With input s = 'AXXXA', the answer should be either A or X, but the code returns XX, apparently pairing up the third X with both the first X and the second X.

False Start

I don’t want to handle a match (s[i - 1] == s[j - 1]) by checking whether s[i - 1] is already in dp[i - 1][j - 1], because another input might be something like AAJDDAJJTATA, where the A must be added twice.

Asked By: William Edwardson


Answers:

  • To retrieve the longest subsequence itself, it’s best to write a separate function that works from the finished dp grid.

  • Your algorithm is fine; you only need to store lengths in dp and add 1 whenever s[i - 1] == s[j - 1] and i != j:

    def lrs_length(s):
        n = len(s)
        # dp[i][j]: LRS length when one copy uses s[:i] and the other uses s[:j]
        dp = [[0 for _ in range(n + 1)] for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if s[i - 1] == s[j - 1] and i != j:
                    dp[i][j] = 1 + dp[i - 1][j - 1]
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

        return dp[-1][-1]

If you want to build the longest subsequence itself, walk back through the dp grid from dp[n][n]; whenever s[i - 1] == s[j - 1] and i != j holds, append the character and step diagonally:

    def get_longest():
        i, j = n, n
        lrs = []
        while i > 0 and j > 0:
            if s[i - 1] == s[j - 1] and i != j:
                lrs.append(s[i - 1])
                i -= 1
                j -= 1
            elif dp[i - 1][j] > dp[i][j - 1]:
                i -= 1
            else:
                j -= 1

        return ''.join(lrs[::-1])

Code

def LRS(s):
    def get_longest():
        i, j = n, n
        lrs = []
        while i > 0 and j > 0:
            if s[i - 1] == s[j - 1] and i != j:
                lrs.append(s[i - 1])
                i -= 1
                j -= 1
            elif dp[i - 1][j] > dp[i][j - 1]:
                i -= 1
            else:
                j -= 1

        return ''.join(lrs[::-1])
    n = len(s)
    dp = [[0 for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if s[i - 1] == s[j - 1] and i != j:
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    print(get_longest())
    return dp[-1][-1]


s = 'AXXXA'
print(LRS(s))

Prints

XX
2
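
To see which positions the traceback actually pairs up, a variant can record index pairs instead of characters. This is only a sketch: lrs_index_pairs is a hypothetical helper name, not part of the code above, and it assumes the same dp recurrence and the same tie-breaking in the traceback.

```python
def lrs_index_pairs(s):
    """Return 0-based (index_in_copy_1, index_in_copy_2) pairs found by the traceback."""
    n = len(s)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if s[i - 1] == s[j - 1] and i != j:
                dp[i][j] = 1 + dp[i - 1][j - 1]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    pairs = []
    i, j = n, n
    while i > 0 and j > 0:
        if s[i - 1] == s[j - 1] and i != j:
            # record which two positions were matched, then step diagonally
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    pairs.reverse()  # the traceback collects pairs from the end of the string
    return pairs


print(lrs_index_pairs('AXXXA'))  # [(2, 1), (3, 2)]
```

For AXXXA, one copy uses indices 2 and 3 and the other uses indices 1 and 2. Index 2 appears in both copies, but never in the same pair, which is all the i != j check enforces.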

Answered By: user24714692

You can keep track of the indices of the last added characters and require that, when two characters match, their indices differ not only from each other but also from those of the last added pair:

s = 'AXXXA'
n = len(s)
dp = [[('', 0, 0) for i in range(n + 1)] for j in range(n + 1)]

for i in range(1, n + 1):
    for j in range(1, n + 1):
        last = dp[i - 1][j - 1]
        if last[2] != i != j != last[1] and s[i - 1] == s[j - 1]:
            dp[i][j] = last[0] + s[i - 1], i, j
        else:
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=lambda t: t[0])

print(dp[n][n][0]) # outputs X

Demo: https://ideone.com/mSKgKy
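
One caveat worth checking when dp cells hold strings: Python's max compares strings lexicographically unless the key says otherwise, so key=lambda t: t[0] prefers the alphabetically larger string rather than the longer one. A tiny illustration (the candidates list is made up for this example):

```python
candidates = ['B', 'AA']

by_value = max(candidates)            # lexicographic comparison: 'B' > 'AA'
by_length = max(candidates, key=len)  # comparison by length

print(by_value)   # B
print(by_length)  # AA
```

If the longest string is what is wanted, a key such as lambda t: len(t[0]) may be closer to the intent. Note that ties are then resolved by argument order, so the particular character reported for AXXXA can differ.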

Answered By: blhsing

Actually, your initial algorithm and its answer are correct (… but this is a good question because others might confuse what an LRS means).

Given your input (in), the subsequences (s1, s2) are:

in: AXXXA
s1:  XX
s2:   XX

So XX (length 2) is indeed the correct answer here.

X would be the correct answer for the problem’s non-overlapping version, where the ranges – not just individual indices – must be disjoint.
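
The difference between these readings can be checked by brute force on short strings. The sketch below is an assumption-laden illustration, not an efficient algorithm: brute_force_lrs and its mode names are hypothetical, and it simply tries every pair of index tuples. It covers three rules: paired characters at different indices (what the dp computes), fully disjoint index sets (the rule described in the question), and disjoint ranges.

```python
from itertools import combinations


def brute_force_lrs(s, mode='positionwise'):
    """Length of the longest subsequence occurring twice in s under the given rule.

    Exponential time; only meant for checking tiny inputs.
    """
    n = len(s)
    for k in range(n, 0, -1):  # try the longest candidates first
        for p in combinations(range(n), k):
            for q in combinations(range(n), k):
                # both index tuples must spell the same subsequence
                if any(s[a] != s[b] for a, b in zip(p, q)):
                    continue
                if mode == 'positionwise':
                    ok = all(a != b for a, b in zip(p, q))
                elif mode == 'disjoint_indices':
                    ok = set(p).isdisjoint(q)
                elif mode == 'disjoint_ranges':
                    ok = p[-1] < q[0]
                else:
                    raise ValueError(f'unknown mode: {mode}')
                if ok:
                    return k
    return 0


for mode in ('positionwise', 'disjoint_indices', 'disjoint_ranges'):
    print(mode, brute_force_lrs('AXXXA', mode), brute_force_lrs('AABB', mode))
# positionwise 2 2
# disjoint_indices 1 2
# disjoint_ranges 1 1
```

For AXXXA only the position-wise rule allows XX. AABB separates the other two rules: indices {0, 2} and {1, 3} are disjoint sets, but their ranges overlap.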

Answered By: Nathan Davis