Maximum number of unique substrings from a partition

Question:

I modified the title so that it is more understandable.

Here is a detailed version of the question:

We have a string s and want to split it into substrings that are all pairwise distinct. What is the maximum number of unique substrings that we can get from a single split? In other words, what is the maximum number of unique substrings that concatenate to form s?

Here are some examples:

Example 1
s = 'aababaa'
output = 4
Explain: we can split `s` into aa|b|aba|a or aab|a|b|aa, 
         and 4 is the max number of substrings we can get from one split.

Example 2
s = 'aba'
output = 2
Explain: a|ba

Example 3
s = 'aaaaaaa'
output = 3
Explain: a|aa|aaaa

Note: s only contains lowercase characters. I am not told how long s is, and hence cannot guess the optimal time complexity. 🙁

Is this an NP-hard problem? If not, how can I solve it efficiently?

I heard this problem from one of my friends and couldn’t answer it. I am trying to use a trie + greedy approach to solve it, but the method fails on the first example.

Here is the Trie solution that I came up with:

def triesolution(s):
    trie = {}
    p = trie
    output = 0
    for char in s:
        if char not in p:
            # Unseen branch: end the current substring here, count it,
            # and start the next substring from the trie root.
            output += 1
            p[char] = {}
            p = trie
        else:
            # This prefix was seen before: keep extending the substring.
            p = p[char]
    return output

For example 1, the above code returns 3 because it greedily splits s into a|ab|abaa.

Update: thanks to everyone’s ideas, it looks like this problem is very close to an NP-hard one. For now, I am trying to think of it from this direction: suppose we have a function Guess(n) that returns True if we can find n unique substrings from one split, and False otherwise. One observation is that if Guess(n) == True, then Guess(i) == True for all i <= n, since we can always merge two adjacent substrings (merging a longest part with one of its neighbours yields a strictly longer, hence still unique, part). This observation leads to a binary search solution. However, it still requires computing Guess very efficiently; sadly, I could not find a polynomial way to compute Guess(n).
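A minimal sketch of that binary search direction. The guess oracle below is a hypothetical stand-in, stubbed with brute-force recursion, so only the search logic (not the oracle) is efficient:

def guess(s, n, seen=()):
    # Hypothetical oracle: can s be split into n pairwise distinct parts?
    # Stubbed here with exponential brute force over all possible first parts.
    if n == 0:
        return s == ''
    return any(s[:i] not in seen and guess(s[i:], n - 1, (*seen, s[:i]))
               for i in range(1, len(s) + 1))

def max_parts(s):
    lo, hi = 1, len(s)        # one part always works; len(s) parts is an upper bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if guess(s, mid):     # monotone: feasible at mid implies feasible below mid
            lo = mid
        else:
            hi = mid - 1
    return lo

print(max_parts('aababaa'))   # 4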

Asked By: wqm1800


Answers:

Here’s a solution, but it blows up really fast and is nowhere near an efficient solution. It first breaks the string down into a list of unique substrings with no concern for ordering, then attempts to use itertools.permutations to reassemble those substrings back into the original string, testing EACH permutation to see if it matches the original string.

import itertools as it

def splitter(seq):
    # Collect seq itself plus every (prefix, suffix) pair of seq.
    temp = [seq]
    for x in range(1, len(seq)):
        temp.append(seq[:x])
        temp.append(seq[x:])
    return temp

if __name__ == "__main__":
    test = input("Enter a string: ")
    temp = splitter(test)
    copy = temp[::]
    # Splitting every piece once more yields every substring of test.
    for x in temp:
        if len(x) > 1:
            copy.extend(splitter(x))
    copy = sorted(set(copy))
    print(copy)
    count = []
    # permutations() never reuses an element, so every reassembled
    # candidate automatically consists of unique substrings.
    for x in range(1, len(test) + 1):
        for perm in it.permutations(copy, x):
            if "".join(perm) == test:
                count.append((len(perm), perm))
    print(f"All unique splits: {count}")
    print(f"Longest unique split : {max(count)[0]}")

For the first test we get this:

All unique splits: [(1, ('aababaa',)), (2, ('a', 'ababaa')), (2, ('aa', 'babaa')), (2, 
('aab', 'abaa')), (2, ('aaba', 'baa')), (2, ('aabab', 'aa')), (2, ('aababa', 'a')), (3, 
('a', 'ab', 'abaa')), (3, ('a', 'aba', 'baa')), (3, ('a', 'abab', 'aa')), (3, ('aa', 'b',
 'abaa')), (3, ('aa', 'ba', 'baa')), (3, ('aa', 'baba', 'a')), (3, ('aab', 'a', 'baa')),
 (3, ('aab', 'ab', 'aa')), (3, ('aab', 'aba', 'a')), (3, ('aaba', 'b', 'aa')), (3,
 ('aaba', 'ba', 'a')), (4, ('a', 'aba', 'b', 'aa')), (4, ('aa', 'b', 'a', 'baa')), (4,
 ('aa', 'b', 'aba', 'a')), (4, ('aab', 'a', 'b', 'aa'))]
Longest unique split : 4

Perhaps this can be optimized somehow, but it takes quite a few seconds on this machine.

Answered By: neutrino_logic

I have given this problem a try and thought of it in terms of whether to make a partition at a given index.
So the function is recursive and creates 2 branches at each index:
1. Don’t partition at index i.
2. Partition at index i.

Based on the partitions I fill in a set and then return the size of the set.

def keep(last, current, inp, seen):
    # Special case for two-character strings.
    if len(inp) == 2:
        if inp[0] == inp[1]:
            return 1
        return 2

    # Past the end of the string: the set holds one complete split.
    if current >= len(inp):
        return len(seen)

    # This is when we are at the start of a piece. In this case we can only
    # do one thing: not partition, and thus take the entire remaining
    # string as a possible piece.
    if current == last:
        with_rest = seen.copy()
        with_rest.add(inp[current:])
        return keep(last, current + 1, inp, with_rest)

    # Branch 2: partition at index current.
    with_cut = seen.copy()
    if current != (len(inp) - 1):
        with_cut.add(inp[last:current])

    # Branch 1: don't partition at index current.
    no_cut = seen.copy()

    return max(keep(last, current + 1, inp, no_cut),
               keep(current, current + 1, inp, with_cut))

print(keep(0, 0, "121", set()))
print(keep(0, 0, "aaaaaaa", set()))
print(keep(0, 0, "aba", set()))
print(keep(0, 0, "aababaa", set()))
print(keep(0, 0, "21", set()))
print(keep(0, 0, "22", set()))

https://onlinegdb.com/HJynWw-iH

Answered By: Ravi Chandak

This is known as the collision-aware string partition problem and is shown to be NP-complete by a reduction from 3-SAT in a paper by Anne Condon, Ján Maňuch and Chris Thachuk – Complexity of a collision-aware string partition problem and its relation to oligo design for gene synthesis (International Computing and Combinatorics Conference, 265-275, 2008).

Answered By: גלעד ברקן

You can use a recursive function with a set as a second parameter to keep track of the unique strings seen on the current path so far. For each call, iterate over all indices plus 1 at which to split the string into a candidate substring. If the candidate is not yet in the set, make a recursive call with the remaining string and the candidate added to the set, giving the maximum number of unique substrings of the remaining string; add 1 to it, and take the maximum over all iterations. Return 0 if either the given string is empty or all candidate strings are already in the set:

def max_unique_substrings(s, seen=()):
    maximum = 0
    for i in range(1, len(s) + 1):
        candidate = s[:i]
        # Only use this prefix if it is new on the current path.
        if candidate not in seen:
            maximum = max(maximum, 1 + max_unique_substrings(s[i:], {candidate, *seen}))
    return maximum

Demo: https://repl.it/@blhsing/PriceyScalySphere

In Python 3.8+, the above logic can also be written as a single call to the max function, with a generator expression that uses an assignment expression (the “walrus” operator) to filter out candidates that have already been “seen”:

def max_unique_substrings(s, seen=()):
    return max((1 + max_unique_substrings(s[i:], {candidate, *seen}) for i in range(1, len(s) + 1) if (candidate := s[:i]) not in seen), default=0)

Answered By: blhsing

My other answer was closely related but didn’t correspond exactly to this problem, leaving it ambiguous whether finding the largest equality-free string factorisation might be of a different complexity class than deciding whether there exists any equality-free factorisation with bounded factor length (the latter being addressed by the cited paper).

In the paper, Pattern matching with variables: Fast algorithms and new hardness results (Henning Fernau, Florin Manea, Robert Mercaş, and Markus L. Schmid, in Proc. 32nd Symposium on Theoretical Aspects of Computer Science, STACS 2015, volume 30 of Leibniz International Proceedings in Informatics (LIPIcs), pages 302–315, 2015), the authors show that it is NP-complete to decide, for a given number k and a word w, whether w can be factorised into k distinct factors.

If we consider templatetypedef’s comment, implying there could be a polynomial-time solution to the unrestricted, largest equality-free factorisation, then surely we could use such an algorithm to answer whether we can split the string into k distinct factors (substrings), by simply checking whether k is at most the maximum we already know.

Schmid (2016), however, writes that “it is still an open problem whether MaxEFF-s remains NP-complete if the alphabet is fixed.” (Computing equality-free and repetitive string factorisations, Theoretical Computer Science Volume 618, 7 March 2016, Pages 42-51)

Maximum Equality-Free Factorisation Size (MaxEFF-s) is still parameterised, though, and is defined as:

Instance: A word w and a number m, 1 ≤ m ≤ |w|.

Question: Does there exist an equality-free factorisation p of w with s(p) ≥ m? (s(p) being the size of the factorisation.)

Answered By: גלעד ברקן

(Many thanks to Gilad Barkan (גלעד ברקן) for making me aware of this discussion.)

Let me share my thoughts about this problem from a purely theoretical point of view (note that I also use “factor” instead of “subword”).

I think a sufficiently formal definition of the problem (or problems) considered here is the following:

Given a word w, find words u_1, u_2, …, u_k such that

  • u_i != u_j for every i, j with 1 <= i < j <= k and
  • u_1 u_2 … u_k = w

Maximisation variant (we want many u_i): maximise k

Minimisation variant (we want short u_i): minimise max{|u_i| : 1 <= i <= k}

These problems become decision problems by additionally giving a bound B, which, according to whether we are talking about the “many-factors”-variant or the “short factors”-variant, is a lower bound on k (we want at least B factors), or an upper bound on max{|u_i| : 1 <= i <= k} (we want factors of length at most B), respectively.
For talking about NP-hardness, we need to talk about decision problems.

Let’s use the terms SF for the “short factors”-variant and MF for the “many factors”-variant.
In particular, and this is a really crucial point, the problems are defined in such a way that we get a word over some alphabet that is not in any way restricted. The problem version where we know a priori that we only get input words over, say, the alphabet {a, b, c, d} is a different problem! NP-hardness does not automatically carry over from the “unrestricted” to the “fixed alphabet” variant (the latter might be simpler).

Both SF and MF are NP-complete problems. This has been shown in [1, 1b] and [2], respectively (as Gilad has pointed out already).
If I understand the (maybe too) informal problem definition at the beginning of this discussion correctly, then the problem of this discussion is exactly the problem MF. It is initially not mentioned that the words are restricted to come from some fixed alphabet; later it is said that we can assume that only lower-case letters are used. If this means that we only consider words over the fixed alphabet {a, b, c, …, z}, then this would actually change a lot in terms of NP-hardness.

A closer look reveals some differences in the complexity of SF and MF:

  1. paper [1, 1b] shows that SF remains NP-complete if we fix the alphabet to a binary one (more precisely: given a word w over letters a and b and a bound B, can we factorise it into distinct factors of length at most B?).
  2. paper [1, 1b] shows that SF remains NP-complete if we fix the bound B = 2 (more precisely: given a word w, can we factorise it into distinct factors of length at most 2?).
  3. paper [3] shows that if both the alphabet and the bound B are fixed, then SF can be solved in polynomial time.
  4. paper [2] shows that MF is NP-complete, but only if the alphabet is not restricted or fixed a priori! In particular, it does not answer the question whether the problem is NP-complete if we only consider input words over some fixed alphabet (as is usually the case in practical settings).
  5. paper [3] shows that MF can be solved in polynomial time if the input bound B is again upper bounded by some constant, i.e., the problem input is a word and a bound B from {1, 2, …, K}, where K is some fixed constant.

Some comments on these results: W.r.t. (1) and (2), it is intuitively clear that if the alphabet is binary, then, in order to make the problem SF difficult, the bound B cannot be fixed as well. Conversely, fixing B = 2 means that the alphabet size must get rather large in order to produce difficult instances. As a consequence, (3) is rather trivial (in fact, [3] says slightly more: we can then solve it in running time not only polynomial, but also |w|^2 times a factor that only depends on the alphabet size and the bound B).
(5) is not difficult either: if our word is long in comparison to B, then we can get the desired factorisation by simply splitting it into factors of different lengths. If not, then we can brute-force all possibilities, which is exponential only in B, which in this case is assumed to be a constant.
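To illustrate the “long word” case of (5), here is a quick throwaway sketch (my own illustration, not code from any of the cited papers): factors of pairwise distinct lengths are automatically distinct strings, so cutting pieces of lengths 1, 2, 3, … always yields an equality-free factorisation.

def distinct_length_split(w):
    # Cut pieces of lengths 1, 2, 3, ...; distinct lengths guarantee
    # distinct factors, giving a factorisation of size about sqrt(2*len(w)).
    parts, i, k = [], 0, 1
    while i + k <= len(w):
        parts.append(w[i:i+k])
        i += k
        k += 1
    if i < len(w):
        # The leftover is shorter than k; merging it into the last piece
        # makes that piece strictly longer than all others, so all lengths
        # (and hence all factors) stay distinct.
        parts[-1] += w[i:]
    return parts

print(distinct_length_split('aababaa'))   # ['a', 'ab', 'abaa']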

So the picture we have is the following: SF seems more difficult, because we have hardness even for fixed alphabets or for a fixed bound B. The problem MF, on the other hand, gets poly-time solvable if the bound is fixed (in this regard it is easier than SF), while the corresponding question w.r.t. the alphabet size is open.
So MF is slightly less complex than SF, even if it turns out that MF for fixed alphabets is also NP-complete. However, if it can be shown that MF can be solved for fixed alphabets in poly-time, then MF is shown to be much easier than SF… because the one case for which it is hard is somewhat artificial (unbounded alphabet!).

I did put some effort into trying to resolve the case of MF with bounded alphabet, but I was not able to settle it and stopped working on it since then. I do not believe that other researchers have tried very hard to solve it (so this is not one of those very hard open problems that many people have already tried and failed to solve; I consider it somehow doable). My guess would be that it is also NP-hard for fixed alphabets, but maybe the reduction is so complicated that you would get something like “MF is hard for alphabets of size 35 or larger”, which would not be super nice either.

Regarding further literature, I know the paper [4], which considers the problem of splitting a word w into distinct factors u_1, u_2, …, u_k that are all palindromes, which is also NP-complete.

I had a quick look at paper [5], pointed out by Gilad. It seems to consider a different setting, though. In this paper, the authors are interested in the combinatorial question of how many distinct subsequences or subwords can be contained in a given word, but these can overlap. For example, aaabaab contains 19 different subwords: a, b, aa, ab, ba, aaa, aab, aba, baa, aaab, aaba, abaa, baab, aaaba, aabaa, abaab, aaabaa, aabaab, aaabaab. Some of them have only one occurrence, like baa; some of them several, like aa. In any case, the question is not how we can somehow split the word in order to get many distinct factors, since in our setting each individual symbol contributes to exactly one factor.
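That count can be verified quickly with a small sketch (again my own, not from the paper) that enumerates all distinct non-empty subwords directly:

w = 'aaabaab'
# All contiguous subwords, overlaps allowed; the set removes duplicates.
subwords = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
print(len(subwords))                                  # 19
print(sorted(subwords, key=lambda x: (len(x), x)))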

Regarding practical solutions to these kinds of problems (keep in mind that I am a theoretician, so take this with a grain of salt):

  • To my knowledge, there are no theoretical lower bounds (like NP-hardness) that would rule out solving MF in polynomial time if we consider only input words over a fixed alphabet. There is one caveat, though: if you get a poly-time algorithm, then its running time should be exponential in the number of symbols of the fixed alphabet (or exponential in some function of that)! Otherwise it would also be a polynomial-time algorithm for the case of unbounded alphabets. So, being a theoretician, I would be looking for algorithmic tasks that can be computed in time exponential only in the number of symbols and that somehow help to devise an algorithm for MF.
    On the other hand, it is likely that such an algorithm does not exist and MF is also NP-hard in the fixed-alphabet case.

  • If you are interested in practical solutions, it might be helpful to approximate the solution. A factorisation that is guaranteed to be at least half as large as the optimum in the worst case would not be too bad.

  • Heuristics that do not give a provable approximation ratio, but work well in a practical setting would also be interesting, I guess.

  • Transforming the problem instances into SAT or ILP instances should not be too difficult, and then you could run a SAT or ILP solver to even get optimal solutions.

  • My personal opinion is that even though it is not known whether the fixed-alphabet case of MF is NP-hard, there are enough theoretical insights suggesting that the problem is hard enough that it is justified to look for heuristic solutions etc. that work well in a practical setting.


Bibliography:

[1] Anne Condon, Ján Maňuch, Chris Thachuk: The complexity of string partitioning. J. Discrete Algorithms 32: 24-43 (2015)

[1b] Anne Condon, Ján Maňuch, Chris Thachuk: Complexity of a Collision-Aware String Partition Problem and Its Relation to Oligo Design for Gene Synthesis. COCOON 2008: 265-275

[2] Henning Fernau, Florin Manea, Robert Mercaş, Markus L. Schmid: Pattern Matching with Variables: Fast Algorithms and New Hardness Results. STACS 2015: 302-315

[3] Markus L. Schmid: Computing equality-free and repetitive string factorisations. Theor. Comput. Sci. 618: 42-51 (2016)

[4] Hideo Bannai, Travis Gagie, Shunsuke Inenaga, Juha Kärkkäinen, Dominik Kempa, Marcin Piątkowski, Shiho Sugimoto: Diverse Palindromic Factorization is NP-Complete. Int. J. Found. Comput. Sci. 29(2): 143-164 (2018)

[5] Abraham Flaxman, Aram Wettroth Harrow, Gregory B. Sorkin: Strings with Maximally Many Distinct Subsequences and Substrings. Electr. J. Comb. 11(1) (2004)

Answered By: Markus L. Schmid

Here is a graph-theory based answer.

Modeling
This problem can be modeled as a maximum independent set problem on a graph of size O(n²) as follows:
Let w = c_1, ..., c_n be the input string.
Let G = (V,E) be an undirected graph, built as follows:
V = { (a, b) : a, b in [1, n], a <= b }. The size of V is n(n+1)/2, and each vertex represents a substring of w.
Then, for every pair of vertices (a1, b1) and (a2, b2), we build the edge ((a1, b1), (a2, b2)) iff
(i) [a1, b1] intersects [a2, b2], or
(ii) c_a1 ... c_b1 = c_a2 ... c_b2.
Said otherwise, we build an edge between two vertices if (i) the substrings they represent overlap in w, or (ii) the two substrings are equal.

We can then see why a maximum independent set of G provides the answer to our problem: an independent set of G is exactly a collection of pairwise distinct, pairwise non-overlapping substrings of w, and a split of w into unique substrings is precisely such a collection that covers all of w.

Complexity
In the general case, the maximum independent set (MIS) problem is NP-hard, solvable in time O(1.1996^n) and polynomial space [Xiao, Nagamochi (2017)].
At first I thought that the resulting graph would be a chordal graph (no induced cycle of length > 3), which would have been very nice, since the MIS problem can be solved in linear time on that class of graphs.
But I quickly came to realize that this is not the case: it is quite easy to find examples with induced cycles of length 5 and more.
Actually, the resulting graph does not exhibit any of the ‘nice’ properties that we usually look for and that allow reducing the complexity of the MIS problem to a polynomial one.
This is only an upper bound on the complexity of the problem, since the polynomial-time reduction goes only in one direction (we can reduce this problem to the MIS problem, but not the other way around, at least not trivially). So ultimately we end up solving this problem in O(1.1996^(n(n+1)/2)) in the worst case.
So, alas, I could not prove that it is in P, or that it is NP-complete or NP-hard. One sure thing is that the problem is in NP, but I guess this is not a surprise to anyone.

Implementation
The advantage of reducing this problem to MIS is that MIS is a classical problem, for which several implementations can be found, and that it is also easily written as an ILP.
Here is an ILP formulation of the MIS problem:

Objective function:
maximize sum(X[i], i in 1..n)

Constraints:
for all i in 1..n, X[i] in {0, 1}
for all edges (i, j), X[i] + X[j] <= 1

In my opinion, this should be the most efficient way to solve this problem (using this modeling as a MIS problem), since ILP solvers are incredibly efficient, especially on big instances.

This is an implementation I did using Python 3 and the GLPK solver. To test it, you need an LP solver compatible with the CPLEX LP file format.

from itertools import combinations

def edges_from_string(w):
    # build vertices: (a, b) represents the substring w[a..b] (inclusive)
    vertices = set((a, b) for b in range(len(w)) for a in range(b+1))
    # build edges
    edges = {(a, b): set() for (a, b) in vertices}
    for (a1, b1), (a2, b2) in combinations(edges, 2):
        # case: substrings overlap
        if a1 <= a2 <= b1:
            edges[(a1, b1)].add((a2, b2))
        if a2 <= a1 <= b2:
            edges[(a2, b2)].add((a1, b1))
        # case: equal substrings
        if w[a1:b1+1] == w[a2:b2+1]:
            if a1 < a2:
                edges[(a1, b1)].add((a2, b2))
            else:
                edges[(a2, b2)].add((a1, b1))
    return edges

def write_LP_from_edges(edges, filename):
    # Write the MIS ILP in CPLEX LP format: one binary variable Xa_b per
    # substring, one conflict constraint per edge.
    with open(filename, 'w') as LP_file:
        LP_file.write('Maximize Z: ')
        LP_file.write("\n".join([
            "+X%s_%s" % (a, b)
            for (a, b) in edges
        ]) + '\n')
        LP_file.write('\nsubject to\n')
        for (a1, b1) in edges:
            for (a2, b2) in edges[(a1, b1)]:
                LP_file.write(
                    "+X%s_%s + X%s_%s <= 1\n" %
                    (a1, b1, a2, b2)
                )
        LP_file.write('\nbinary\n')
        LP_file.write("\n".join([
            "X%s_%s" % (a, b)
            for (a, b) in edges.keys()
        ]))
        LP_file.write('\nend\n')

write_LP_from_edges(edges_from_string('aababaa'), 'LP_file_1')
write_LP_from_edges(edges_from_string('kzshidfiouzh'), 'LP_file_2')

You can then solve them with the glpsol command:

glpsol --lp LP_file_1

The aababaa instance gets solved quickly (0.02 sec on my laptop), but as expected, things get (much) tougher as the string size grows.
This program only gives the numeric value (not the optimal partition itself). Nevertheless, the optimal partition and the corresponding substrings can be found with a similar implementation, using an LP solver/Python interface such as pyomo.
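For instance, here is a rough sketch with pyomo (assuming pyomo and GLPK are installed; max_unique_split is a helper name of my own, and it reuses edges_from_string from above) that solves the same MIS model and reads the chosen substrings back from the binary variables:

from pyomo.environ import (ConcreteModel, Var, Objective, ConstraintList,
                           Binary, maximize, SolverFactory)

def max_unique_split(w):
    edges = edges_from_string(w)
    model = ConcreteModel()
    model.x = Var(list(edges), domain=Binary)       # one 0/1 variable per substring
    model.obj = Objective(expr=sum(model.x[v] for v in edges), sense=maximize)
    model.conflicts = ConstraintList()
    for u in edges:
        for v in edges[u]:                          # overlapping or equal substrings
            model.conflicts.add(model.x[u] + model.x[v] <= 1)
    SolverFactory('glpk').solve(model)
    chosen = sorted(v for v in edges if model.x[v].value > 0.5)
    return [w[a:b + 1] for (a, b) in chosen]        # factors in left-to-right order

print(max_unique_split('aababaa'))                  # e.g. ['aab', 'a', 'b', 'aa']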

Time & memory
aababaa: 0.02 seconds, 0.4 MB, value: 4
kzshidfiouzh: 1.4 seconds, 3.8 MB, value: 10
aababababbababab: 60.2 seconds, 31.5 MB, value: 8
kzshidfiouzhsdjfyu: 207.5 seconds, 55.7 MB, value: 14
Note that the LP solver also offers the current lower and upper bounds on the solution, so for the last example, I could get the actual solution as a lower bound after a minute.

Answered By: m.raynal