Pythonic way to merge two overlapping lists, preserving order

Question:

Alright, so I have two lists, as such:

  • They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
  • There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
  • The lists are not necessarily ordered or unique, for example, [9, 1, 1, 8, 7], [8, 6, 7].

I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition

What I would like to know is – is there a more efficient way of doing this? It works, but I’m slightly leery of this, because it can run into large runtimes in my application – I’m merging large lists of strings.

EDIT: I’d expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be: [1,3,9,8,3,4,5,7,8]. For clarity, the overlapping portion is 3,4,5.

[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]

Asked By: Firnagzen


Answers:

You can try the following:

>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]

>>> matches = (i for i in range(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]

The idea is to compare the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, from the length of b down to 1 (range(len(b), 0, -1)), because we want to match as much as possible. next returns the first such i, or the default 0 if there is none (next(..., 0)). Once we have i, we append to a the elements of b from index i onward.

Answered By: JuniorCompressor

There are a couple of easy optimizations that are possible.

  1. You don’t need to start at master[1], since the longest overlap starts at master[-len(addition)]

  2. If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:

This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

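def merge(master, addition):
    if not addition:
        return master + addition
    first = addition[0]
    n = max(len(master) - len(addition), 0)  # (1) earliest index where an overlap can start
    while True:
        try:
            n = master.index(first, n)       # (2) next candidate start in master
        except ValueError:
            return master + addition

        overlap = len(master) - n            # length of the candidate overlap
        if master[n:] == addition[:overlap]:
            return master + addition[overlap:]
        n += 1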
Answered By: thebjorn

One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(1, min(len(addition), len(master)) + 1) (and don’t increment n in the loop). If there is no match, your current code will iterate over the entire master list, even once the slices being compared can no longer have the same length.
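A minimal sketch of that bounded loop, assuming Python 3. The name merge_bounded is made up for illustration, and unlike the loop in the question it also tries the case where the overlap covers all of master:

```python
def merge_bounded(master, addition):
    # An overlap cannot be longer than the shorter of the two lists.
    max_overlap = min(len(master), len(addition))
    # Try overlap lengths 1, 2, ..., max_overlap (shortest first, as in the question).
    for n in range(1, max_overlap + 1):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition


print(merge_bounded([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
print(merge_bounded([9, 1, 1, 8, 7], [8, 6, 7]))
```

With a huge master and a short addition, the number of slice comparisons is then bounded by len(addition) rather than len(master).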

Another concern is that you’re taking slices of master and addition in order to compare them, which creates two new lists every time, and isn’t really necessary. This solution (inspired by Boyer-Moore) doesn’t use slicing:

def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition

The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.

The function currently assumes that master is longer than addition (you’ll probably get an IndexError at master[-overlap_len + i] if it isn’t). Add a condition to the overlap_lens generator if you can’t guarantee it.

It’s also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that’s what you meant by “to merge at the last possible valid position”. If you want a greedy version, build overlap_lens as a list instead of a generator and iterate over it in reverse.
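A sketch of the greedy variant, assuming Python 3; the name merge_greedy is made up here. A generator cannot be passed to reversed(), so the candidate lengths are collected in a list first, and a length guard is folded into the same comprehension so that master may be shorter than addition:

```python
def merge_greedy(master, addition):
    if not master:
        return master + addition
    last = master[-1]
    # Candidate overlap lengths: positions in addition that hold master's last
    # element, skipping lengths that would exceed len(master).
    overlap_lens = [i + 1 for i, e in enumerate(addition)
                    if e == last and i + 1 <= len(master)]
    # Longest candidate first, so the first hit is the greedy answer.
    for overlap_len in reversed(overlap_lens):
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition


print(merge_greedy([1, 2, 2], [2, 2, 3]))  # [1, 2, 2, 3]
```

Like the original, this compares element by element and only slices once, to build the result.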

Answered By: dddsnn

I don’t offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return x_longest - longest, x_longest

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
# Only valid when the common run is the end of master and the start of addition
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)

master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)

This prints:

[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]
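
The snippet above trusts that the longest common run sits at the end of master and the start of addition. A sketch of a variant that enforces this, using the same dynamic-programming recurrence but keeping only one row of the table at a time (the function names are made up for illustration, and Python 3 is assumed):

```python
def suffix_prefix_overlap(master, addition):
    """Length of the longest suffix of master that is also a prefix of addition."""
    prev = [0] * (len(addition) + 1)
    for x in range(1, len(master) + 1):
        cur = [0] * (len(addition) + 1)
        for y in range(1, len(addition) + 1):
            if master[x - 1] == addition[y - 1]:
                cur[y] = prev[y - 1] + 1
        prev = cur
    # prev is now the last row: prev[y] is the length of the common run that
    # ends at master[-1] and addition[y - 1]. The run starts at addition[0]
    # exactly when its length equals y.
    return max((y for y in range(1, len(addition) + 1) if prev[y] == y), default=0)


def merge_dp(master, addition):
    return master + addition[suffix_prefix_overlap(master, addition):]


print(merge_dp([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
print(merge_dp([1, 2, 3, 9, 1], [1, 2, 3]))
```

This still costs time proportional to len(master) * len(addition), so it illustrates the idea rather than speeding anything up.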
Answered By: Ale

First of all and for clarity, you can replace your while loop with a for loop:

def merge(master, addition):
    for n in range(1, len(master)):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition

Then, you don’t have to compare all possible slices, but only those for which master’s slice starts with the first element of addition:

def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition

So instead of comparing slices like this:

1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579

you are only doing these comparisons:

1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579

How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.

You could also generate a list of indices for addition so its own slices always end with master’s last element, further restricting the number of comparisons.
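
A sketch of that combined filter, assuming Python 3; the name merge_filtered and the use of sets are illustrative choices, not part of the code above. An overlap length n is only worth checking if master's slice of length n starts with addition[0] and addition's slice of length n ends with master[-1]:

```python
def merge_filtered(master, addition):
    if not master or not addition:
        return master + addition
    first, last = addition[0], master[-1]
    # Overlap lengths for which master's slice starts with addition[0] ...
    from_master = {len(master) - i for i, x in enumerate(master) if x == first}
    # ... and overlap lengths for which addition's slice ends with master[-1].
    from_addition = {j + 1 for j, y in enumerate(addition) if y == last}
    # Only lengths satisfying both conditions can be real overlaps; longest first.
    for n in sorted(from_master & from_addition, reverse=True):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition


print(merge_filtered([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```

For a huge master it would probably make sense to scan only its last len(addition) elements, since no overlap can start earlier than that.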

Answered By: Roberto Bonvallet

Based on https://stackoverflow.com/a/30056066/541208:

def join_two_lists(a, b):
    index = 0
    for i in range(len(b), 0, -1):
        # if the first i elements of b equal the last i elements of a,
        # that is the overlap: remember its length and stop
        if b[:i] == a[-i:]:
            index = i
            break

    return a + b[index:]
Answered By: TankorSmash

This actually isn’t too terribly difficult. After all, essentially all you’re doing is checking what substring at the end of A lines up with what substring of B.

def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]

We can test with your test data by doing:

test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)

This checks every slice length that could form an overlap, longest first, and stops at the first match. If nothing is found, the loop ends with i equal to 0. Either way, it returns all of a plus everything from b[i] onward: in the overlap case, that’s the non-overlapping portion; in the non-overlap case, it’s everything.

Note that we can make a couple of optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any overlap. You could add a quick check at the beginning that short-circuits that worst case:

def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...

In fact you could take that solution one step further and probably make your algorithm much faster:

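def merge(a, b):
    start = 0  # where to resume searching b
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b, left to right
        except ValueError:  # a[-1] not in b[start:]
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]
        start = idx  # no match here; try the next occurrence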

However this might not find the longest overlap in cases like:

a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]

You could fix that by searching b from the right instead of from the left (lists have no rindex method, so this means scanning b in reverse) to match the longest slice instead of the shortest, but I’m not sure what that does to your speed. It’s certainly slower, but it might be inconsequential. You could also collect the results and return the shortest one, which might be a better idea.

def merge(a, b):
    results = []
    start = 0  # where to resume searching b
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no further occurrence of a[-1] in b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
        start = idx
    return min(results, key=len)

Which should work since merging the longest overlap should produce the shortest result in all cases.
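
The right-to-left search mentioned earlier can also be sketched directly, assuming Python 3. Since lists have index but no rindex, this walks the candidate overlap lengths downward instead; the name merge_longest is made up for illustration:

```python
def merge_longest(a, b):
    if not a:
        return a + b
    last = a[-1]
    # Candidate overlap lengths from the longest possible down to 1.
    for idx in range(min(len(a), len(b)), 0, -1):
        # Cheap element check first; only slice when b[idx - 1] matches a[-1].
        if b[idx - 1] == last and a[-idx:] == b[:idx]:
            return a + b[idx:]
    return a + b


print(merge_longest([1, 2, 3, 4, 1, 2, 3, 4], [3, 4, 1, 2, 3, 4, 5, 6]))
# [1, 2, 3, 4, 1, 2, 3, 4, 5, 6]
```

Because the first match found is already the longest one, no list of intermediate results is needed.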

Answered By: Adam Smith

All the above solutions are similar in that they use a for / while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are way too slow for merging two large-scale lists (e.g. lists with millions of elements).

I found that using pandas to concatenate large lists is much more time-efficient (note that pd.concat only concatenates; it does not detect or remove an overlap):

import pandas as pd, numpy as np

trainCompIdMaps = pd.DataFrame( { "compoundId": np.random.permutation( range(800) )[0:80], "partition": np.repeat( "train", 80).tolist()} )

testCompIdMaps = pd.DataFrame( {"compoundId": np.random.permutation( range(800) )[0:20], "partition": np.repeat( "test", 20).tolist()} )

# row-wise concatenation of the two DataFrames
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)

mergedCompIds = np.array(compoundIdMaps["compoundId"])
Answered By: Good Will

What you need is a sequence alignment algorithm like Needleman-Wunsch.

Needleman-Wunsch is a global sequence alignment algorithm based on dynamic programming:
[Figure: Needleman-Wunsch matrix; source: Wikipedia]
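
To give a feel for the dynamic programming involved, here is a minimal sketch that fills the Needleman-Wunsch score table and returns the optimal global alignment score. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; this is not how the library below is implemented:

```python
def needleman_wunsch_score(seq_1, seq_2, match=1, mismatch=-1, gap=-1):
    rows, cols = len(seq_1) + 1, len(seq_2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per element.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align the two elements
                              score[i - 1][j] + gap,       # gap in seq_2
                              score[i][j - 1] + gap)       # gap in seq_1
    return score[-1][-1]


a = 'The quick brown fox jumped over the lazy dog'.split(' ')
b = 'The brown fox leaped over the lazy dog'.split(' ')
print(needleman_wunsch_score(a, b))  # 5: seven matches, one mismatch, one gap
```

Recovering the alignment itself requires a traceback through the table, which is omitted here.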

I found this nice implementation to merge arbitrary object sequences in python:
https://github.com/ajnisbet/paired

import paired

seq_1 = 'The quick brown fox jumped over the lazy dog'.split(' ')
seq_2 = 'The brown fox leaped over the lazy dog'.split(' ')
alignment = paired.align(seq_1, seq_2)

print(alignment)
# [(0, 0), (1, None), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7)]

for i_1, i_2 in alignment:
    print((seq_1[i_1] if i_1 is not None else '').ljust(15), end='')
    print(seq_2[i_2] if i_2 is not None else '')

# The            The
# quick          
# brown          brown
# fox            fox
# jumped         leaped
# over           over
# the            the
# lazy           lazy
# dog            dog
Answered By: Hoeze