Combining lists with overlapping elements

Question:

I have a collection of lists, some of which have overlapping elements:

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

I want to pool together only the lists that overlap, which would yield

pooled = [['aaaa', 'aaab', 'abaa','bbbb','cccc'], 
          ['eeee','eeef','gggg','gggi','hhhh','iiii'],
          ['dddd', 'dddd']]

(In case it’s not clear, the first and second list both overlap with the third and should therefore all be merged together, even though they don’t themselves contain elements in common.)

“Overlap” means two lists have at least one element in common. “Merge” means joining the two lists into either a single flat list or a single flat set.

There may be several sets, e.g. x, y, and z overlap with each other, v and w overlap with each other, but x+y+z don’t overlap with v+w. And there may be lists that don’t overlap with anything.

(An analogy is families. Join all the Montagues together, join all the Capulets together, but no Montague has ever married into the Capulets so those two clusters will remain distinct.)

I don’t care whether duplicate items are included multiple times or not.

What’s a simple and reasonably fast way of doing this in Python?

Edit: This does not seem to be a duplicate of Yet another merging list of lists, but most pythonic way because that doesn’t seem to consider groups that only overlap through a third set. The solutions I’ve tried from that question don’t yield the answer I’m looking for here.

Asked By: iayork

||

Answers:

Here’s a way to do it (assuming you want unique elements over the overlapping results):

def over(coll):
     print('Input is:n', coll)
     # gather the lists that do overlap 
     overlapping = [x for x in coll if any(x_element in [y for k in coll if k != x for y in k] for x_element in x)] 
     # flatten and get unique 
     overlapping = sorted(list(set([z for x in overlapping for z in x]))) 
     # get the rest
     non_overlapping = [x for x in coll if all(y not in overlapping for y in x)] 
     # use the line bellow only if merged non-overlapping elements are desired
     # non_overlapping = sorted([y for x in non_overlapping for y in x]) 
     print('Output is"n',[overlapping, non_overlapping])

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa']]
over(coll)
coll = [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
over(coll)

output:

$ python3 over.py                                                                                                                                                              -- NORMAL --
Input is:
 [['aaaa', 'aaab', 'abaa'], ['bbbb', 'bbbb'], ['aaaa', 'bbbb'], ['dddd', 'dddd'], ['bbbb', 'bbbb', 'cccc', 'aaaa']]
Output is"
 [['aaaa', 'aaab', 'abaa', 'bbbb', 'cccc'], [['dddd', 'dddd']]]
Input is:
 [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
Output is"
 [[], [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]]


Answered By: npit

Working on a suggestion from alkasm in the comments, I used networkx:

import networkx as nx

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

edges = []
for i in range(len(coll)):
    a = coll[i]
    for j in range(len(coll)):
        if i != j:
            b = coll[j]
            if set(a).intersection(set(b)):
                edges.append((i,j))

G = nx.Graph()
G.add_nodes_from(range(len(coll)))
G.add_edges_from(edges)

for c in nx.connected_components(G):
    combined_lists = [coll[i] for i in c]
    flat_list = [item for sublist in combined_lists for item in sublist]
    print(set(flat_list))

Output:

{'cccc', 'bbbb', 'aaab', 'aaaa', 'abaa'}
{'dddd'}
{'eeef', 'eeee', 'hhhh', 'gggg', 'gggi', 'iiii'}

Undoubtedly this can be optimized but it seems to solve my problem for now.

Answered By: iayork

You can do this with sets using a successive merging approach:

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

pooled = [set(subList) for subList in coll]
merging = True
while merging:
    merging=False
    for i,group in enumerate(pooled):
        merged = next((g for g in pooled[i+1:] if g.intersection(group)),None)
        if not merged: continue
        group.update(merged)
        pooled.remove(merged)
        merging = True

print(pooled)
# [{'aaaa', 'abaa', 'aaab', 'cccc', 'bbbb'}, {'dddd'}, {'gggg', 'eeef', 'eeee', 'hhhh', 'gggi', 'iiii'}]
Answered By: Alain T.

I’m not sure how this code goes against others in regard of speed, but its quite simple.

for i in range(len(coll)-1,-1,-1):
    for j in range(i-1,-1,-1):
        if set(coll[j]) & set(coll[i]):
            coll[j].extend(coll.pop(i))
            break

output:

[['aaaa', 'aaab', 'abaa', 'bbbb', 'bbbb', 'aaaa', 'bbbb', 'bbbb', 'bbbb', 'cccc', 'aaaa'], 
 ['dddd', 'dddd'], 
 ['eeee', 'eeef', 'gggg', 'gggi', 'gggg', 'hhhh', 'iiii']]

what is good, it’s in place and outputs duplicated values in every sublist – both of those things can be easly changed:

  • just make duplicate of coll at the begining
  • coll = [list(set(x)) for x in cell] – to have only distinct elements in every sublist
Answered By: P.Świcarz
Categories: questions Tags:
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.