Concatenating similar items in a list – Python

Question:

I have a list of similar and unique words. Similar words appear together in one string, separated by "|".

input = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]

I want to get the following output, so that car, cat, caat, and caar all show up as one group of similar words instead of as repeated pairs. The target output is:

output= ["car | cat | caat | caar", "dog" , "ant | ants"]

So far, I’ve managed to get ["car | cat | caat | caar", "dog", "ant", "ants"]. But I want to keep "ant | ants" intact since it doesn’t have any word in common with any other pairs.

Is someone able to write Python code to solve this problem?

Edit:

Here is the code for my attempt, but please don't feel that you have to use the same approach.

from collections import Counter


def concat_common_words(my_list):
    split_my_list = [x.split(" | ") for x in my_list]

    # flatten the pairs into one list of words
    flat_my_list = [i for j in split_my_list for i in j]

    # words that occur in more than one string
    count_my_list = Counter(flat_my_list)
    common = [k for k, v in count_my_list.items() if v > 1]

    # strings that contain at least one common word
    target_my_list = [x for x in my_list if any(c in x for c in common)]

    flat_target_my_list = set(sf for sfs in target_my_list for sf in sfs.split(" | "))

    # parentheses keep the concatenation in a single expression
    merged = ([" | ".join(flat_target_my_list)]
              + list(set(flat_my_list) - flat_target_my_list))

    return merged


concat_common_words(["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"])

It returns ["car | cat | caat | caar", "dog", "ant", "ants"]. But as I mentioned, I want to keep "ant | ants" intact.

Asked By: y_e

Answers:

# I would create a set() for each group e.g. car | cat
# when adding a new group I would then merge with any existing group if
# they intersect.

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]


groups = []
for item in data:
    words = set(item.split(" | "))
    to_remove = []
    for existing_group in groups:
        if words.intersection(existing_group):
            words.update(existing_group)
            to_remove.append(existing_group)
    for removal in to_remove:
        groups.remove(removal)
    groups.append(words)

# convert groups back to pipe separated
final_groups = [" | ".join(group) for group in groups]
Answered By: Simon Ward-Jones

If you want to use the Levenshtein distance, proceed as follows:

from Levenshtein import distance as lev

data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]

# set the desired threshold
threshold = 1

# create a unique set of trimmed strings
dataset = {e.strip() for s in data for e in s.split('|')}

# create a list of dicts to track which strings are already taken
dl = [ { 'name': s, 'taken': False } for s in dataset ]

dd = []

for i in range(0, len(dl)):
    # check whether it is not taken
    if dl[i]['taken'] is False:
        ds = set()
        dl[i]['taken'] = True
        ds.add(dl[i]['name'])
        for j in range(i + 1, len(dl)):
            # check whether it is not taken and satisfying distance
            if dl[j]['taken'] is False and lev(dl[i]['name'], dl[j]['name']) <= threshold:
                dl[j]['taken'] = True
                ds.add(dl[j]['name'])
        dd.append(' | '.join(ds))
    
print(dd)

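# Each word is taken only once, so it ends up in exactly one group.
# The grouping is greedy and follows the set's iteration order,
# so the printed result can differ between runs.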
Answered By: Marco Riccetti
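
The greedy pass above compares every candidate only with the first word of a group, so the result depends on iteration order, and words linked through a chain (car, cat, caat) may end up in different groups. If chained similarity should count, one possible variation is to merge transitively with a small union-find. This is a rough sketch under a few assumptions: `edit_distance` and `cluster_by_distance` are names invented here, and the distance is a plain dynamic-programming Levenshtein written out in pure Python so that the example runs without the third-party package.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def cluster_by_distance(items, threshold=1, sep="|"):
    # unique, trimmed words in first-seen order (dicts keep insertion order)
    words = list(dict.fromkeys(
        w.strip() for item in items for w in item.split(sep) if w.strip()
    ))
    parent = {w: w for w in words}

    def find(w):
        # follow parent links to the root, shortening the path on the way
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # link every pair of words that is within the threshold
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if edit_distance(a, b) <= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_b] = root_a

    # collect the words under their root, keeping first-seen order
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return [" | ".join(group) for group in clusters.values()]


data = ["car | cat", "cat | caat", "car | caar", "dog", "ant | ants"]
print(cluster_by_distance(data, threshold=1))
# ['car | cat | caat | caar', 'dog', 'ant | ants']
```

The pairwise comparison is quadratic in the number of unique words, which is fine for short lists like this one but would need a smarter candidate search for large vocabularies.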