Count Each String In A List with One Character Mismatch

Question:

I have a list of strings:

my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
my_list

['AAA', 'AAA', 'BBB', 'BBB', 'DDD', 'DDD', 'DDA']

I need to count how often each element appears in the list. However, if two strings differ by only one character, they should be treated as the same string when counting.

I usually count with the following:

my_list.count('AAA')

However, I'm not sure how to implement the mismatch part. I am thinking of running two for loops, comparing each pair of strings, and incrementing the count. That would be O(n^2).

Desired Output

AAA 2
BBB 2
DDD 3
DDA 3

What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!

Asked By: Roy


Answers:

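Let's start with an unoptimized check for whether two words are "close". Strictly speaking, it tests for a Hamming distance of at most 1 (same length, at most one differing position), not Levenshtein distance, which would also count insertions and deletions. If you need the latter, you might look up or import a real library rather than use my half-baked approach: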

def is_close_enough(word1, word2):    # Hamming distance <= 1 ?
    if word1 == word2:
        return True

    # Words of different lengths never count as close here.
    if len(word1) != len(word2):
        return False

    # At most one position may differ.
    return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

print(is_close_enough("dog", "bog"))
print(is_close_enough("dog", "bot"))
print(is_close_enough("dog", "cat"))
print(is_close_enough("dog", "dogo"))

That should give you:

True
False
False
False
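Because the check compares characters position by position, it only allows a substitution, never an insertion or deletion, which is why "dog" and "dogo" are not close. As a minimal sketch of the same test with an early exit, the hypothetical helper `within_one_mismatch` below (a name introduced only for illustration) stops scanning as soon as a second mismatch is found:

```python
def within_one_mismatch(word1, word2):
    """Return True if the words have equal length and differ in at most one position."""
    if len(word1) != len(word2):
        return False

    mismatches = 0
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            mismatches += 1
            # A second mismatch settles the answer, so stop early.
            if mismatches > 1:
                return False
    return True


print(within_one_mismatch("dog", "bog"))
print(within_one_mismatch("dog", "bot"))
```

For short strings like the ones in the question the difference is negligible; the early exit is only likely to matter for long strings.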

Now let’s try that in conjunction with your base list of words.

import collections

def is_close_enough(word1, word2):    # Hamming distance <= 1 ?
    if word1 == word2:
        return True

    # Words of different lengths never count as close here.
    if len(word1) != len(word2):
        return False

    # At most one position may differ.
    return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
my_list_counted = collections.Counter(my_list)

print({
    word1: sum(
        count2
        for word2, count2
        in my_list_counted.items()
        if is_close_enough(word1, word2)
    )
    for word1
    in my_list_counted
})

That should give you:

{'AAA': 2, 'BBB': 2, 'DDD': 3, 'DDA': 3}
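Counting with `collections.Counter` first means only distinct words get compared, but the comparison is still pairwise over those distinct words. If that becomes a bottleneck, one possible alternative (a sketch, not something the question or this answer requires) is to bucket words by "masked" patterns: two equal-length words that differ at exactly one position share the pattern with that position removed. The function name `count_with_one_mismatch` and the bucket key layout are assumptions made for illustration:

```python
import collections


def count_with_one_mismatch(words):
    """Map each distinct word to the total count of words within one mismatch of it."""
    counts = collections.Counter(words)

    # Bucket key: (masked position, text before it, text after it).
    # Two equal-length words that differ only at position i share exactly this key.
    buckets = collections.Counter()
    for word, count in counts.items():
        for i in range(len(word)):
            buckets[(i, word[:i], word[i + 1:])] += count

    result = {}
    for word, count in counts.items():
        total = count
        for i in range(len(word)):
            # Each bucket includes the word's own occurrences, so subtract them.
            total += buckets[(i, word[:i], word[i + 1:])] - count
        result[word] = total
    return result


my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
print(count_with_one_mismatch(my_list))
# {'AAA': 2, 'BBB': 2, 'DDD': 3, 'DDA': 3}
```

This trades the pairwise comparison for work proportional to the number of distinct words times their length (plus the cost of slicing), at the price of extra memory for the buckets.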

Addendum:

If you had a specific list of interesting words to look up, rather than wanting counts for every word in the list, you would iterate through that list instead:

import collections

def is_close_enough(word1, word2):    # Hamming distance <= 1 ?
    if word1 == word2:
        return True

    # Words of different lengths never count as close here.
    if len(word1) != len(word2):
        return False

    # At most one position may differ.
    return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

my_interesting_words = ["AAA", "DDA"]
my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
my_list_counted = collections.Counter(my_list)

print({
    word1: sum(
        count2
        for word2, count2
        in my_list_counted.items()
        if is_close_enough(word1, word2)
    )
    for word1
    in my_interesting_words
})
Answered By: JonSG