Count Each String In A List with One Character Mismatch
Question:
I have a list of strings:
my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
my_list
['AAA', 'AAA', 'BBB', 'BBB', 'DDD', 'DDD', 'DDA']
I need to count how often each element appears in the list. However, if two strings differ by only one character, they should be treated as the same string when counting.
I mostly use the following script to count.
my_list.count('AAA')
However, I am not sure how to implement the mismatch part. I am thinking of running two for loops, comparing each pair of strings, and incrementing the count. That would be O(n^2).
Desired Output
AAA 2
BBB 2
DDD 3
DDA 3
What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!
Answers:
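Let’s start with an unoptimized method to test whether two words are "close". You might look up or import a real library that computes "Levenshtein distance" rather than use my half-baked approach. Note that this helper only accepts words of equal length with at most one differing position, whereas Levenshtein distance also counts insertions and deletions: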
    def is_close_enough(word1, word2):
        # True if the words have the same length and differ in at most one position
        if word1 == word2:
            return True
        if len(word1) != len(word2):
            return False
        # count matching positions; allow at most one mismatch
        return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

    print(is_close_enough("dog", "bog"))
    print(is_close_enough("dog", "bot"))
    print(is_close_enough("dog", "cat"))
    print(is_close_enough("dog", "dogo"))
That should give you:
True
False
False
False
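The comparison could also stop as soon as a second mismatch shows up, which may help with long strings. This is only a minimal sketch, assuming the same rule as above (equal length, at most one differing position); the name has_at_most_one_mismatch and the limit parameter are hypothetical and not part of the original answer:

```python
def has_at_most_one_mismatch(word1, word2, limit=1):
    # Hypothetical variant of is_close_enough that exits early.
    if len(word1) != len(word2):
        return False
    mismatches = 0
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            mismatches += 1
            if mismatches > limit:
                # Too many differences already; no need to scan the rest.
                return False
    return True

print(has_at_most_one_mismatch("dog", "bog"))
print(has_at_most_one_mismatch("dog", "bot"))
```

It returns the same results as is_close_enough for the four test pairs above.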
Now let’s try that in conjunction with your base list of words.
    import collections

    def is_close_enough(word1, word2):
        # True if the words have the same length and differ in at most one position
        if word1 == word2:
            return True
        if len(word1) != len(word2):
            return False
        return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

    my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
    my_list_counted = collections.Counter(my_list)

    # for each distinct word, add up the counts of every word close to it
    print({
        word1: sum(
            count2
            for word2, count2
            in my_list_counted.items()
            if is_close_enough(word1, word2)
        )
        for word1
        in my_list_counted
    })
That should give you:
{'AAA': 2, 'BBB': 2, 'DDD': 3, 'DDA': 3}
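If the list has many distinct words, comparing every pair may become slow. One possible alternative, sketched here under the assumption that "close" means equal length with at most one differing position, is to file each word under keys that leave out one position, so that two words one substitution apart share exactly one key. The names count_with_one_mismatch and buckets are hypothetical and not part of the original answer:

```python
import collections

def count_with_one_mismatch(words):
    counted = collections.Counter(words)

    # Key = (skipped position, text before it, text after it).
    # Words that differ only at position i share the key for position i.
    buckets = collections.Counter()
    for word, count in counted.items():
        for i in range(len(word)):
            buckets[(i, word[:i], word[i + 1:])] += count

    result = {}
    for word, count in counted.items():
        total = sum(
            buckets[(i, word[:i], word[i + 1:])]
            for i in range(len(word))
        )
        # The word itself sits in every one of its own buckets,
        # so remove the extra copies and keep a single one.
        result[word] = total - (len(word) - 1) * count
    return result

my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
print(count_with_one_mismatch(my_list))
```

For the sample list this prints the same dictionary as above. The work grows roughly with the number of distinct words times the word length, rather than with the number of word pairs.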
Addendum:
If you had a specific list of interesting words to look for, rather than wanting counts for every word, you would iterate through that list instead:
    import collections

    def is_close_enough(word1, word2):
        # True if the words have the same length and differ in at most one position
        if word1 == word2:
            return True
        if len(word1) != len(word2):
            return False
        return sum(c1 == c2 for c1, c2 in zip(word1, word2)) >= len(word1) - 1

    my_interesting_words = ["AAA", "DDA"]
    my_list = 'AAA AAA BBB BBB DDD DDD DDA'.split()
    my_list_counted = collections.Counter(my_list)

    # only report counts for the words we care about
    print({
        word1: sum(
            count2
            for word2, count2
            in my_list_counted.items()
            if is_close_enough(word1, word2)
        )
        for word1
        in my_interesting_words
    })