Find most common substring in a list of strings?

Question:

I have a Python list of string names where I would like to remove a common substring from all of the names.

And after reading this similar answer I could almost achieve the desired result using SequenceMatcher.

But only when all items have a common substring:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges

common substring = "myKey_"

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges

However I have a slightly noisy list that contains a few scattered items that don’t fit the same naming convention.

I would like to remove the “most common” substring from the majority:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges
string 4 = foo
string 5 = myKey_Banannas

common substring = ""

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges
string 4 = foo
string 5 = Banannas

I need a way to match the “myKey_” substring so I can remove it from all names.

But when I use SequenceMatcher, the item “foo” causes the “longest match” to be the empty string “”.

I think the only way to solve this is to find the “most common substring”. But how could that be accomplished?


Basic example code:

from difflib import SequenceMatcher

names = ["myKey_apples",
"myKey_appleses",
"myKey_oranges",
#"foo",
"myKey_Banannas"]

string2 = names[0]
for i in range(1, len(names)):
    string1 = string2
    string2 = names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

print(string1[match.a: match.a + match.size]) # -> myKey_
Asked By: Logic1


Answers:

Here’s an overly verbose solution to your problem:

def find_matching_key(list_in, max_key_only = True):
  """
  Returns the longest matching key (a prefix of some word)
  with the highest frequency in the list
  """
  keys = {}
  curr_key = ''

  # If n does not exceed max_n, don't bother adding
  max_n = 0

  for word in set(list_in): # unique values only, to speed things up
    for i in range(len(word)):
      # Look up the whole word, then one letter less each time
      curr_key = word[0:len(word)-i]
      # If not seen yet, count the strings that contain it
      if curr_key not in keys and curr_key != '':
        n = 0
        for word2 in list_in:
          if curr_key in word2:
            n += 1
        # Only a count above the current maximum goes into the dictionary
        if n > max_n:
          max_n = n
          keys[curr_key] = n
    # Finish the word
  # Finish for loop
  if max_key_only:
    return max(keys, key=keys.get)
  else:
    return keys

# Create your "from list"
From_List = [
             "myKey_apples",
             "myKey_appleses",
             "myKey_oranges",
             "foo",
             "myKey_Banannas"
]

# Use the function
key = find_matching_key(From_List, True)

# Iterate over your list, replacing values
new_From_List = [x.replace(key,'') for x in From_List]

print(new_From_List)
# ['apples', 'appleses', 'oranges', 'foo', 'Banannas']

Needless to say, this solution would look a lot neater with recursion. Thought I’d sketch out a rough dynamic programming solution for you though.
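
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: count every prefix of every name with collections.Counter, then pick the prefix shared by the most names, breaking ties by length. The function name most_common_prefix and the tie-breaking rule are assumptions of this sketch. It also strips the key only at the start of each name, whereas str.replace removes it wherever it occurs.

```python
from collections import Counter


def most_common_prefix(names):
    """Return the prefix shared by the most names; ties go to the longest prefix."""
    counts = Counter()
    for name in names:
        # Count each non-empty prefix of this name.
        for end in range(1, len(name) + 1):
            counts[name[:end]] += 1
    if not counts:
        return ""
    # Highest count first, then the longest prefix among equally common ones.
    return max(counts, key=lambda prefix: (counts[prefix], len(prefix)))


names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]
key = most_common_prefix(names)

# Strip the key only where it is a true prefix.
stripped = [n[len(key):] if n.startswith(key) else n for n in names]
print(key)       # myKey_
print(stripped)  # ['apples', 'appleses', 'oranges', 'foo', 'Banannas']
```

Note that if no prefix is shared at all, the longest name ends up as its own "most common" prefix, so a caller may want to require a minimum count before stripping anything.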

Answered By: Yaakov Bressler

Given names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]

An O(n^2) solution I can think of (n being the number of names) is to find the longest common substring of every pair of names and store each one in a dictionary with the number of times it occurs:

from difflib import SequenceMatcher

substring_counts = {}

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        string1 = names[i]
        string2 = names[j]
        match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
        matching_substring = string1[match.a:match.a + match.size]
        # Count how many pairs share this longest common substring
        substring_counts[matching_substring] = substring_counts.get(matching_substring, 0) + 1

print(substring_counts) # {'myKey_apples': 1, 'myKey_': 5, '': 3, 'o': 1} (insertion order, Python 3.7+)

Then pick the substring that occurs most often:

import operator
# dict.iteritems() exists only in Python 2; items() works in Python 3
max_occurring_substring = max(substring_counts.items(), key=operator.itemgetter(1))[0]
print(max_occurring_substring) #myKey_
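
For Python 3, the same pairwise tally can be written with itertools.combinations and collections.Counter. This is a sketch of the approach above, not a faster algorithm: it still compares every pair of names. Dropping empty matches is an assumption added here so that outliers such as "foo" can never win the vote, and the helper name longest_common_substring is made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]


def longest_common_substring(a, b):
    """Longest contiguous block shared by a and b, as found by difflib."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# Tally the longest common substring of every unordered pair of names.
substring_counts = Counter(
    longest_common_substring(a, b) for a, b in combinations(names, 2)
)
# Assumption of this sketch: an empty match carries no information, so drop it.
substring_counts.pop("", None)

# Fall back to "" when there were no pairs or no non-empty matches.
most_common = substring_counts.most_common(1)[0][0] if substring_counts else ""
print(most_common)  # myKey_
print([name.replace(most_common, "", 1) for name in names])
# ['apples', 'appleses', 'oranges', 'foo', 'Banannas']
```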
Answered By: Sruthi

I would first find the starting letter with the most occurrences. Then I would take every word that has that starting letter and extend the prefix for as long as all of these words have matching letters. Finally, I would remove the prefix that was found from each word that starts with it:

from collections import Counter

strings = ["myKey_apples", "myKey_appleses", "myKey_oranges", "berries"]

def remove_mc_prefix(words):
    # Count how often each first letter occurs
    cnt = Counter()
    for word in words:
        cnt[word[0]] += 1
    # Take the most frequent first letter, not just the first one seen
    first_letter = cnt.most_common(1)[0][0]

    filter_list = [word for word in words if word[0] == first_letter]
    filter_list.sort(key=len) # Shortest first, to avoid an index out of bounds

    prefix = ""
    length = len(filter_list[0])
    for i in range(length):
        test = filter_list[0][i]
        if all(word[i] == test for word in filter_list):
            prefix += test
        else:
            break
    return [word[len(prefix):] if word.startswith(prefix) else word for word in words]

print(remove_mc_prefix(strings))

Out: ['apples', 'appleses', 'oranges', 'berries']
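
The same idea can lean on the standard library, sketched here under a couple of assumptions: Counter.most_common picks the dominant first letter, and os.path.commonprefix (which, despite its module, compares arbitrary strings character by character) returns the shared prefix of that group. The function name strip_majority_prefix is hypothetical, and the sketch assumes no name is an empty string.

```python
import os.path
from collections import Counter


def strip_majority_prefix(words):
    """Strip the common prefix of the words sharing the most frequent first letter."""
    if not words:
        return []
    # Most frequent first character (assumes every word is non-empty).
    first_letter = Counter(word[0] for word in words).most_common(1)[0][0]
    group = [word for word in words if word[0] == first_letter]
    # commonprefix works character by character on any list of strings.
    prefix = os.path.commonprefix(group)
    return [word[len(prefix):] if word.startswith(prefix) else word for word in words]


# The outlier comes first on purpose: the majority letter is still found.
strings = ["berries", "myKey_apples", "myKey_appleses", "myKey_oranges"]
print(strip_majority_prefix(strings))
# ['berries', 'apples', 'appleses', 'oranges']
```

If the dominant group has a single member, its whole name counts as the prefix, so it may be worth requiring a minimum group size before stripping.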

Answered By: RoyM

To find the most common substring from a list of strings:

I have already tested this, and I hope it will work for you.
I have the same use case but a different kind of task: I just need to find one common substring in a list of more than a hundred files.

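Your basic example code does not work in my case, because it compares the 1st name with the 2nd, the 2nd with the 3rd, the 3rd with the 4th, and so on, and only the last comparison survives. So I changed it to keep the most common substring found so far and compare that with each name in turn.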

The downside of this code is that if any name has nothing in common with the most common substring found so far, the final most common substring will be an empty string.
But in my case, it works.


from difflib import SequenceMatcher

# names is the list of strings from the question
for i in range(1, len(names)):
    if i==1:
        string1, string2 = names[0], names[i]
    else:
        string1, string2 = most_common_substring, names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
    most_common_substring = string1[match.a: match.a + match.size]

print(f"most_common_substring : {most_common_substring}")
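
One way to soften that downside, sketched under assumptions: keep the running substring, but ignore any name whose longest match with it is shorter than a threshold. The helper name running_common_substring and the min_len parameter are hypothetical, and the result still depends on the first name being representative of the majority.

```python
from difflib import SequenceMatcher


def running_common_substring(names, min_len=3):
    """Narrow a running common substring, skipping names that match too little."""
    if not names:
        return ""
    common = names[0]
    for name in names[1:]:
        m = SequenceMatcher(None, common, name).find_longest_match(
            0, len(common), 0, len(name)
        )
        # Treat a match shorter than min_len as an outlier and keep the old value.
        if m.size >= min_len:
            common = common[m.a:m.a + m.size]
    return common


names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]
print(running_common_substring(names))  # myKey_
```

With min_len=0 the function behaves like the loop above and collapses to an empty string as soon as "foo" is reached.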

Answered By: P_M