Find multiple longest common leading substrings with length >= 4

Question:

In Python, I am trying to extract all of the longest common leading substrings (prefixes) that contain at least 4 characters from a list. For example, in the list called “data” below, the 2 longest common substrings that fit my criteria are “johnjack” and “detc”. I know how to find the single longest common prefix with the code below, which prints nothing (as expected) because no prefix is shared by every string in the list. But I am struggling to build a script that can detect multiple common substrings within a list, where each common substring must have a length of 4 or more.

data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']

def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
    for i in data:
        while not i.startswith(prefix) and len(prefix) > 0:
            prefix = prefix[:-1]
    print(prefix)

ls(data) 
Asked By: Stanleyrr


Answers:

Here’s one approach, though it’s probably not the fastest or most efficient. Let’s start with just the data and a container for our answer:

data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []

Note that I added a duplicate of chunganh, since exact duplicates are a common edge case that we should handle.

See “How do I find the duplicates in a list and create another list with them?”

So, to capture the duplicates in the data:

seen = {}
dupes = []

for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

for dupe in dupes:
  substrings.append(dupe)
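
As a side note, the same duplicate detection can be written more compactly with collections.Counter from the standard library. This is only a sketch of an equivalent step, not a change to the approach above; the variable names mirror the ones already used.

```python
from collections import Counter

data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']

# Count how often each exact string occurs, then keep those seen more than once
counts = Counter(data)
dupes = [value for value, n in counts.items() if n > 1]

print(dupes)  # ['chunganh']
```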

Now let’s record the unique values in the data as-is:

# Capture the unique values in the data
last = set(data)

From here, we can loop through our set, trimming one more character off the end of each value on every pass. If the size of the set shrinks, at least two values have collapsed into the same string, which means we’ve found a common leading substring.

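# Handle strings up to 10000 characters long

for k in range(-1, -10000, -1):
  # A negative slice end trims one more character from each string on every pass
  last, middle = set([i[:k] for i in data]), last

  # The set shrank, so at least two values now share the same prefix
  if len(last) != len(middle):
    # Use a separate name here so the outer loop variable k is not shadowed
    for prefix in last:
      count = 0
      for word in middle:
        # Match only at the start of the word, not anywhere inside it
        if word.startswith(prefix):
          count += 1
      if count > 1:
        substrings.append(prefix)
  # Early stopping
  if len(last) == 1:
    break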
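
To see why a shrinking set signals a shared prefix, here is a small illustrative trace on the sample data (the names sizes and trimmed are just for this sketch). It records how many distinct strings remain after trimming 1, 2, and 3 characters from the end.

```python
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']

sizes = []
for k in range(1, 4):
    # Trim k characters from the end of every string and deduplicate
    trimmed = {s[:-k] for s in data}
    sizes.append(len(trimmed))
    print(k, len(trimmed), sorted(trimmed))

# 1 4 ['chungan', 'detc2', 'detc3', 'johnjack']
# 2 3 ['chunga', 'detc', 'johnjac']
# 3 3 ['chung', 'det', 'johnja']
```

The data starts with 5 distinct values. The drop to 4 after the first trim is where johnjack1 and johnjack2 merge, and the drop to 3 after the second is where detc22 and detc32 merge.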

Finally, you mentioned needing only substrings of length 4 or more, so filter out anything shorter:

list(filter(lambda x: len(x) >= 4, substrings))
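
One limitation worth knowing: because the loop above trims every string by the same number of characters, two strings of different lengths (for example 'abcd1' and 'abcd22') never collapse to the same value at length 4 or more, so their shared prefix is missed. The following is a different technique, offered only as a sketch: it counts every prefix directly instead of trimming from the end. The function name common_prefixes and the min_len parameter are made up for this example, and "longest" is assumed to mean "no longer prefix is shared by the same number of strings".

```python
from collections import Counter

def common_prefixes(strings, min_len=4):
    """Return the longest prefixes of length >= min_len shared by 2+ strings."""
    # Count how many strings start with each prefix of length >= min_len
    counts = Counter(
        s[:end]
        for s in strings
        for end in range(min_len, len(s) + 1)
    )
    # Keep only prefixes that occur in at least two strings
    shared = {p: n for p, n in counts.items() if n > 1}
    # Drop a prefix when a longer shared prefix covers the same number of strings
    return sorted(
        p for p, n in shared.items()
        if not any(q != p and q.startswith(p) and m == n
                   for q, m in shared.items())
    )

data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
print(common_prefixes(data))                 # ['chunganh', 'detc', 'johnjack']
print(common_prefixes(['abcd1', 'abcd22']))  # ['abcd']
```

This compares every pair of shared prefixes, so it trades speed for simplicity; for large inputs a trie or a sort-based grouping would likely scale better.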
Answered By: Charles Landau

Very good recipe! Thank you @Charles Landau!

Answered By: theod