Recursive function to search for a substring, ignoring punctuation, but retruning the indexes of the substring, including punctuation

Question:

I am trying, but struggling, to write an algorithm that checks if a substring exists in a piece of text. The piece of text can include punctuation, but only alphanumeric characters should be considered when searching for the substring.

I would like to return the start and end index of the substring. It is guaranteed that the substring exists in the text. However, these indexes should also account for punctuation that was ignored in the search.

For example, for the text BD ACA;B_ 1E and the substring AB1, the algorithm should return 5 and 11 as the start and end index. (text[5:11] -> A;B_ 1 == AB1 with punctuation removed.)

This is the best I have done so far.

def search(text, sub):
    print(text, sub)
    if not sub:
        return True

    for i, char in enumerate(text):
        if char == sub[0]:
            return search(text[1:], sub[1:])
        else:
            return search(text[1:], sub)


result = search("BD ACA;B_ 1E", "AB1")
print(result)
  1. I am printing out the two strings at the entry point to the function just to see how it progresses.
  2. If the first n chars of the substring are found, but not the rest, then the search doesn’t restart to search for the entire substring again. This is what I was attempting to do with the for loop, but can’t get it working. This means that it will return true if call characters of the substring simply exist somewhere, in order, in the text.
  3. I have just tried to initially get the search working successfully, I haven’t even tried to track and return the start and end indexes yet.
Asked By: KOB

||

Answers:

You can use a regular expression for that. [W_]* will match any non-alphanumeric character sequence, so you could alternate the letters of the search string with that pattern:

import re

def search(text, sub):
    match = re.search(r"[W_]*".join(sub), text)
    return match and match.span(0)

text = "BD ACA;B_ 1E"
span = search(text, "AB1")
if span:
    start, end = span
    print(start, end)
    print(text[start:end])
Answered By: trincot

There is a string function isalnum() that can check for alphanumeric chars.

t = 'BD ACA;B_ 1'

s = 'AB1'

def is_in_st(t, s):

    start = t.rfind(s[0])
    end = t.rfind(s[-1]) + 1

    if start and end:

        return s == ''.join([c for c in t[start:end] if c.isalnum()])

is_in_st(t, s)

True
Answered By: LetzerWille
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.