How to prevent a regex from matching substrings of words?

Question

I have a regex in Python and I want to prevent it from matching substrings. I want to add ‘@’ at the beginning of words that consist only of alphanumeric characters and _ and are 4 to 15 characters long. But it also matches substrings of larger words. I have this function:

import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        # \1 refers to the captured word
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'@\1', str(sent))
    return sents

Here is an example:

mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)

And the result is:

['@ali_s @ali_t @ali_u @aabs:/t.co/@kMMALke2l9']

As you can see, it puts ‘@’ at the beginning of ‘aabs’ and ‘kMMALke2l9’, which is wrong.
I tried to edit the code as below:

def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'@\1', str(sent))
    return sents

But then the result becomes:

['@ali_s ali_t@ ali_u aabs:/t.co/kMMALke2l9']

As you can see, it makes the wrong replacements.
The correct result I expect is:

"@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9"

Could anyone help?
Thanks

Asked By: HosseinSedghian

Answer 1

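I am not sure what you are trying to accomplish, but the reason the @ ends up in the wrong places is that once you add \s or ^ to the regex, the whitespace becomes part of the match, so the @ is inserted before the whitespace. The trailing whitespace is also consumed by the match, so the next word no longer has a leading whitespace character to match against and is skipped.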

You could try splitting it into two checks:

check at the beginning of the string and put the @ at the first position, and
check after every whitespace character and put the @ at the second position (after the whitespace).

I'm aware it's not optimal, but maybe I can help more if you clarify in a bit more detail what the regex is supposed to match and what it shouldn't.
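
The two checks described above could be sketched roughly as follows. This is only an illustration: the function name add_atsign_two_pass and the WORD constant are made up here, and the lookahead (?=\s|$) is an assumption added so that the trailing whitespace is not consumed and stays available for the next word.

```python
import re

# Character class and length limits taken from the question.
WORD = r'[a-zA-Z0-9_]{4,15}'

def add_atsign_two_pass(text):
    # Pass 1: a qualifying word at the very start of the string.
    text = re.sub(r'^(' + WORD + r')(?=\s|$)', r'@\1', text)
    # Pass 2: a qualifying word preceded by whitespace. The whitespace is
    # captured as group 1 and written back, so the @ lands after it.
    text = re.sub(r'(\s)(' + WORD + r')(?=\s|$)', r'\1@\2', text)
    return text

print(add_atsign_two_pass("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"))
```

Because the lookahead only inspects the following character without consuming it, adjacent words are no longer skipped.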

Answered By: Luca

Answer 2

You can restrict the match to words delimited by whitespace by adding (^|(?<=\s)) to the start and \s to the end of your first expression.

import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        # The lookbehind (?<=\s) checks for preceding whitespace without consuming it
        sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'@\1', str(sent))
    return sents

The result will be like this:

['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
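
One caveat: the pattern above requires a whitespace character after the word, so a qualifying word at the very end of the string would not be matched. A possible variant, sketched here with the made-up names TOKEN_RE and add_atsign_lookaround, uses negative lookarounds on both sides: (?<!\S) means "not preceded by a non-whitespace character" and (?!\S) means "not followed by one", which also covers the start and end of the string.

```python
import re

# Same character class and length limits as in the question, bounded on
# both sides by "no non-whitespace neighbour" assertions.
TOKEN_RE = re.compile(r'(?<!\S)[a-zA-Z0-9_]{4,15}(?!\S)')

def add_atsign_lookaround(sents):
    # \g<0> in the replacement stands for the whole match.
    return [TOKEN_RE.sub(r'@\g<0>', str(sent)) for sent in sents]

print(add_atsign_lookaround(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
```

Since nothing but the word itself is consumed, the original spacing between words is left untouched. This version returns a new list instead of modifying the input in place.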

Answered By: Aleksandr Golovaschenko

Answer 3

This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.

I think the best way to do this is to first split by spaces, and then add assertions to your regex that catch only an entire string:

import re

def add_atsign(sents):
    new_list = []
    for string in sents:
        # ^ and $ make the pattern match only if the whole word qualifies
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'@\1', w)
                        for w in string.split()))
    return new_list

mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
# returns ['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']

That is, we split, replace only if the entire word matches, and then rejoin.

By the way, your regex can be simplified to r'^(\w{4,15})$' (note that in Python 3, \w also matches non-ASCII letters and digits unless you pass re.ASCII):

def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'@\1', w)
                        for w in string.split()))
    return new_list
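
As a further sketch of the same split-and-rejoin idea, re.fullmatch can replace the explicit ^ and $ anchors, and the re.ASCII flag keeps \w equivalent to [a-zA-Z0-9_]. The names WORD_RE and add_atsign_fullmatch are hypothetical and only used for illustration.

```python
import re

# With re.ASCII, \w matches only [a-zA-Z0-9_], as in the original pattern.
WORD_RE = re.compile(r'\w{4,15}', re.ASCII)

def add_atsign_fullmatch(sents):
    result = []
    for sent in sents:
        words = str(sent).split()
        # fullmatch succeeds only if the entire word fits the pattern.
        result.append(' '.join('@' + w if WORD_RE.fullmatch(w) else w
                               for w in words))
    return result

print(add_atsign_fullmatch(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
```

Keep in mind that split() followed by ' '.join() collapses any run of whitespace (including tabs and newlines) into a single space, so this approach fits best when the exact spacing does not need to be preserved.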

Answered By: Josh Friedlander
