How to prevent a regex from matching substrings of words?
Question:
I have a regex in Python and I want to prevent it from matching substrings. I want to add ‘@’ at the beginning of words that consist of alphanumeric and _ characters and are 4 to 15 characters long. But it also matches substrings of larger words. I have this function:
import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'@\1', str(sent))
    return sents
Here is an example:
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the output is:
['@ali_s @ali_t @ali_u @aabs:/t.co/@kMMALke2l9']
As you can see, it puts ‘@’ at the beginning of ‘aabs’ and ‘kMMALke2l9’, which is wrong.
I tried to edit the code as below:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'@\1', str(sent))
    return sents
But then the result becomes:
['@ali_s ali_t@ ali_u aabs:/t.co/kMMALke2l9']
As you can see, the replacements are wrong.
The correct result I expect is:
"@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9"
Could anyone help?
Thanks
Answers:
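I am not sure what you are trying to accomplish, but the reason the @ ends up in the wrong places is that once you added \s or ^ to the regex, the whitespace became part of the match, so the @ is inserted before the whitespace. The trailing \s also consumes the space that the next word would need for its own leading \s, which is why ali_t is skipped.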
You could split it into two cases:
- check at the beginning of the string and put the @ at the first position, and
- check after every whitespace character and put the @ at the second position.
I'm aware it's not optimal, but maybe I can help more if you clarify in a bit more detail what the regex is supposed to match and what it shouldn't.
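The point about whitespace becoming part of the match can be illustrated with a small sketch. It assumes the leading whitespace is captured in its own group and written back before the @, and that the trailing boundary is only checked with a lookahead so it is not consumed. The name add_atsign_keep_space is made up for this example and is not from the question.

```python
import re

# Hypothetical helper, not part of the original post.
# Group 1 holds the leading whitespace (or the empty match at the start of
# the string), group 2 holds the word itself. The lookahead (?=\s|$) checks
# the trailing boundary without consuming it, so the following word can
# still use that whitespace as its own leading boundary.
PATTERN = re.compile(r'(^|\s)([a-zA-Z0-9_]{4,15})(?=\s|$)')

def add_atsign_keep_space(sents):
    # Write the whitespace back first, then the '@', then the word.
    return [PATTERN.sub(r'\1@\2', str(sent)) for sent in sents]

print(add_atsign_keep_space(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
```

Since the replacement puts group 1 before the @, the sign lands directly in front of the word rather than in front of the space.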
You can restrict the match to whitespace-separated words by adding (?<=\s) to the start and \s to the end of your first expression.
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'@\1', str(sent))
    return sents
The result will be like this:
['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
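Because the pattern above ends with a whitespace match, a qualifying word at the very end of the string would be left alone. A possible variant, shown here only as a sketch, uses zero-width assertions on both sides so that nothing but the word is consumed. The name add_atsign_lookaround is an assumption made for this example.

```python
import re

# Hypothetical variant, not part of the original answer.
# (?<!\S) : the word is not preceded by a non-whitespace character,
#           i.e. it starts at the beginning of the string or after whitespace.
# (?!\S)  : the word is not followed by a non-whitespace character,
#           i.e. it ends at the end of the string or before whitespace.
WORD = re.compile(r'(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)')

def add_atsign_lookaround(sents):
    # Returns a new list instead of modifying the input in place.
    return [WORD.sub(r'@\1', str(sent)) for sent in sents]

print(add_atsign_lookaround(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
print(add_atsign_lookaround(["abc ali_s  tail_word"]))
```

Both assertions are zero-width, so the original spacing is kept and a word at the end of the string is handled the same way as any other.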
This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.
I think the best way to do this is to first split by spaces, and then add assertions to your regex that catch only an entire string:
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'@\1', w)
                                 for w in string.split()))
    return new_list
mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
Output:
['@ali_s @ali_t @ali_u aabs:/t.co/kMMALke2l9']
That is, we split, then replace only if the entire word matches, then rejoin.
By the way, your regex can be simplified to r'^(\w{4,15})$' (in Python 3, \w also matches non-ASCII letters and digits unless re.ASCII is passed):
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'@\1', w)
                                 for w in string.split()))
    return new_list
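One side effect of splitting with split() and rejoining with ' ' is that runs of whitespace (tabs, double spaces) are collapsed into single spaces. If that matters, a possible sketch of the same split-and-check idea that keeps the original spacing is shown below. The name add_atsign_tokens and the use of re.fullmatch are choices made for this example, not part of the answer above.

```python
import re

# Hypothetical variant of the split-then-match approach.
# re.ASCII keeps \w limited to [a-zA-Z0-9_], as in the question's character class.
TOKEN = re.compile(r'\w{4,15}', re.ASCII)

def add_atsign_tokens(sents):
    result = []
    for sent in sents:
        # Splitting on a captured group keeps the whitespace runs in the
        # resulting list, so joining with '' restores the original spacing.
        parts = re.split(r'(\s+)', str(sent))
        # fullmatch succeeds only if the whole token fits the pattern,
        # so whitespace runs and longer tokens are passed through unchanged.
        result.append(''.join('@' + p if TOKEN.fullmatch(p) else p
                              for p in parts))
    return result

print(add_atsign_tokens(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
```

Here fullmatch plays the role of the ^ and $ anchors in the pattern above.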