Using regular expressions to match and replace
Question:
There is a list of strings A, each of which somehow matches an entry in another list of strings B. I want to replace each string in A with the matching string from B using a regular expression. However, I am not getting the correct result.
The solution should be A == ["Yogesh","Numita","Hero","Yogesh"].
import re
A = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar"]
B=["Yogesh","Numita","Hero"]
for i in A:
    for j in B:
        replaced=re.sub('i','j',i)
        print(replaced)
Answers:
This one works for me:
lst=[]
for a in A:
    lst.append([b for b in B if b.lower() in a.lower()][0])
For each item of A, this returns the element of B that occurs in it. The words must be lowercased before comparing. The [0] is added to get a string instead of a list from the list comprehension.
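One caveat with the snippet above: the [0] raises an IndexError when no entry of B occurs in an item of A. The following is a minimal sketch of a variant that falls back to a default instead, using next() with a default value. The helper name first_match and the extra non-matching item "hrithikroshan" are illustrative assumptions, not part of the original answer.

```python
A = ["yogeshgovindan", "TNumita", "Herohonda", "Yogeshkumar", "hrithikroshan"]
B = ["Yogesh", "Numita", "Hero"]

def first_match(text, terms, default=None):
    """Return the first term found in text (case-insensitively), else default."""
    lowered = text.lower()
    # next() stops at the first hit and yields `default` if there is none
    return next((t for t in terms if t.lower() in lowered), default)

lst = [first_match(a, B) for a in A]
print(lst)  # ['Yogesh', 'Numita', 'Hero', 'Yogesh', None]
```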
If looping over B, you don’t need a regular expression; you can simply use membership testing.
A regex might result in better performance, as membership testing will scan each string in A for every string in B, resulting in O(len(A) * len(B)) performance.
As long as the individual terms don’t contain any metacharacters and can appear in any context, the simplest way to form the regex is to join the entries of B with the alternation operator:
reTerms = re.compile('|'.join(B), re.I)
However, to be safe, the entries should first be escaped, in case any contains a metacharacter:
# map-based
reTerms = re.compile('|'.join(map(re.escape, B)), re.I)
# comprehension-based
reTerms = re.compile('|'.join([re.escape(b) for b in B]), re.I)
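To see why escaping matters, here is a small sketch with a made-up term list (the term "1.5" and the test strings are assumptions chosen for illustration). Unescaped, the "." acts as a wildcard and produces a false match.

```python
import re

terms = ["1.5", "Hero"]  # hypothetical terms; "1.5" contains the metacharacter "."

unescaped = re.compile('|'.join(terms), re.I)
escaped = re.compile('|'.join(map(re.escape, terms)), re.I)

# Unescaped, "." matches any character, so the pattern "1.5" also matches "125".
false_hit = unescaped.search("version 125")
# Escaped, only the literal text "1.5" matches.
no_hit = escaped.search("version 125")
real_hit = escaped.search("version 1.5")

print(false_hit.group(0), no_hit, real_hit.group(0))  # 125 None 1.5
```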
If there are any restrictions on the context the terms appear in, sub-patterns for the restrictions would need to be prepended and appended to the pattern. For example, if the terms must appear as full words:
reTerms = re.compile(rf"\b(?:{'|'.join(map(re.escape, B))})\b", re.I)
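As an illustration of the full-word restriction, the sketch below wraps the alternation in \b word boundaries and tries it on two inputs. The name reWords and the sentence "my hero rides" are assumptions made up for the example. Note that with this restriction the items of A from the question no longer match, since the terms there are only parts of longer words.

```python
import re

B = ["Yogesh", "Numita", "Hero"]
# \b requires a word boundary on both sides of the matched term
reWords = re.compile(rf"\b(?:{'|'.join(map(re.escape, B))})\b", re.I)

inside_word = reWords.search("yogeshgovindan")  # term is only part of a word
full_word = reWords.search("my hero rides")     # term stands alone

print(inside_word)         # None
print(full_word.group(0))  # hero
```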
This regex can be applied to each item of A to get the matching text:
replaced = [reTerms.search(name).group(0) for name in A]
# result: ['yogesh', 'Numita', 'Hero', 'Yogesh']
Since the terms in the regex are straight string matches, the content will be correct, but the case may not. This could be corrected by a normalization step, passing the matched text through a dict:
normed = {term.lower():term for term in B}
replaced = [normed[reTerms.search(name).group(0).lower()] for name in A]
# result: ['Yogesh', 'Numita', 'Hero', 'Yogesh']
One issue remains: what if an item of A doesn’t match? Then reTerms.search returns None, which doesn’t have a group attribute. If None-propagating attribute access were added to Python (such as suggested by PEP 505), this could be easily addressed with it:
# hypothetical PEP 505 syntax; `?.` is not valid in current Python
names = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar", "hrithikroshan"]
normed[None] = None
replaced = [normed[reTerms.search(name)?.group(0).lower()] for name in names]
In the absence of such a feature, there are various approaches, such as using a ternary expression and walrus assignment. In the sample below, a list is used as a stand-in to provide a default value for the match:
import re
names = ["yogeshgovindan","TNumita","Herohonda","Yogeshkumar", "hrithikroshan"]
terms = ["Yogesh","Numita","Hero"]
normed = {term.lower():term for term in terms}
normed[''] = None
reTerms = re.compile('|'.join(map(re.escape, terms)), re.I)
# index may need to be changed if `reTerms` includes any context
replaced = [normed[(reTerms.search(name) or [''])[0].lower()] for name in names]
# result: ['Yogesh', 'Numita', 'Hero', 'Yogesh', None]
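The ternary-expression-and-walrus alternative mentioned above could be sketched roughly as follows. This assumes Python 3.8 or later (for the := operator); the variable name m is an arbitrary choice. It avoids the extra '' entry in the dict by handling the non-matching case directly.

```python
import re

names = ["yogeshgovindan", "TNumita", "Herohonda", "Yogeshkumar", "hrithikroshan"]
terms = ["Yogesh", "Numita", "Hero"]
normed = {term.lower(): term for term in terms}
reTerms = re.compile('|'.join(map(re.escape, terms)), re.I)

# The condition is evaluated first, so `m` is bound before `m.group(0)` runs;
# names without a match fall through to None.
replaced = [
    normed[m.group(0).lower()] if (m := reTerms.search(name)) else None
    for name in names
]
print(replaced)  # ['Yogesh', 'Numita', 'Hero', 'Yogesh', None]
```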