Search start of the word using regular expression
Question:
How can I write a regular expression that finds all words starting with a specified string? For example:
a = "asasasa sasDRasas dr.klklkl DR.klklklkl Dr klklklkklkl"
Here I want to fetch all words that start with dr, ignoring case. Everything I have tried also matches dr when it appears in the middle of a word, not only at the start of a word.
Thanks in advance.
Answers:
You can use \b to find word boundaries, and the re.IGNORECASE flag to search case-insensitively.
import re
a = "asasasa sasDRasas dr.klklkl DR.klklklkl Dr klklklkklkl"
# \b anchors the match at a word boundary, so "sasDRasas" is skipped
for match in re.finditer(r'\bdr', a, re.IGNORECASE):
    print('Found match: "{0}" at position {1}'.format(match.group(0), match.start()))
This will output:
Found match: "dr" at position 18
Found match: "DR" at position 28
Found match: "Dr" at position 40
Here, the pattern \bdr matches dr, but only if it is found at the start of a word. This will also yield matches for strings like driving. If you only want to find dr as a whole word, use \bdr\b.
I use re.finditer() to scan through the search string and yield every match for dr in a loop. The re.IGNORECASE flag causes dr to also match DR, Dr and dR.
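To make the difference between the two patterns concrete, here is a small sketch. The sample sentence is made up for illustration, and the third pattern, \bdr\w*, is an extra variant (not part of the answer above) that returns the whole word rather than just the dr prefix.

```python
import re

# Hypothetical sample text, chosen to contain "driving" and an embedded "DR"
text = "Dr Who is driving with dr. Brown; the sasDRasas token is ignored"

# \bdr: "dr" at the start of any word, so "driving" matches too
starts = [m.group(0) for m in re.finditer(r"\bdr", text, re.IGNORECASE)]

# \bdr\b: "dr" only when it is a complete word
whole = [m.group(0) for m in re.finditer(r"\bdr\b", text, re.IGNORECASE)]

# \bdr\w*: the full word that begins with "dr"
words = re.findall(r"\bdr\w*", text, re.IGNORECASE)

print(starts)  # ['Dr', 'dr', 'dr']
print(whole)   # ['Dr', 'dr']
print(words)   # ['Dr', 'driving', 'dr']
```

In all three cases the DR inside sasDRasas is skipped, because there is no word boundary in front of it.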
@Ferdinand Beyer’s answer shows how to do it by regex. But you can easily achieve that with string functions:
>>> a
'asasasa sasDRasas dr.klklkl DR.klklklkl Dr klklklkklkl'
>>> import string
>>> cleaned = "".join(" " if i in string.punctuation else i for i in a)
>>> cleaned
'asasasa sasDRasas dr klklkl DR klklklkl Dr klklklkklkl'
>>> [word for word in cleaned.split() if word.lower().startswith("dr")]
['dr', 'DR', 'Dr']
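If you need this more than once, the same idea can be wrapped in a function. This is only a sketch: the name words_with_prefix is made up for illustration, and it uses str.translate instead of the join above to replace punctuation with spaces.

```python
import string

def words_with_prefix(text, prefix):
    """Return the words in text that start with prefix, ignoring case.

    Hypothetical helper: punctuation is treated as a word separator.
    """
    # Map every punctuation character to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.translate(table)
    return [w for w in cleaned.split() if w.lower().startswith(prefix.lower())]

a = "asasasa sasDRasas dr.klklkl DR.klklklkl Dr klklklkklkl"
print(words_with_prefix(a, "dr"))  # ['dr', 'DR', 'Dr']
```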
>>> string_to_search_in
'this a a dr.seuse dr.brown dr. oz dr noone'
>>> re.compile(r'\bdr\.?\s*?\w+', re.IGNORECASE).findall(string_to_search_in)
['dr.seuse', 'dr.brown', 'dr. oz', 'dr noone']
Yet another solution.
The expression below finds every word in a string that either equals the text held in a variable or starts with it.
import re
txt = "this a a dr.seuse dr.brown dr. oz dr noone"
suggtxt = "dr."
w_regex = r"\b" + re.escape(suggtxt) + r"+\S*"
x = re.findall(w_regex, txt, re.IGNORECASE)
print(x)
Output:
['dr.seuse', 'dr.brown', 'dr.']
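The re.escape() call matters here because the search text contains a dot, which is a regex metacharacter. The following sketch compares the escaped and unescaped patterns on the same input; the variable names are illustrative, and the pattern is a simplified form of the one above.

```python
import re

txt = "this a a dr.seuse dr.brown dr. oz dr noone"
prefix = "dr."

# Escaped: the dot must be a literal ".", so "dr noone" is not matched
escaped = re.findall(r"\b" + re.escape(prefix) + r"\S*", txt, re.IGNORECASE)

# Unescaped: the dot matches any character, including the space in "dr noone"
unescaped = re.findall(r"\b" + prefix + r"\S*", txt, re.IGNORECASE)

print(escaped)    # ['dr.seuse', 'dr.brown', 'dr.']
print(unescaped)  # ['dr.seuse', 'dr.brown', 'dr.', 'dr noone']
```

Note that \b assumes the search text begins with a word character; for a prefix that starts with punctuation, a different anchor would be needed.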