How to get the regex for words split by underscores?
Question:
I have a problem in my project.
What I want to do:
I need to get all words whose characters are separated by underscores, in Python:
text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"
re.findall(r'?????' , text)
I need to get:
['M_o_h_a_m_m_e_d','_1_5_y_o','i___a_m']
NOTE: if a word starts with a number (like 15yo), there will be an underscore before the leading number.
WRONG : M_o_h_a_m_m_e_d , M_o_h_a_m_m_e_d
WRONG : 1_5_y_o
Answers:
You can use a regular expression pattern:
import re
text = "i.am - M_o_h_a_m_m_e_d - and - _1_5y_o - name - moh_mmed - 2_8_j - i___a_m"
result = re.findall(r"(?<!\d)_\w+", text)
print(result)
Here’s a non-regex approach:
text = "i.am - M_o_h_a_m_m_e_d - and - _1_5y_o - name - moh_mmed - 2_8_j - i___a_m"
result = []
for word in text.split(' - '):
    if word[0].isdigit() or word[0] == '_' and not word[1].isdigit():
        continue
    for char in word.split('_'):
        if len(char) > 1 and not (char[0].isdigit() ^ char[1].isdigit()):
            break
    else:
        result.append(word)
print(result)
Output:
['M_o_h_a_m_m_e_d', '_1_5y_o', 'i___a_m']
Assuming the rules are:
- The characters of a word are separated by underscores, except:
- If two adjacent characters are a digit and a letter, and
- A digit in the first position is preceded by an underscore.
EDIT: with the updated input from your comment, the code simplifies:
text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"
result = [word for word in text.split(' - ') if not(word[0].isdigit() or word[0] == '_' and not word[1].isdigit() or any(len(char) > 1 for char in word.split('_')))]
print(result)
Output:
['M_o_h_a_m_m_e_d', '_1_5_y_o', 'i___a_m']
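The one-line condition above is dense, so here is a minimal sketch that unpacks it into a named helper. The name `is_spelled_out` is hypothetical (it is not part of the answer), and the sketch assumes words are separated by ' - ' as in the sample text.

```python
def is_spelled_out(word):
    """Mirror the one-line filter: keep words whose underscore-separated
    pieces are each at most one character long."""
    if not word:
        return False
    # A word may not start with a bare digit.
    if word[0].isdigit():
        return False
    # A leading underscore is only accepted directly before a digit.
    # word[1:2] avoids an IndexError when the word is just "_".
    if word[0] == '_' and not word[1:2].isdigit():
        return False
    # Every piece between underscores must be empty or a single character.
    # Like the one-liner, this also accepts a lone character with no underscore.
    return all(len(piece) <= 1 for piece in word.split('_'))


text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"
result = [word for word in text.split(' - ') if is_spelled_out(word)]
print(result)
```

For the sample text this gives the same list as the one-liner.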
In case you want a pure regex solution, you can use the following regex:
\b(?:(?:[a-zA-Z]_+)+[a-zA-Z0-9]|(?:_+[a-zA-Z0-9])+)\b
Explanation:
This regex is an OR of two sub-patterns, explained below:
(?:[a-zA-Z]_+)+[a-zA-Z0-9]
- Start with a single letter followed by one or more underscores, repeat that unit one or more times, and end the word with a letter or digit.
(?:_+[a-zA-Z0-9])+
- If the word starts with an underscore, each run of underscores must be followed by a letter or digit, and that unit is repeated one or more times.
The two sub-patterns are joined with |, wrapped in a non-capturing group (?: ... ), and anchored with \b word boundaries on both sides, which gives the overall regex shown above.
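To see the pure regex in action, here is a short sketch that applies it with `re.findall`. It writes the same pattern in `re.VERBOSE` mode so each alternative can carry a comment; the names `PATTERN` and `matches` are only for illustration.

```python
import re

# Same pattern as above, split over several lines with re.VERBOSE
# (whitespace and "#" comments inside the pattern are ignored).
PATTERN = re.compile(r"""
    \b
    (?:
        (?:[a-zA-Z]_+)+ [a-zA-Z0-9]   # letter + underscores, repeated, then a final letter or digit
      |
        (?:_+[a-zA-Z0-9])+            # underscores + letter or digit, repeated
    )
    \b
""", re.VERBOSE)

text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"

# All groups are non-capturing, so findall returns the full matches.
matches = PATTERN.findall(text)
print(matches)  # ['M_o_h_a_m_m_e_d', '_1_5_y_o', 'i___a_m']
```

Because `_` counts as a word character, `\b` cannot match inside `moh_mmed` or `2_8_j`, so those words are rejected as a whole rather than partially matched.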