How to get the regex for words split by underscores?

Question:

I have a problem in my project.
WHAT DO I WANT TO DO?
I need to get every word whose characters are separated by underscores, in Python:

text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"
re.findall(r'?????' , text)

I need to get:

['M_o_h_a_m_m_e_d','_1_5_y_o','i___a_m']

NOTE: if the word starts with a number (like 15yo), there will be an underscore before the leading number.

WRONG : M_o_h_a_m_m_e_d , M_o_h_a_m_m_e_d

WRONG : 1_5_y_o

Asked By: Zaky202


Answers:

You can use a regular expression pattern:

import re

text = "i.am - M_o_h_a_m_m_e_d - and - _1_5y_o - name - moh_mmed - 2_8_j - i___a_m"
result = re.findall(r"(?<!\d)_\w+", text)

print(result)
Answered By: Alon Alush

Here’s a non-regex approach:

text = "i.am - M_o_h_a_m_m_e_d - and - _1_5y_o - name - moh_mmed - 2_8_j - i___a_m"
result = []

for word in text.split(' - '):
    if word[0].isdigit() or word[0] == '_' and not word[1].isdigit():
        continue
    for char in word.split('_'):
        if len(char) > 1 and not (char[0].isdigit() ^ char[1].isdigit()):
            break
    else:
        result.append(word)

print(result)

Output:

['M_o_h_a_m_m_e_d', '_1_5y_o', 'i___a_m']

Assuming the rules are:

  1. The characters of a word are separated by underscores, except that
  2. a digit and a letter may sit next to each other without an underscore (as in 5y), and
  3. a word that starts with a digit is prefixed with an underscore.

EDIT: with the updated input from your comment, the code simplifies:

text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"

result = [
    word
    for word in text.split(' - ')
    if not (
        word[0].isdigit()
        or (word[0] == '_' and not word[1].isdigit())
        or any(len(char) > 1 for char in word.split('_'))
    )
]

print(result)

Output:

['M_o_h_a_m_m_e_d', '_1_5_y_o', 'i___a_m']
Answered By: B Remmelzwaal

If you want a pure regex solution, you can use the following regex:

\b(?:(?:[a-zA-Z]_+)+[a-zA-Z0-9]|(?:_+[a-zA-Z0-9])+)\b

Explanation:

This regex is an OR of two sub-regexes, explained below:

  • (?:[a-zA-Z]_+)+[a-zA-Z0-9] : a single letter followed by one or more underscores, that group repeated one or more times, and finally a letter or digit to end the word
  • (?:_+[a-zA-Z0-9])+ : if the word starts with an underscore, one or more underscores followed by a letter or digit, that group repeated one or more times

The two sub-regexes are joined with | and wrapped in a non-capturing group, (?: ... ), which gives the overall regex shown above.
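As a rough sketch of how this could be used from Python: the snippet below applies the same pattern to the string from the question, with the word-boundary anchors written as \b. The name PATTERN and the re.VERBOSE layout are my own choices for readability, not part of the answer. Because every group is non-capturing, re.findall returns the whole matches.

```python
import re

# Input string taken from the question.
text = "i.am - M_o_h_a_m_m_e_d - and - _1_5_y_o - name - moh_mmed - 2_8_j - i___a_m"

# Same pattern as above, laid out with re.VERBOSE so each alternative
# can carry a comment. PATTERN is an illustrative name only.
PATTERN = re.compile(
    r"""
    \b
    (?:
        (?:[a-zA-Z]_+)+ [a-zA-Z0-9]   # letter then underscores, repeated; ends in a letter or digit
      |
        (?:_+[a-zA-Z0-9])+            # underscores then a letter or digit, repeated
    )
    \b
    """,
    re.VERBOSE,
)

result = PATTERN.findall(text)
print(result)  # ['M_o_h_a_m_m_e_d', '_1_5_y_o', 'i___a_m']
```

The \b anchors are what keep moh_mmed and 2_8_j out: neither can be matched in full from its first character, and no match can start in the middle of a word.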

