Splitting string using RegEx

Question:

I’m trying to use RegEx Tokenize to split the following string data:

"• General     The general responsibilities of the role are XYZ • Physical Demands     The Physical Demands of this role are XYZ • Education Requirements     The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience     5 years of experience is required"

I want to reach this as the final stage:

A header
• General The general responsibilities of the role are XYZ
• Physical Demands The Physical Demands of this role are XYZ
• Education Requirements The education requirements for this role are • Bachelor’s Degree • Appropriate Certification
• Experience 5 years of experience is required

I’ve had success with grouping it, and parsing it, but it’s not as dynamic as I’d like.

There is a pattern I want to split by: a bullet, then words, then multiple spaces, i.e. •.*?\s{3,}

NOTE: one of the categories uses bullet points within it (Education Requirements). This is the part that I find most problematic.

Any help would be greatly appreciated! Perhaps RegEx Tokenize isn’t the most dynamic either.

Asked By: Benjamin Stringer

||

Answers:

You might use:

•\s+[^\s•].*?\s{3,}.*?(?=•[^•\n]*?\s{3}|$)

Explanation

  • •\s+ Match • and 1+ whitespace chars
  • [^\s•].*? Match a non-whitespace char other than •, then match any character, as few as possible
  • \s{3,} Match 3 or more whitespace chars
  • .*? Match any character, as few as possible
  • (?= Positive lookahead, assert that to the right is
    • •[^•\n]*?\s{3} Match •, then as few as possible chars other than • or a newline, followed by 3 whitespace chars
    • | Or
    • $ End of string
  • ) Close the lookahead
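
The breakdown above can also be written directly into the pattern with Python's re.VERBOSE flag, which ignores whitespace outside character classes and allows comments. This is only a sketch for readability; the name VERBOSE_PATTERN and the short sample string are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical name; same pattern as above, spread over commented lines.
VERBOSE_PATTERN = re.compile(
    r"""
    •\s+                 # a bullet and 1+ whitespace chars
    [^\s•].*?            # a non-whitespace char other than a bullet, then as few chars as possible
    \s{3,}               # 3 or more whitespace chars (the gap after the heading)
    .*?                  # the section text, as few chars as possible
    (?=                  # lookahead: to the right is either
        •[^•\n]*?\s{3}   #   a bullet whose heading is followed by 3 whitespace chars
        |$               #   or the end of the string
    )
    """,
    re.VERBOSE,
)

# Illustrative sample: "• x" has no 3-space gap, so it stays inside the first section.
sample = "• A   one • x • B   two"
matches = VERBOSE_PATTERN.findall(sample)
print(matches)
```

Because the verbose form only changes how the pattern is laid out, it should return the same matches as the one-line version.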

See a regex101 demo and a Python demo

import re

s = "• General     The general responsibilities of the role are XYZ • Physical Demands     The Physical Demands of this role are XYZ • Education Requirements     The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience     5 years of experience is required"
pattern = r"•\s+[^\s•].*?\s{3,}.*?(?=•[^•\n]*?\s{3}|$)"
result = re.findall(pattern, s)
print(result)

Output

[
'• General     The general responsibilities of the role are XYZ ',
'• Physical Demands     The Physical Demands of this role are XYZ ',
"• Education Requirements     The education requirements for this role are • Bachelor's Degree • Appropriate Certification ",
'• Experience     5 years of experience is required'
]
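
If the heading and the text of each section are needed separately, one possible variation is to add two capture groups to the same pattern, so that re.findall returns (heading, text) tuples. This is a sketch under that assumption, not part of the original answer; SECTION and split_sections are hypothetical names.

```python
import re

# Same pattern as above, with group 1 around the heading and group 2 around the text.
SECTION = re.compile(r"•\s+([^\s•].*?)\s{3,}(.*?)(?=•[^•\n]*?\s{3}|$)")

def split_sections(text):
    """Return a list of (heading, text) pairs, with trailing whitespace stripped."""
    return [(heading, body.strip()) for heading, body in SECTION.findall(text)]

s = "• General     The general responsibilities of the role are XYZ • Physical Demands     The Physical Demands of this role are XYZ • Education Requirements     The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience     5 years of experience is required"

sections = split_sections(s)
for heading, body in sections:
    print(heading, "->", body)
```

The nested bullets under Education Requirements stay inside that section's text, because they are not followed by a run of 3 whitespace chars.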

Note that \s can also match a newline. If you don’t want to match newlines, you can use [^\S\n] instead.
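
A small sketch of the difference, using a made-up string that contains newlines: \s{3,} counts newlines towards the run of 3, while [^\S\n]{3,} only counts whitespace chars that are not newlines.

```python
import re

# Illustrative text: a run of "space, newline, newline, space", then a run of 3 spaces.
text = "a \n\n b   c"

with_newlines = re.findall(r"\s{3,}", text)          # newlines count as whitespace
without_newlines = re.findall(r"[^\S\n]{3,}", text)  # whitespace except newlines

print(with_newlines)
print(without_newlines)
```

The double negative [^\S\n] reads as "not a non-whitespace char and not a newline", which leaves spaces, tabs and other whitespace chars.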

Answered By: The fourth bird