Regular Expression to split text based on different patterns (within a single expression)
Question:
I have some patterns that detect question headers, and I split the text on them. I am working under these assumptions:
- Every pattern starts with a \n
- Every pattern ends with \s+
The patterns I need to match look like this:
<NUM>.
Q <NUM>.
Q <NUM>
<Q.NUM.>
<NUM>
Question <NUM>
<Example>
Problem <NUM>
Problem:
<Alphabet><Number>.
<EXAMPLE>
Example <NUM>
Someone suggested the regex below (try the demo):
((Q|Question|Problem:?|Example|EXAMPLE)\.? ?\d+\.? ?|(Question|Problem:?|Example|EXAMPLE) ?)
but it also matches in the middle of a line, which is a problem for me because text such as Q. or Example. 2 can appear mid-string too, and it does not capture <NUM>.
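To make the complaint concrete, here is a small sketch. It assumes the dots in the suggested regex were originally escaped (\.) and uses a made-up two-line sample, so treat it as an illustration rather than the exact demo. Without a line-start anchor the pattern hits a reference in the middle of a sentence, and since both alternatives require a keyword, a bare number header is never matched.

```python
import re

# Suggested regex, with escapes restored (an assumption about the original).
SUGGESTED = re.compile(
    r"((Q|Question|Problem:?|Example|EXAMPLE)\.? ?\d+\.? ?"
    r"|(Question|Problem:?|Example|EXAMPLE) ?)"
)

# Hypothetical sample: a real header ("1.") and a mid-line reference.
sample = "1. Solve this\nAs shown in Example. 2 above\n"

# group(0) is the whole match; the header "1." is missing from the result.
matches = [m.group(0) for m in SUGGESTED.finditer(sample)]
print(matches)
```

The only hit is the mid-line reference, which is the opposite of what the split needs.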
The list is in priority order, so the best I could come up with is to build one expression per pattern and loop over them by priority, for example:
import re

QUESTIONS = [
    re.compile(r"\n\d+\."),
    re.compile(r"\nQ\.\s*\d+\."),
    re.compile(r"\nExample\.\s*\d+\."),
]
but this is very inefficient. How can I combine these into one expression?
Here is the test string:
'TEStlabZ\nEDULABZ\nINTERNATIONAL\nLOGARITHMS AND INDICES\n\nQ.1. (A) Convert each of the following to logarithmic form.\n(i) \\( 5^{2}=25 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\)\n(iv) \\( 6^{0}=1 \\)\n(v) \\( 10^{-2}=0.01 \\) (vi) \\( 4^{-1}=\\frac{1}{4} \\)\nAns. We know that \\( a^{b}=x \\Rightarrow b=\\log _{a} x \\)\n(i) \\( 5^{2}=25 \\quad \\therefore \\log _{5} 25=2 \\)\n(ii) \\( 3^{-3}=\\frac{1}{27} \\therefore \\log _{3}\\left(\\frac{1}{27}\\right)=-3 \\)\n(iii) \\( (64)^{\\frac{1}{3}}=4 \\therefore \\log _{64} 4=\\frac{1}{3} \\)\n(iv) \\( 6^{0}=1 \\quad \\therefore \\log _{6} 1=0 \\)\n(v) \\( 10^{-2}=0.01 \\therefore \\log _{10}(0.01)=-2 \\)\n(vi) \\( 4^{-1}=\\frac{1}{4} \\therefore \\log _{4}\\left(\\frac{1}{4}\\right)=-1 \\)\nQ.1. (B) Convert each of the following to exponential form.\n(i) \\( \\log _{3} 81=4 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\) (vi) \\( \\log _{a} 1=0 \\)\nAns.\n(i) \\( \\log _{3} 81=4 \\quad \\therefore 3^{4}=81 \\)\n(ii) \\( \\log _{8} 4=\\frac{2}{3} \\quad \\therefore 8^{\\frac{2}{3}}=4 \\)\n(iii) \\( \\log _{2} \\frac{1}{8}=-3 \\quad \\therefore \\quad 2^{-3}=\\frac{1}{8} \\)\n(iv) \\( \\log _{10}(0.01)=-2 \\quad \\therefore \\quad 10^{-2}=0.01 \\)\n(v) \\( \\log _{5}\\left(\\frac{1}{5}\\right)=-1 \\quad \\therefore \\quad 5^{-1}=\\frac{1}{5} \\)\n(vi) \\( \\log _{a} 1=0 \\)\n\\( \\therefore a^{0}=1 \\)\nMath Class IX\n1\nQuestion Bank'
Answers:
No shame in just doing the dumb solution:
^(\d+\.|Q \d+\.|Q \d+|Q\.\d+\.|\d+|Question \d+|Example( \d+)?|Problem \d+|Problem:|[A-Z]\d\.|EXAMPLE)\s+
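A minimal sketch of how this alternation might be used for the actual splitting, assuming Python's re module with re.MULTILINE so that ^ matches at every line start. The helper name split_on_headers and the sample text are made up for illustration.

```python
import re

# The alternation above; re.MULTILINE makes ^ match at each line start.
HEADER = re.compile(
    r"^(\d+\.|Q \d+\.|Q \d+|Q\.\d+\.|\d+|Question \d+|Example( \d+)?"
    r"|Problem \d+|Problem:|[A-Z]\d\.|EXAMPLE)\s+",
    re.MULTILINE,
)

def split_on_headers(text):
    """Return (header, body) pairs; text before the first header is dropped."""
    matches = list(HEADER.finditer(text))
    chunks = []
    # Pair each header with the next one (or None for the last header).
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        chunks.append((current.group(1), text[current.end():end].strip()))
    return chunks

# Hypothetical sample: "see Q.2." is mid-line and must not start a chunk.
sample = (
    "Title page\n"
    "Q.1. Convert to log form.\n"
    "see Q.2. inline\n"
    "Q.2. Convert to exponential form.\n"
    "Ans. done\n"
)
pairs = split_on_headers(sample)
print(pairs)
```

Because alternatives are tried left to right, the order inside the group doubles as the priority order, so no loop over separate compiled patterns is needed.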
You can use
(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)
See the regex demo.
Details:
- (?m)^ – start of a line (the m flag lets ^ match at any line start position)
- (?!$) – no end of line allowed at the same location (i.e. no empty-line match allowed)
- (?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)? – an optional sequence of
  - ((?i:Question|Problem:?|Example)|[A-Z]) – Group 1: Question, Problem, Problem: or Example, case-insensitively, or an uppercase letter
  - [. ]? – an optional space or .
- (\d+[. ]?)? – an optional capturing group with ID 2 matching one or more digits and then an optional . or space
- (?=\s) – a positive lookahead that requires a whitespace char immediately to the right of the current location.
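The breakdown above can be checked with a short sketch. It assumes Python 3.6 or later (needed for the scoped (?i:...) flag) and a made-up sample covering a few of the listed header shapes; the names PATTERN, headers and groups are illustrative.

```python
import re

# (?m) is a global inline flag; (?i:...) applies only inside its group.
PATTERN = re.compile(
    r"(?m)^(?!$)(?:((?i:Question|Problem:?|Example)|[A-Z])[. ]?)?(\d+[. ]?)?(?=\s)"
)

# Hypothetical sample: line 3 has a mid-line "Example. 2" that must be ignored.
sample = (
    "Title page\n"
    "Q.1. (A) Convert\n"
    "Ans. see Example. 2 here\n"
    "example 3\n"
    "Problem: solve\n"
    "7\n"
)

found = list(PATTERN.finditer(sample))
headers = [m.group(0) for m in found]             # full header text
groups = [(m.group(1), m.group(2)) for m in found]  # (keyword/letter, number)
print(headers)
print(groups)
```

Group 1 is None for a bare number and Group 2 is None for a keyword without a number, so check for None before using either group.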