Find the change level and the change text in .rst news files using regex
Question:
What I want
I’m trying to work out how to use a regex to find two groups in RST news files. I want to get the change level as well as the change text:
- hence I want a regex of the form (changelevel): (change text)
- I was thinking about something like (changelevel): (anything until the next change level)
For instance, given the following .rst file:
* Major: This is a **Major** change
* Minnor: This is is a minor change with a typo
* Patch: This
is a multiline
patch
This should return the following matches, with group 1 and group 2 as shown:
Match 1:
"* Major: This is a **Major** change"
"* Major: "
"This is a **Major** change"
Match 2:
"* Patch: This\nis a multiline\n patch"
"* Patch: "
"This\nis a multiline\n patch"
What I need help with
I cannot write a regex that handles both multiline entries and asterisks inside the change text.
I tried the following logic:
- Match the change level
^(\*\s+(\w+):\s)
- Match anything, with the "dot matches newline" option turned on
.*
- Negative lookahead until I match the next change level
(?!^(\*\s+(\w+):\s))
- I ended up with
^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))
but .* seems to just match everything that follows instead of stopping at the next change level
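A minimal sketch of why this attempt fails, assuming Python's re module and the sample text from above. The .* is wrapped in an extra capture group here (an addition for illustration, not part of the original attempt) so that what it consumed can be inspected. With "dot matches newline" on, the greedy .* runs to the end of the text, and the negative lookahead is only checked once, at the position where .* stopped.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    " patch"
)

# The attempted pattern; group 3 (added for illustration) wraps the greedy ".*".
attempt = re.compile(
    r"^(\*\s+(\w+):\s)(.*)(?!^(\*\s+(\w+):\s))",
    re.MULTILINE | re.DOTALL,
)

m = attempt.search(text)
rest = m.group(3)

# The lookahead is tested only at the end of the text, where it trivially
# succeeds, so ".*" keeps everything after the first "* Major: " prefix.
print(repr(m.group(1)))
print(repr(rest))
print(len(attempt.findall(text)))  # a single match covering the whole text
```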
What works
I managed to get the first group working with the following regex:
- beginning of the line
- star in front
- then whitespace
- a word
- colon
- whitespace
^(\*\s+(\w+):\s)
Answers:
re.findall(r'(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))', text)
- Use a newline (\n) followed by *, or the end of the string ($), as an anchor.
- Group 1: a literal * followed by zero or more whitespace characters (\s*), one or more word characters (\w+), a literal : and zero or more whitespace characters (\s*).
- Group 2: match everything non-greedily (*?) up to \n\*\s*\w+:\s* (like Group 1) or $.
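A short runnable sketch of this approach, assuming the sample text from the question. The variable names are illustrative only. Note that the pattern is not anchored to the start of a line, and that $ is used without re.MULTILINE, so it only matches at the end of the text.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    " patch"
)

# Group 1: the "* Level: " prefix.
# Group 2: lazily consume anything (including newlines) until the next
# "\n* Level:" prefix or the end of the text comes up.
pattern = r"(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))"

matches = re.findall(pattern, text)
for prefix, body in matches:
    print(repr(prefix), repr(body))
```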
You are almost there. You can write the pattern using the lookahead: match a newline, assert that the next line does not start a new change level, and if the assertion succeeds, match the whole line.
^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)
Explanation
^
Start of a line (with re.MULTILINE)
(
Capture group 1
\*\s+\w+:\s
Match *, 1+ whitespace chars, 1+ word chars, : and a whitespace char
)
Close group 1
(
Capture group 2
.*
Match the rest of the line
(?:
Non-capture group to repeat as a whole
\n
Match a newline
(?!\*\s+\w+:\s)
Negative lookahead, asserting that the starting pattern does not occur here
.*
Match the whole line
)*
Close the non-capture group and optionally repeat it to match all lines
)
Close group 2
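The same breakdown can be kept next to the pattern itself. The following is a sketch using re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the sample text is the one from the question, and the variable names are illustrative.

```python
import re

pattern = re.compile(
    r"""
    ^                      # start of a line (needs re.MULTILINE)
    (\*\s+\w+:\s)          # group 1: the "* Level: " prefix
    (                      # group 2: the change text
      .*                   #   rest of the first line
      (?:                  #   then, zero or more times:
        \n                 #     a newline
        (?!\*\s+\w+:\s)    #     not followed by another "* Level: " prefix
        .*                 #     the whole continuation line
      )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

text = (
    "* Major: This is a **Major** change\n"
    "* Patch: This\n"
    "is a multiline\n"
    " patch"
)

found = pattern.findall(text)
print(found)
```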
See a regex demo and a Python demo.
Example code:
import re
pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"
s = ("* Major: This is a **Major** change\n"
     "* Minnor: This is is a minor change with a typo\n"
     "* Patch: This\n"
     "is a multiline\n"
     " patch")
result = re.findall(pattern, s, re.MULTILINE)
print(result)
Output
[('* Major: ', 'This is a **Major** change'), ('* Minnor: ', 'This is is a minor change with a typo'), ('* Patch: ', 'This\nis a multiline\n patch')]
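If only the level name and a single-line text are needed, the result can be post-processed. The following is a possible variation, not part of the answer above: it uses named groups (level and text are names chosen here), captures only the word before the colon, and joins continuation lines with single spaces. The helper parse_news is hypothetical.

```python
import re

# Same structure as the pattern above, but with named groups and with
# only the level word captured instead of the whole "* Level: " prefix.
ENTRY = re.compile(
    r"^\*\s+(?P<level>\w+):\s(?P<text>.*(?:\n(?!\*\s+\w+:\s).*)*)",
    re.MULTILINE,
)


def parse_news(source):
    """Return (level, text) tuples; whitespace runs in text collapse to one space."""
    entries = []
    for m in ENTRY.finditer(source):
        text = " ".join(m.group("text").split())
        entries.append((m.group("level"), text))
    return entries


sample = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    " patch"
)

for level, text in parse_news(sample):
    print(level, "->", text)
```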