Find the change level and the change text in .rst news files using regex

Question

What I want

I'm trying to work out how to use a regex to find two groups in RST news files. I want to get the change level as well as the change text, so I want a regex of the form (changelevel): (change text).
I was thinking about something like (changelevel): (anything until the next change level).

For instance, the following .rst file:

* Major: This is a **Major** change
* Minnor: This is is a minor change with a typo
* Patch: This
is a multiline
  patch

should return the full match, group 1 and group 2 as follows:

Match 1:

"* Major: This is a **Major** change"
"* Major: "
"This is a **Major** change"

Match 2:

"* Patch: This\nis a multiline\n  patch"
"* Patch: "
"This\nis a multiline\n  patch"

What I need help with

I cannot write a regex that handles multiple lines and the asterisks present in the change text.
I tried the following logic:

Match the change level: ^(\*\s+(\w+):\s)
Match anything, with the "dot matches newline" option turned on: .*
Negative lookahead until I match the next change level: (?!^(\*\s+(\w+):\s))

I ended up with ^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s)) but .* seems to just match everything instead of stopping at the next change level
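
For reference, the behaviour described above can be reproduced with a short sketch. It assumes Python's re module with re.MULTILINE and re.DOTALL standing in for the "dot matches newline" option; the variable names are only illustrative. With that option on, the greedy .* runs to the end of the input, and the negative lookahead is then checked at the very end, where nothing follows, so it succeeds trivially.

```python
import re

# Sample input from the question.
text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# The attempted pattern: prefix, greedy "anything", trailing negative lookahead.
attempt = r"^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))"
flags = re.MULTILINE | re.DOTALL

first = re.search(attempt, text, flags)
whole_input_matched = first.group(0) == text
match_count = len(re.findall(attempt, text, flags))

print(whole_input_matched)  # True: the single match spans the whole input
print(match_count)          # 1
```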

What works

I managed to get the first group working with the following regex, which matches:

beginning of the line
star in front
then whitespace
a word
colon
white space

^(\*\s+(\w+):\s)

Asked By: Bartek Lachowicz


Answer 1

re.findall(r'(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))', text)

Use a newline followed by * (or the end of the string, $) as the anchor
Group 1: a literal *, zero or more whitespace characters, one or more word characters, a literal : and zero or more whitespace characters
Group 2: match everything non-greedily (*?) up to \n\*\s*\w+:\s* (a newline followed by the Group 1 pattern) or $
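
As a rough check, the pattern can be run against the sample from the question. This is only a sketch: the variable names are illustrative, and the input is assumed to be the three entries shown in the question with no trailing newline.

```python
import re

# Sample input from the question.
text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# Group 1: the "* Level: " prefix.
# Group 2: lazily everything up to a newline that starts the next entry, or $.
pattern = r'(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))'

matches = re.findall(pattern, text)
for prefix, change in matches:
    print(repr(prefix), repr(change))
```

Note that no flags are needed here: [\s\S] already matches newlines, and $ is only meant to hit the end of the input.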

Answered By: TheMaster

Answer 2

You are almost there. You can keep the negative lookahead, but apply it right after matching a newline: if the assertion succeeds, match the whole line.

^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)

Explanation

^ Start of the line (with re.MULTILINE enabled)
( Capture group 1
- \*\s+\w+:\s Match *, 1+ whitespace chars, 1+ word chars, : and a whitespace char
) Close group 1
( Capture group 2
- .* Match the rest of the line
- (?: Non capture group to repeat as a whole
  - \n Match a newline
  - (?!\*\s+\w+:\s) The negative lookahead, asserting that the starting pattern is not here
  - .* Match the whole line
- )* Close the non capture group and optionally repeat it to match all lines
) Close group 2

See a regex demo and a Python demo.

Example code:

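import re

# Group 1: the "* Level: " prefix; group 2: the change text, including
# continuation lines that do not start a new "* Level: " entry.
pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"

s = ("* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch")

# re.MULTILINE lets ^ match at the start of every line.
result = re.findall(pattern, s, re.MULTILINE)
print(result)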

Output

[('* Major: ', 'This is a **Major** change'), ('* Minnor: ', 'This is is a minor change with a typo'), ('* Patch: ', 'This\nis a multiline\n  patch')]
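
If only the level word is needed, rather than the whole "* Level: " prefix, one possible variation is to give the level its own group. The sketch below is not part of the pattern above: the group names level and text, and the whitespace collapsing of multi-line entries, are assumptions added for illustration.

```python
import re

# Same structure as the pattern above, but the level word is captured on its
# own (hypothetical group names "level" and "text").
pattern = re.compile(
    r"^\*\s+(?P<level>\w+):\s(?P<text>.*(?:\n(?!\*\s+\w+:\s).*)*)",
    re.MULTILINE,
)

s = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# Collapse newlines and runs of spaces in each entry into single spaces.
changes = [
    (m.group("level"), " ".join(m.group("text").split()))
    for m in pattern.finditer(s)
]
print(changes)
```

The levels can then be filtered in ordinary Python (for example, keeping only known levels), which may be simpler than encoding the allowed words in the regex.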

Answered By: The fourth bird