Find the change level and the change text in .rst news files using regex

Question:

What I want

I'm trying to work out how to use a regex to find two groups in RST news files: the change level and the change text.

  • So I want a regex of the form (change level): (change text)
  • I was thinking about something like (change level): (anything until the next change level)

For instance, the following .rst file:

* Major: This is a **Major** change
* Minnor: This is is a minor change with a typo
* Patch: This
is a multiline
  patch

Should return a match, group 1 and group 2 as follows:

Match 1:

"* Major: This is a **Major** change"
"* Major: "
"This is a **Major** change"

Match 2:

"* Patch: This\nis a multiline\n  patch"
"* Patch: "
"This\nis a multiline\n  patch"

What I need help with

I cannot write a regex that handles multiple lines and the asterisks present in the change text.
I tried the following logic:

  1. Match the change level: ^(\*\s+(\w+):\s)
  2. Match anything, with the "dot matches newline" option turned on: .*
  3. Use a negative lookahead to stop at the next change level: (?!^(\*\s+(\w+):\s))
  • I ended up with ^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s)) but .* seems to just match everything into group 2
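
A small sketch of why this attempt swallows everything, assuming the pattern was run with the multiline and "dot matches newline" options (re.MULTILINE | re.DOTALL in Python's re module; the variable names are only illustrative). The greedy .* runs to the end of the input first, and the negative lookahead is then checked only at that final position, where it trivially succeeds.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# The attempted pattern: greedy .* (with DOTALL) followed by a negative lookahead.
attempt = re.compile(
    r"^(\*\s+(\w+):\s).*(?!^(\*\s+(\w+):\s))",
    re.MULTILINE | re.DOTALL,
)

matches = [m.group(0) for m in attempt.finditer(text)]

# .* consumed the whole input, so there is a single match covering all entries.
print(len(matches))        # 1
print(matches[0] == text)  # True
```

The lookahead never gets a chance to stop .* early, because it is evaluated once after .* has already consumed the later change levels.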


What works

I managed to get the first group working with the following regex:

  • beginning of the line
  • star in front
  • then whitespace
  • a word
  • colon
  • whitespace

^(\*\s+(\w+):\s)
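
As a quick check, and assuming Python's re module with re.MULTILINE so that ^ matches at the start of every line, this pattern can be run against the sample file to list the change levels it finds. The names header and levels are illustrative only.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# Group 1 is the whole "* Level: " prefix, group 2 is just the level word.
header = re.compile(r"^(\*\s+(\w+):\s)", re.MULTILINE)

prefixes = [m.group(1) for m in header.finditer(text)]
levels = [m.group(2) for m in header.finditer(text)]

print(prefixes)  # ['* Major: ', '* Minnor: ', '* Patch: ']
print(levels)    # ['Major', 'Minnor', 'Patch']
```

The continuation lines of the multiline patch entry and the ** inside the first entry are not picked up, since neither starts a line with "* word: ".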


Asked By: Bartek Lachowicz

||

Answers:

re.findall(r'(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))', text)
  • Use a newline followed by * or the end of the string $ as an anchor

  • Group 1: A literal * followed by zero or more whitespace characters, one or more word characters, a literal : and zero or more whitespace characters

  • Group 2: Match everything non-greedily with *? up to \n\*\s*\w+:\s* (like Group 1) or $
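
A minimal runnable sketch of this approach on the sample from the question (the variable names are illustrative). No flags are used, so $ only matches at the end of the input.

```python
import re

text = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

# Group 1: the "* Level: " prefix.
# Group 2: lazily everything up to the next "\n* Level:" or the end of the input.
pattern = re.compile(r"(\*\s*\w+:\s*)([\s\S]*?(?=\n\*\s*\w+:\s*|$))")

result = pattern.findall(text)
for prefix, change in result:
    print(repr(prefix), repr(change))
# '* Major: ' 'This is a **Major** change'
# '* Minnor: ' 'This is is a minor change with a typo'
# '* Patch: ' 'This\nis a multiline\n  patch'
```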

Answered By: TheMaster

You are almost there. You can write the pattern so that it matches a newline, applies the negative lookahead right after it, and matches the rest of that line only if the assertion succeeds.

^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)

Explanation

  • ^ Start of a line (the code below uses re.MULTILINE)
  • ( Capture group 1
    • \*\s+\w+:\s Match *, 1+ whitespace chars, 1+ word chars, : and a whitespace char
  • ) Close group 1
  • ( Capture group 2
    • .* Match the whole line
    • (?: Non capture group to repeat as a whole
    • \n Match a newline
      • (?!\*\s+\w+:\s) The negative lookahead, asserting that the starting pattern is not directly to the right
      • .* Match the whole line
    • )* Close the non capture group and optionally repeat it to match all lines
  • ) Close group 2

See a regex demo and a Python demo.

Example code:

import re

pattern = r"^(\*\s+\w+:\s)(.*(?:\n(?!\*\s+\w+:\s).*)*)"

s = ("* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch")

result = re.findall(pattern, s, re.MULTILINE)
print(result)

Output

[('* Major: ', 'This is a **Major** change'), ('* Minnor: ', 'This is is a minor change with a typo'), ('* Patch: ', 'This\nis a multiline\n  patch')]
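
If only the level name and a single-line version of the text are needed, one possible variation is sketched below. It is not part of the answer as given: the names ENTRY and parse_news are hypothetical, and collapsing the continuation lines into single spaces is an assumption about what the caller wants.

```python
import re

# Same structure as the pattern above, but with named groups and the
# level word captured without the surrounding "* " and ": ".
ENTRY = re.compile(
    r"^\*\s+(?P<level>\w+):\s(?P<text>.*(?:\n(?!\*\s+\w+:\s).*)*)",
    re.MULTILINE,
)

def parse_news(source):
    """Return a list of (level, text) tuples, one per change entry."""
    entries = []
    for m in ENTRY.finditer(source):
        # Collapse line breaks and indentation of continuation lines
        # into single spaces.
        text = " ".join(m.group("text").split())
        entries.append((m.group("level"), text))
    return entries

sample = (
    "* Major: This is a **Major** change\n"
    "* Minnor: This is is a minor change with a typo\n"
    "* Patch: This\n"
    "is a multiline\n"
    "  patch"
)

entries = parse_news(sample)
print(entries)
# [('Major', 'This is a **Major** change'),
#  ('Minnor', 'This is is a minor change with a typo'),
#  ('Patch', 'This is a multiline patch')]
```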
Answered By: The fourth bird