Matching everything except for a character followed by a newline

Question:

This seems like a simple match, but I’m unable to figure out how to match all text that starts with a known block of text and ends with a semicolon + newline. What I have right now mostly works:

pattern = r'''[ ]+(value \w+\n)([^;]+)'''

It lets me parse a section of text like this example:

   value Y1N5NALC
      1 = 'Yes'  
      5 = 'No'  
      7 = 'Not ascertained' ;
   value AGESCRN
      15 = '15 years'  
      16 = '16 years';  

However, if any of the key/value pairs contain a semicolon in the string the match fails early since the regex is looking for any semicolon. An example:

   value Y1N5NALC
      1 = 'Yes'  
      5 = 'No;Maybe'  
      7 = 'Not ascertained' ;
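
The early stop can be reproduced with a short sketch. It assumes the pattern from the question with its backslashes written out (`\w`, `\n`) and a sample string without trailing spaces; the variable names are illustrative only.

```python
import re

# Pattern from the question; [^;]+ stops at the first semicolon it meets,
# even when that semicolon sits inside a quoted label.
pattern = re.compile(r'''[ ]+(value \w+\n)([^;]+)''')

text = (
    "   value Y1N5NALC\n"
    "      1 = 'Yes'\n"
    "      5 = 'No;Maybe'\n"
    "      7 = 'Not ascertained' ;\n"
)

match = pattern.search(text)
body = match.group(2)

# The captured body is cut off inside 'No;Maybe', so the last
# key/value pair is never reached.
print(repr(body))
```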

What I’d like to do is end the match by looking for a semicolon + optional (space or tab) + newline. Using ([^;\n]+) fails since the newline is then excluded by the negated character class, so the match stops at the first line break.

Asked By: Hooked


Answers:

You can use

(?sm)^ +(value \w+\n)(.*?);$

See the regex demo.

Details:

  • (?sm) – re.S and re.M are on
  • ^ – start of a line
  • + – one or more spaces
  • (value \w+\n) – Group 1: value, space, one or more word chars, and an LF line break
  • (.*?) – Group 2: any zero or more chars (including line breaks, since re.S is on), as few as possible
  • ; – a ;
  • $ – at the end of a line.
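
As a rough illustration of the breakdown above, here is a minimal Python sketch that runs the LF-only pattern over sample text adapted from the question. The sample has its trailing spaces removed, because `;$` needs the semicolon to be the last character on the line; the names `block_re`, `blocks`, and `names` are illustrative only.

```python
import re

# LF-only pattern from the answer, with backslashes written out.
block_re = re.compile(r"(?sm)^ +(value \w+\n)(.*?);$")

# Sample adapted from the question (trailing spaces removed).
text = (
    "   value Y1N5NALC\n"
    "      1 = 'Yes'\n"
    "      5 = 'No;Maybe'\n"
    "      7 = 'Not ascertained' ;\n"
    "   value AGESCRN\n"
    "      15 = '15 years'\n"
    "      16 = '16 years';\n"
)

# Group 1 is the "value NAME" header line, Group 2 the key/value body.
blocks = [(m.group(1).strip(), m.group(2)) for m in block_re.finditer(text)]
names = [name for name, _ in blocks]

for name, body in blocks:
    print(name)
    print(body)
```

The semicolon inside 'No;Maybe' is skipped because it is not at the end of a line, so the lazy `.*?` keeps expanding until it reaches a semicolon that is.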

In case there can be CRLF endings, you need

(?sm)^ +(value \w+\r?\n)(.*?);\r?$
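
The question also asks for optional spaces or tabs between the semicolon and the line break. One possible adaptation, which is an assumption and not part of the pattern above, is to add `[ \t]*` after the semicolon. A minimal sketch, with illustrative names:

```python
import re

# Assumed variant: tolerate CRLF endings and trailing spaces/tabs
# after the closing semicolon.
block_re = re.compile(r"(?sm)^ +(value \w+\r?\n)(.*?);[ \t]*\r?$")

crlf_text = (
    "   value AGESCRN\r\n"
    "      15 = '15 years'  \r\n"
    "      16 = '16 years';  \r\n"
)
lf_text = (
    "   value X\n"
    "      1 = 'a;b'\n"
    "      2 = 'c' ;\t\n"
)

crlf_match = block_re.search(crlf_text)
lf_match = block_re.search(lf_text)

print(repr(crlf_match.group(2)))
print(repr(lf_match.group(2)))
```

A semicolon inside a quoted label would still end the match early if it happened to be followed only by spaces or tabs up to the end of its line.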
Answered By: Wiktor Stribiżew