Matching everything except for a character followed by a newline
Question:
This seems like a simple match, but I’m unable to figure out how to match all text that starts with a known block of text and ends with a semicolon + newline. What I have right now mostly works:
pattern = r'''[ ]+(value \w+\n)([^;]+)'''
This lets me parse a section of text such as:
value Y1N5NALC
1 = 'Yes'
5 = 'No'
7 = 'Not ascertained' ;
value AGESCRN
15 = '15 years'
16 = '16 years';
However, if any of the key/value pairs contains a semicolon inside the string, the match ends early, since [^;]+ stops at the first semicolon it sees. An example:
value Y1N5NALC
1 = 'Yes'
5 = 'No;Maybe'
7 = 'Not ascertained' ;
What I’d like to do is end the match by looking for a semicolon + optional (space or tab) + newline. Using ([^;\n]+) fails since the newline is then also excluded by the negated character class.
Answers:
You can use
(?sm)^ +(value \w+\n)(.*?);$
Details:
(?sm) – re.S and re.M are on
^ – start of a line
 + – one or more spaces
(value \w+\n) – Group 1: value, a space, one or more word chars, and an LF line break
(.*?) – Group 2: any zero or more chars, as few as possible
; – a ;
$ – at the end of a line.
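As a quick check, here is a minimal Python sketch that runs the pattern on the sample from the question. The leading spaces before each "value" line are an assumption (the pattern needs at least one), and the variable names are only illustrative.

```python
import re

# Sample text from the question. The indentation is assumed, since the
# pattern requires one or more spaces before "value".
text = (
    "  value Y1N5NALC\n"
    "    1 = 'Yes'\n"
    "    5 = 'No;Maybe'\n"
    "    7 = 'Not ascertained' ;\n"
    "  value AGESCRN\n"
    "    15 = '15 years'\n"
    "    16 = '16 years';\n"
)

# re.S lets "." cross line breaks; re.M makes ^ and $ match at each line.
pattern = re.compile(r"(?sm)^ +(value \w+\n)(.*?);$")

# Each result is a (Group 1, Group 2) tuple: the "value NAME" line and its pairs.
blocks = pattern.findall(text)
for header, body in blocks:
    print(repr(header), repr(body))
```

The semicolon inside 'No;Maybe' is skipped because it is not at the end of a line, so the first block runs through to the semicolon that closes it.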
In case there can be CRLF endings, you need
(?sm)^ +(value \w+\r?\n)(.*?);\r?$
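The question also mentions optional spaces or tabs after the closing semicolon. One possible way to allow that is to put [ \t]* between the semicolon and the line end. This is a hypothetical variation on the pattern above, not part of the original answer, sketched here with assumed CRLF sample text:

```python
import re

# Assumed variant: the CRLF-tolerant pattern plus [ \t]* after the semicolon,
# so trailing spaces or tabs before the line break are accepted.
pattern = re.compile(r"(?sm)^ +(value \w+\r?\n)(.*?);[ \t]*\r?$")

# Sample with CRLF line endings and trailing whitespace after the last ";".
crlf_text = (
    "  value Y1N5NALC\r\n"
    "    1 = 'Yes'\r\n"
    "    5 = 'No;Maybe'\r\n"
    "    7 = 'Not ascertained' ; \t\r\n"
)

match = pattern.search(crlf_text)
name_line, pairs = match.group(1), match.group(2)
print(repr(name_line))
print(repr(pairs))
```

Note that with CRLF input the captured groups keep their \r characters, so you may want to normalize line endings before further parsing.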