Why does this regex pattern fail to match a given string, and how can it be corrected?

Question:

Using Python regex, I want to capture every block of text that satisfies one of the three patterns described below.

(~ means zero or more characters)

[pattern1] NAME_ "words_or_numbers" AGE_ my_num ~;

[pattern2] NAME_ "words_or_numbers" DESC_ my_num ~;

[pattern3] NAME_ADD_ "words_or_numbers" CHAR_DESC_ADD_ word_or_numbers_or_underscore DESC_ my_num ~;

For [pattern1], [pattern2], and [pattern3], I’d like to find only the text whose my_num is one of a given set of values. In the example below, I picked 373 and 416 as the my_num values.

(Note that a match for any of the patterns can span multiple lines.)

Original Text:

NAME_ "Hello" AGE_ 373 0;
NAME_ "Summer" AGE_ 340 0;
NAME_ "Sam" AGE_ 416 14;
NAME_ "Edward" DESC_ 373 ABC_DEF_G "These are users.

age, description

- example(0x15) , Isfalse : 0xF+df

- safe.

- (t) = + 1";
NAME_ "Alex" DESC_ 373 asdf 65535;
NAME_ADD_ "Crystal" CHAR_DESC_ADD_ GGE_R DESC_ 373 ABCD 340;
NAME_ "Ray" DESC_ 111 asdfs 3;
NAME_ "Brown" DESC_ 416 asdfs 3;
NAME_ADD_ "Hailey" CHAR_DESC_ADD_ GGE3 DESC_ 416 ABCD 120;
NAME_ "Watson" AGE_ 373 0;
NOT_NAME_ 324 XYZ 22 "A" 1 "B" 2 "C" 3 "R" ;

Desired Output:

NAME_ "Hello" AGE_ 373 0;
NAME_ "Sam" AGE_ 416 14;
NAME_ "Edward" DESC_ 373 ABC_DEF_G "These are users.

age, description

- example(0x15) , Isfalse : 0xF+df

- safe.

- (t) = + 1";
NAME_ "Alex" DESC_ 373 asdf 65535;
NAME_ADD_ "Crystal" CHAR_DESC_ADD_ GGE_R DESC_ 373 ABCD 340;
NAME_ "Brown" DESC_ 416 asdfs 3;
NAME_ADD_ "Hailey" CHAR_DESC_ADD_ GGE3 DESC_ 416 ABCD 120;
NAME_ "Watson" AGE_ 373 0;

I’ve tried a regex like this (with the re.findall method):

(?s)((NAME_ .+ (AGE_|DESC_) (373|416) .?(?=NAME_|NOT_NAME_|$))|(NAME_ADD_ .+ CHAR_DESC_ADD_ .+ DESC_ (373|416) .?(?=NAME_|NOT_NAME_|$)))

but it captured nothing. What’s wrong with my attempt, and how can this be done properly?

Asked By: Stella


Answers:

The main problem I see with the regex is that, after my_num, it only matches a space and a single optional character before the lookahead has to succeed. No sequence in your original text matches this, which is why the result is empty. The .+ should also be changed to exclude the ; character; otherwise the regex could match the whole file as long as the first few and last few characters together match one of the patterns.
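
Both points can be checked with a small sketch. The pattern is the one from the question; the two short input strings are trimmed-down assumptions built from the sample records, not the full original text.

```python
import re

# The pattern from the question, unchanged.
original = (
    r'(?s)((NAME_ .+ (AGE_|DESC_) (373|416) .?(?=NAME_|NOT_NAME_|$))'
    r'|(NAME_ADD_ .+ CHAR_DESC_ADD_ .+ DESC_ (373|416) .?(?=NAME_|NOT_NAME_|$)))'
)

# Two records from the sample text. After "373 " come the three characters
# "0;\n", but .? allows at most one character before the lookahead must see
# NAME_, NOT_NAME_ or the end of the string, so nothing matches.
sample = 'NAME_ "Hello" AGE_ 373 0;\nNAME_ "Summer" AGE_ 340 0;\n'
no_match = re.findall(original, sample)
print(no_match)

# Hypothetical input: an unwanted record (340) followed by a wanted one whose
# tail is short enough for .? to succeed. The greedy .+ crosses the ';' and
# the match swallows the unwanted record as well.
spanning = 'NAME_ "Summer" AGE_ 340 0;\nNAME_ "Hello" AGE_ 373 0'
spanning_match = re.search(original, spanning)
print(repr(spanning_match.group(0)))
```

The first call returns an empty list, and the second match covers the entire string, including the record with 340.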

You could change the .+ to [^;]+ and the .? after my_num to [^;]*;. The [^;] matches any character that is not ;. With these changes, the lookahead assertion (?=NAME_|NOT_NAME_|$) is no longer needed. The new regex could look like this:

(?s)((NAME_ [^;]+ (AGE_|DESC_) (373|416) [^;]*;)|(NAME_ADD_ [^;]+ CHAR_DESC_ADD_ [^;]+ DESC_ (373|416) [^;]*;))
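
As a quick check, the regex above can be run against the sample text from the question. This is only a sketch: collecting whole matches with finditer instead of re.findall is a choice made here for readability, and the variable names are illustrative.

```python
import re

# Sample text copied from the question.
text = '''NAME_ "Hello" AGE_ 373 0;
NAME_ "Summer" AGE_ 340 0;
NAME_ "Sam" AGE_ 416 14;
NAME_ "Edward" DESC_ 373 ABC_DEF_G "These are users.

age, description

- example(0x15) , Isfalse : 0xF+df

- safe.

- (t) = + 1";
NAME_ "Alex" DESC_ 373 asdf 65535;
NAME_ADD_ "Crystal" CHAR_DESC_ADD_ GGE_R DESC_ 373 ABCD 340;
NAME_ "Ray" DESC_ 111 asdfs 3;
NAME_ "Brown" DESC_ 416 asdfs 3;
NAME_ADD_ "Hailey" CHAR_DESC_ADD_ GGE3 DESC_ 416 ABCD 120;
NAME_ "Watson" AGE_ 373 0;
NOT_NAME_ 324 XYZ 22 "A" 1 "B" 2 "C" 3 "R" ;
'''

# The corrected regex, split over two string literals for readability.
pattern = re.compile(
    r'(?s)((NAME_ [^;]+ (AGE_|DESC_) (373|416) [^;]*;)'
    r'|(NAME_ADD_ [^;]+ CHAR_DESC_ADD_ [^;]+ DESC_ (373|416) [^;]*;))'
)

# finditer yields match objects; group(0) is the full matched record.
records = [m.group(0) for m in pattern.finditer(text)]

# re.findall returns one tuple per match because the pattern contains
# capture groups; the first element of each tuple is the outer group.
from_findall = [groups[0] for groups in pattern.findall(text)]

for record in records:
    print(record)
```

Because [^;] also matches newlines, the multiline "Edward" record is captured as a single match, and the (?s) flag no longer has any effect since the pattern contains no dot.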
Answered By: m77m77