Why is the closing parenthesis missing when matching a multi-character group 0 or 1 times in a Python regex?

Question:

I only know how to match one character 0 or 1 time in regex, for example

import re

content = "abc"
print(re.match(r'abc?', content))  # matches 'abc'
content = "ab"
print(re.match(r'abc?', content))  # matches 'ab'

Now I have two real cases:

content = "民国4年(1915年)2至3月" #include parentheses
#content = "民国4年2至3月" #not include
print(re.match(r'.*年(\(.{1,5}\))?', content).group())

The problem is that the actual result is 民国4年(1915年. I don't know why the right parenthesis is missing.

Asked By: 4daJKong


Answers:

You can fix your immediate problem by making the .* in your pattern lazy:

# -*- coding: utf-8 -*-
import re

content = "民国4年(1915年)2至3月"
print(re.match(r'.*?年(\(.*?\))?', content).group())

# 民国4年(1915年)
Answered By: Tim Biegeleisen

The .*年 part is greedy and matches 民国4年(1915年 all by itself, because .* consumes everything up to the last 年. The trailing ? makes the group (\(.{1,5}\)) optional, and at that position the next character is ), not (, so the group matches nothing. The final result is only what .*年 matched.
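
One way to check this explanation, as a minimal sketch using only the standard re module (the variable names are just for illustration): the greedy prefix alone already produces the reported text, and the optional group takes no part in the match, so group(1) is None.

```python
import re

content = "民国4年(1915年)2至3月"

# The greedy prefix on its own already runs up to the last 年.
prefix_only = re.match(r'.*年', content)

# The full pattern from the question, with the literal parentheses escaped.
full = re.match(r'.*年(\(.{1,5}\))?', content)

print(prefix_only.group())  # 民国4年(1915年
print(full.group())         # 民国4年(1915年
print(full.group(1))        # None: the optional group matched nothing
```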

Make .*年 non-greedy by using .*?年 and it will match only up to the first 年, leaving (1915年) for the optional group:

import re

content1 = "民国4年(1915年)2至3月" #include parentheses
content2 = "民国4年2至3月" # not include

print(re.match(r'.*?年(\(.{1,5}\))?', content1).group())
print(re.match(r'.*?年(\(.{1,5}\))?', content2).group())

Output:

民国4年(1915年)
民国4年
Answered By: Mark Tolonen

The result 民国4年(1915年 is what r'.*年' matches on its own; it has nothing to do with the later part of the pattern, (\(.*?\))?.
You can change the pattern to r'.{1,4}年(\(.*?\))?' so that the prefix cannot reach the later 年.
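
As a rough sketch of how the bounded pattern behaves on both inputs from the question (match_year is a hypothetical helper name, not something from the original posts):

```python
import re

# Bounded prefix: at most four characters may precede the first 年.
PATTERN = re.compile(r'.{1,4}年(\(.*?\))?')

def match_year(content):
    """Return (whole match, parenthesized part or None), or None if no match."""
    m = PATTERN.match(content)
    if m is None:
        return None
    return m.group(), m.group(1)

print(match_year("民国4年(1915年)2至3月"))  # ('民国4年(1915年)', '(1915年)')
print(match_year("民国4年2至3月"))          # ('民国4年', None)
```

Note that this relies on the assumption that no more than four characters come before the first 年. A longer prefix such as 光绪三十四年 would not match at all, so the non-greedy .*?年 may be the safer choice when the prefix length varies.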

Answered By: ramwin