Why is the closing parenthesis missing when matching a multi-character group 0 or 1 times in a Python regex?

Question

I only know how to match a single character 0 or 1 times in a regex, for example:

import re

content = "abc"
print(re.match(r'abc?', content))  # matches 'abc'
content = "ab"
print(re.match(r'abc?', content))  # matches 'ab'

Now I have two real cases, one with a parenthesized year and one without:

content = "民国4年(1915年)2至3月"  # includes parentheses
#content = "民国4年2至3月"  # does not include parentheses
print(re.match(r'.*年(\(.{1,5}\))?', content).group())

The problem is that the actual result is 民国4年(1915年. I don't know why the closing parenthesis is missing.

Asked By: 4daJKong


Answer 1

You can fix your immediate problem by making the .* in your pattern lazy:

# -*- coding: utf-8 -*-
import re

content = "民国4年(1915年)2至3月"
print(re.match(r'.*?年(\(.*?\))?', content).group())

# 民国4年(1915年)

Answered By: Tim Biegeleisen

Answer 2

.*年 is greedy: it matches everything up to the last 年, so it consumes 民国4年(1915年 all by itself. The next character is then ), which cannot start \(.{1,5}\). Because the trailing ? in (\(.{1,5}\))? makes the whole (1915年) group optional, the regex simply skips it, and the final result is only what .*年 matched.
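To see this for yourself, here is a minimal sketch that inspects the optional group after a greedy match. It assumes the parentheses in the pattern are escaped as \( and \); the variable name greedy is only for illustration.

```python
import re

content = "民国4年(1915年)2至3月"

# Greedy prefix: .* runs to the end of the string, then backtracks
# only as far as the LAST 年 it can find.
greedy = re.match(r'.*年(\(.{1,5}\))?', content)

print(greedy.group(0))  # 民国4年(1915年
print(greedy.group(1))  # None, the optional group did not participate
```

group(1) being None shows that the parenthesized part was skipped rather than partially matched.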

Make .*年 non-greedy by using .*?年 and it will only match up to the first 年:

import re

content1 = "民国4年(1915年)2至3月"  # includes parentheses
content2 = "民国4年2至3月"  # does not include parentheses

print(re.match(r'.*?年(\(.{1,5}\))?', content1).group())
print(re.match(r'.*?年(\(.{1,5}\))?', content2).group())

Output:

民国4年(1915年)
民国4年

Answered By: Mark Tolonen

Answer 3

Here the result 民国4年(1915年 is entirely the match of r'.*年'; the later part of your pattern, (\(.{1,5}\))?, contributes nothing to it.
You can change your pattern to r'.{1,4}年(\(.*?\))?', which limits the prefix to at most four characters so that it cannot reach the later 年.
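A quick sketch of this variant run against both inputs from the question. It assumes the text before the first 年 is never longer than four characters, which holds for these two samples but may not hold for your real data; the names pattern, samples, and results are illustrative.

```python
import re

# Bounded prefix: at most 4 characters before 年, so the prefix
# cannot stretch as far as the 年 inside the parentheses.
pattern = re.compile(r'.{1,4}年(\(.*?\))?')

samples = ["民国4年(1915年)2至3月", "民国4年2至3月"]
results = [pattern.match(s).group() for s in samples]

for r in results:
    print(r)
# 民国4年(1915年)
# 民国4年
```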

Answered By: ramwin
