Why is the closing parenthesis missing when matching a multi-character group 0 or 1 times in a Python regex?
Question:
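I only know how to make a single character optional (match it 0 or 1 times) in a regex, for example:
import re
content = "abc"
print(re.match(r'abc?', content))  # returns a match object (matched 'abc')
content = "ab"
print(re.match(r'abc?', content))  # returns a match object (matched 'ab')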
Now I have two real cases:
content = "民国4年(1915年)2至3月"  # includes parentheses
# content = "民国4年2至3月"  # no parentheses
print(re.match(r'.*年(\(.{1,5}\))?', content).group())
The problem is that the actual result is 民国4年(1915年
I don't know why the closing parenthesis is missing.
Answers:
You can fix your immediate problem by making the .* in your pattern lazy:
# -*- coding: utf-8 -*-
import re
content = "民国4年(1915年)2至3月"
print(re.match(r'.*?年(\(.*?\))?', content).group())
# 民国4年(1915年)
The .*年 part is greedy: it matches 民国4年(1915年 all by itself, because it consumes everything up to the last 年. The trailing ? in (\(.{1,5}\))? makes the parenthesized part, (1915年), optional. After the greedy prefix, the next character is ) rather than (, so the optional group matches nothing, and the final result is only what .*年 matched.
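A minimal sketch of that behaviour, assuming the parentheses in the input are the ASCII ones shown in the question and that the literal parentheses in the pattern are escaped. It prints what the greedy prefix matches on its own, and shows that the optional group did not take part in the match:

```python
import re

content = "民国4年(1915年)2至3月"

# The greedy prefix alone: .* grabs everything, then backtracks
# only as far as the LAST 年 in the string.
greedy_prefix = re.match(r'.*年', content)
print(greedy_prefix.group())   # 民国4年(1915年

# The full greedy pattern. After the prefix, the next character is ")",
# so \( cannot match and the optional group matches zero times.
m = re.match(r'.*年(\(.{1,5}\))?', content)
print(m.group())    # 民国4年(1915年
print(m.group(1))   # None, because the group did not participate
```

Checking group(1) for None is a quick way to tell whether an optional group actually matched anything.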
Make .*年 non-greedy by writing .*?年 and it will match only up to the first 年:
import re
content1 = "民国4年(1915年)2至3月"  # includes parentheses
content2 = "民国4年2至3月"  # no parentheses
print(re.match(r'.*?年(\(.{1,5}\))?', content1).group())
print(re.match(r'.*?年(\(.{1,5}\))?', content2).group())
Output:
民国4年(1915年)
民国4年
Here the result 民国4年(1915年 is matched entirely by .*年; the later part of your pattern, (\(.*?\))?, contributes nothing to it.
You can change your pattern to r'.{1,4}年(\(.*?\))?' so that the prefix cannot reach the later 年.
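As a rough check of the bounded-prefix pattern, the sketch below runs it against both inputs from the question. The names pattern, samples, and results are only illustrative, and the bound of 4 is an assumption that the text before the first 年 is never longer than four characters; with a longer prefix, re.match would return None.

```python
import re

# Bounded prefix: at most 4 characters before the first 年,
# then an optional parenthesized part with escaped literal parentheses.
pattern = re.compile(r'.{1,4}年(\(.*?\))?')

samples = [
    "民国4年(1915年)2至3月",  # includes parentheses
    "民国4年2至3月",          # no parentheses
]

results = [pattern.match(s).group() for s in samples]
for r in results:
    print(r)
# 民国4年(1915年)
# 民国4年
```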