Python regex doesn't match 1 occurrence with 0 or 1 occurrences operator?

Question:

I have date strings of the following forms
‘8 april 2022’, ‘8 april’, ‘april’
and a regex to try and match any of them

re.findall(r"(\d{1,2})?.*(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*(202\d)?", str)

The problem is that it returns ('8', 'april', '') when str = '8 april 2022'.
So my question is: why does ? ignore 1 occurrence of 202\d when it's there?
Thank you.

EDIT. With non greedy .*?

re.findall(r"(\d{1,2}).*?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december).*?(202\d)?", str)

It still doesn't capture 2022.

EDIT 2. Considering the answers a better question would be:
Is there a way of saying ‘hey regex 1 occurrence is optional but preferable to 0’ ?

Asked By: Sev

||

Answers:

.* should rarely be used because it is greedy: the .* after the month matches the rest of the string and leaves nothing for the third capture group (the year) to match. You also only need to match 1+ spaces between the parts. The important step is to make the whole part between month and year optional by wrapping it in a non-capture group, as shown below.

You may use this regex with non-optional matches, a word boundary and a bit of tweaking:

\b(?:(\d{1,2}) +)?(januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?: +(202\d))?
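As a quick check, here is a minimal sketch that runs this pattern over the three sample strings from the question. The names MONTHS and DATE_RE are only illustrative and are not part of the original answer.

```python
import re

# Dutch month names from the question, joined into one alternation.
MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# Day and year each sit inside an optional non-capturing group together
# with their separating spaces, so either can be absent as a unit.
DATE_RE = re.compile(r"\b(?:(\d{1,2}) +)?(" + MONTHS + r")(?: +(202\d))?")

for text in ["8 april 2022", "8 april", "april"]:
    # findall returns one (day, month, year) tuple per match;
    # a group that did not take part shows up as ''.
    print(repr(text), "->", DATE_RE.findall(text))
```

Because the spaces are inside the optional group with the year, the engine tries " 2022" as a whole before falling back to the empty alternative, so the year is captured whenever it is present.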


Answered By: anubhava

The .* matches " 2022" and then the (202\d)? matches "", as it’s optional and there’s nothing left.

The .*? matches "" and then the (202\d)? matches "", as it’s optional and the remaining " 2022" doesn’t even start with 2.

You wish it would search further so that the (202\d)? matches the "2022", but why should it search further? It already found a match, so it stops and reports that.
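To see this for yourself, one option is to wrap each wildcard in its own capture group and inspect what it consumed. This is only a sketch: the extra groups and the names MONTHS, greedy and lazy are added for illustration and are not in the original patterns.

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")
text = "8 april 2022"

# Greedy version of the question's pattern, with each .* made visible.
greedy = re.search(r"(\d{1,2})?(.*)(" + MONTHS + r")(.*)(202\d)?", text)
print(greedy.groups())

# Non-greedy version from the question's first edit, same trick.
lazy = re.search(r"(\d{1,2})(.*?)(" + MONTHS + r")(.*?)(202\d)?", text)
print(lazy.groups())
```

In the greedy case the fourth group holds ' 2022', and in the lazy case it holds '', while the year group is None both times (re.findall reports that None as '').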

Answered By: Kelly Bundy

In the last part of your regex pattern, .*(202\d)?, the 2022 is consumed by the .* and consequently (202\d) captures nothing.

This is for your perusal, but it may not be exactly what you wanted.

matches = re.findall(r"(?:\d{0,2}\s*)(?:januari|februari|maart|april|mei|juni|juli|augustus|september|oktober|november|december)(?:\s202\d)?", str)
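Note that this pattern contains only non-capturing groups, so re.findall returns the whole matched text rather than tuples. A small sketch, with MONTHS and WHOLE_DATE_RE as illustrative names:

```python
import re

MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")

# No capturing groups: findall yields each full match as a string.
WHOLE_DATE_RE = re.compile(r"(?:\d{0,2}\s*)(?:" + MONTHS + r")(?:\s202\d)?")

for text in ["8 april 2022", "8 april", "april"]:
    print(repr(text), "->", WHOLE_DATE_RE.findall(text))
```

If you need day, month and year separately, you would have to turn the relevant groups back into capturing ones.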

For '3 mei woensdag 2022', this may not be exactly what you wanted, but it should work for the year:

matches = re.findall(r"(?:\d{0,2}\s*)(?:\w+\s*)+(?:\s*202\d)?", str)

Answered By: Jobo Fernandez