Regular Expression, using non-greedy to catch optional string

Question:

I am parsing the content of a PDF with PDFMiner. Sometimes a particular line is present and other times it is not. I am trying to express that optional line in a regular expression, without success. Here is a piece of code that shows the problem:

#!/usr/bin/python3
# coding=UTF8

import re

# Simulate reading text of a PDF file with PDFMiner.
pdfContent = """

Blah blah.

Date:  2022-01-31

Optional line here which sometimes does not show

Amount:  123.45

2: Blah blah.

"""

RE = re.compile(
    r".*?"
    r"Date:\s+(\S+).*?"
    r"(Optional line here which sometimes does not show){0,1}.*?"
    r"Amount:\s+(?P<amount>\S+)\n.*?"
    , re.MULTILINE | re.DOTALL)

matches = RE.match(pdfContent)

date     = matches.group(1)
optional = matches.group(2)
amount   = matches.group("amount")

print(f"date     = {date}")
print(f"optional = {optional}")
print(f"amount   = {amount}")

The output is:

date     = 2022-01-31
optional = None
amount   = 123.45

Why is optional None? Notice that if I replace the {0,1} with {1}, it works! But, then the line is not optional anymore. I do use the non-greedy .*? everywhere…

This is probably a duplicate, but I searched and searched and did not find my answer, thus this question.

Asked By: Hans Deragon


Answers:

You can use re.search (instead of re.match) with

Date:\s+(\S+)(?:.*?(Optional line here which sometimes does not show))?.*?Amount:\s+(?P<amount>\S+)

See the regex demo.

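In this pattern, the lazy .*? and the capturing group (Optional line here which sometimes does not show) are wrapped together in an optional non-capturing group, (?:...)? (note that {0,1} is the same as ?). Since ? is a greedy quantifier, the contents of that group must be tried at least once, and the group is skipped only if they cannot match. In your original pattern, the .*? in front of the optional group first matches the empty string, the optional group is then tried only at that position, fails, and is skipped; the following .*? then consumes the optional line on its way to Amount:.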
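The difference can be seen on a shortened input. This is only a minimal sketch: the string and the placeholder OPT are made up for illustration and stand in for the real optional line.

```python
import re

# Hypothetical, shortened stand-in for the PDF text.
text = "Date: 1\nOPT\nAmount: 2"

# Lazy .*? outside the optional group: it matches nothing at first, (OPT)? is
# tried right after the date, fails and is skipped; the second .*? then
# swallows OPT while looking for "Amount:".
original = re.compile(r"Date:\s+(\S+).*?(OPT)?.*?Amount:\s+(\S+)", re.DOTALL)

# Lazy .*? inside the optional group: the greedy ? makes the engine expand
# .*? until OPT is found before it considers skipping the group.
fixed = re.compile(r"Date:\s+(\S+)(?:.*?(OPT))?.*?Amount:\s+(\S+)", re.DOTALL)

original_groups = original.search(text).groups()
fixed_groups = fixed.search(text).groups()

print(original_groups)  # ('1', None, '2')
print(fixed_groups)     # ('1', 'OPT', '2')
```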

In your code, you can use it as

RE = re.compile(
    r"Date:\s+(\S+)(?:.*?"
    r"(Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)

matches = RE.search(pdfContent)

See the Python demo:

import re
 
pdfContent = "\n\nBlah blah.\n\nDate:  2022-01-31\n\nOptional line here which sometimes does not show\n\nAmount:  123.45\n\n2: Blah blah.\n"

RE = re.compile(
    r"Date:\s+(\S+)(?:.*?"
    r"(Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)
 
matches = RE.search(pdfContent)
date     = matches.group(1)
optional = matches.group(2)
amount   = matches.group("amount")
 
print(f"date     = {date}")
print(f"optional = {optional}")
print(f"amount   = {amount}")

Output:

date     = 2022-01-31
optional = Optional line here which sometimes does not show
amount   = 123.45
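
Since the line is optional, it may help to check the case where it is missing as well. The sketch below assumes the same pattern as above; the helper name parse and the shortened sample strings are hypothetical. It also guards against re.search returning None when the text does not match at all.

```python
import re

RE = re.compile(
    r"Date:\s+(\S+)(?:.*?"
    r"(Optional line here which sometimes does not show))?.*?"
    r"Amount:\s+(?P<amount>\S+)",
    re.DOTALL)

def parse(text):
    """Return (date, optional, amount), or None if the pattern does not match."""
    m = RE.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group("amount")

# Hypothetical sample inputs, with and without the optional line.
with_line = ("Date:  2022-01-31\n\n"
             "Optional line here which sometimes does not show\n\n"
             "Amount:  123.45\n")
without_line = "Date:  2022-01-31\n\nAmount:  123.45\n"

print(parse(with_line))         # optional group is captured
print(parse(without_line))      # optional group is None
print(parse("no fields here"))  # no match at all: None
```

One caveat: with re.DOTALL, the .*? inside the optional group can scan past the Amount: line, so if the text holds several records, an optional line from a later record could be captured. Splitting the text into records first is one way to avoid that.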
Answered By: Wiktor Stribiżew