Python Regex to Capture Proceeding Text – mixing case insensitivity in group


Example Link

RegEx Group returning issue:

(?P<qa_type>(Q|A|Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+))?([.|:|\s]+)?

The goal is to extract text from proceeding transcript PDFs for each question/answer/speaker type.

Using Python: iterate through the pages of the text extracted from the PDF and group the Question/Answer text.

Desired Results = qa_type, page_start, page_end, line_num_start, line_num_end, qa_text

For the [Q|A] designators, I only want upper case, but the speaker titles (Mr., Mrs., Dr., etc.) must be matched case-insensitively. Both the Q|A designators and the speaker salutations should land in a single ‘qa_type’ group.

Request: How do I prevent ‘qa_type’ from capturing ‘a’ or ‘q’? See lines 2 and 17 on p. 275.

Example bad extract – line 17 ‘a’

regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+))?([.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"
Asked By: rnwtenor



This sounds pretty similar to this question. Python's re module does not let you switch a flag on and off mid-pattern with (?i) ... (?-i): a global inline flag anywhere other than the start of the pattern has been deprecated since Python 3.6 and is an error since 3.11. Python 3.6 and later do support scoped inline flags of the form (?i:...), which apply only inside their own group. With those, your regex would look like this (without the global case-insensitive flag):

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|(?i:Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+)))?([.|:|\s]+)?(?P<type_text>\b.*)|(?i:page) (?P<page_num>\d{1,3})
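As a quick check of the idea, here is a minimal sketch using the scoped form (?i:...), which Python 3.6 and later accept. The pattern is a simplified version of the one in the question, not a drop-in replacement: it folds the titles into one alternation and drops the | inside the character classes (inside brackets, | matches a literal pipe). The sample lines are invented for illustration.

```python
import re

# Simplified pattern: Q/A must be upper case; titles are matched
# case-insensitively via the scoped group (?i:...). Requires Python 3.6+.
QA_TYPE = re.compile(
    r"^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +"
    r"(?P<qa_type>Q|A|(?i:(?:mr|mrs|ms|miss|dr)[.:]? [a-z]+))?"
    r"[.:\s]*"
    r"(?P<type_text>\b.*)"
)

# Hypothetical transcript lines, invented for illustration.
samples = [
    "2 a copy of the report was sent.",
    "5 Q. Did you review the document?",
    "9 MR. SMITH: Objection.",
    "17 A Yes, I did.",
]

for line in samples:
    m = QA_TYPE.match(line)
    # qa_type is None when the line carries no designator or title.
    print(m.group("line_num"), repr(m.group("qa_type")), repr(m.group("type_text")))
```

With this pattern a lower-case "a" at the start of a line's text stays in type_text, while "MR. SMITH" and "mr. smith" are both captured as qa_type.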

The alternative is to just specify both the lowercase and uppercase characters every time you want a case-insensitive letter (again, without the global case-insensitive flag):

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|[mM][rR][.|:]? [a-zA-Z]+|[mM][rR][sS][.|:]? [a-zA-Z]+|[mM][sS][.|:]? [a-zA-Z]+|[mM][iI][sS][sS][.|:]? [a-zA-Z]+|[dD][rR][.|:]? [a-zA-Z]+))?([.|:|\s]+)?(?P<type_text>\b.*)|[pP][aA][gG][eE] (?P<page_num>\d{1,3})
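To connect this to the fields listed in the question (qa_type, page_start, page_end, line_num_start, line_num_end, qa_text), the following is a rough sketch of a grouping loop built on a condensed explicit-character-class pattern. The function name group_transcript, the sample lines, and the assumption that each "page N" marker appears before the lines of that page are all hypothetical; the real layout of the extracted text may differ.

```python
import re

# Explicit character classes give case-insensitive titles without any flag.
LINE = re.compile(
    r"^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +"
    r"(?P<qa_type>Q|A|(?:[mM][rR][sS]?|[mM][sS]|[mM][iI][sS][sS]|[dD][rR])[.:]? [a-zA-Z]+)?"
    r"[.:\s]*(?P<type_text>\b.*)"
    r"|^[pP][aA][gG][eE] (?P<page_num>\d{1,3})"
)


def group_transcript(lines):
    """Merge continuation lines into one record per speaker turn."""
    records, current, page = [], None, None
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # neither a numbered line nor a page marker
        if m.group("page_num"):
            page = int(m.group("page_num"))  # assumed to precede its lines
            continue
        num, text = int(m.group("line_num")), m.group("type_text")
        if m.group("qa_type"):
            # A designator or title starts a new record.
            current = {
                "qa_type": m.group("qa_type"),
                "page_start": page, "page_end": page,
                "line_num_start": num, "line_num_end": num,
                "qa_text": text,
            }
            records.append(current)
        elif current is not None:
            # No designator: treat the line as a continuation.
            current["page_end"], current["line_num_end"] = page, num
            current["qa_text"] += " " + text
    return records


# Hypothetical input, invented for illustration.
sample = [
    "page 275",
    "1 Q. Did you send",
    "2 a copy of the report?",
    "3 A. Yes.",
]
for record in group_transcript(sample):
    print(record)
```

Because Q and A are matched only in upper case, line 2 above is appended to the preceding question instead of starting a spurious "a" record.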

Updated regex101 link

Answered By: rpm