Python regex not capturing groups properly

Question:

I have the following regex: (?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))

Input text examples:

  1. RE:11567 Miss Jane Doe 12345678
  2. Reference: Miss Jane Doe 12345678
  3. RE:J123 Miss Jane Doe 12345678
  4. RE:J123 Miss Jane Doe 12345678 Reference: Test Company

Sample Code:

import re

pattern = re.compile(r'(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))')
result = pattern.findall('RE:11693 Miss Jane Doe 12345678')

For all 4 inputs I expect the output ('Miss Jane Doe', 'Miss', 'Jane', 'Doe'). However, for the 4th example I get [('Miss Jane Doe', 'Miss', 'Jane', 'Doe'), (' Test Company', '', 'Test', 'Company')]

How can I get the correct output?

Asked By: West


Answers:

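Add ^ to the start of the regex so that it only matches at the start of the string. Without the anchor, findall keeps scanning and finds a second match at "Reference: Test Company" in the 4th example. The pattern becomes:

^(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))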
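
As a quick check, here is a minimal sketch that runs the anchored pattern against the four inputs from the question. The variable names samples and results are just for illustration, and the raw string (r'...') is my addition so that the backslashes reach the regex engine unchanged.

```python
import re

# Anchored pattern from the answer above; ^ restricts the match
# to the very start of the input string.
pattern = re.compile(
    r'^(?:RE:\w+|Reference:)\s*((Mr|Mrs|Ms|Miss)?\s+([\w-]+)\s(\w+))'
)

# The four example inputs from the question.
samples = [
    'RE:11567 Miss Jane Doe 12345678',
    'Reference: Miss Jane Doe 12345678',
    'RE:J123 Miss Jane Doe 12345678',
    'RE:J123 Miss Jane Doe 12345678 Reference: Test Company',
]

# findall returns one tuple of groups per match; with the anchor
# there can be at most one match per string.
results = [pattern.findall(text) for text in samples]

for found in results:
    print(found)  # [('Miss Jane Doe', 'Miss', 'Jane', 'Doe')] for each input
```

Note that ^ only matches at the start of the whole string unless re.MULTILINE is passed, so this assumes each reference is processed as its own string. If you would rather leave the pattern unanchored, pattern.match(text) should also work, since match only attempts a match at the beginning of the string; the groups are then available from .groups() on the returned match object.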

Answered By: Gamma032