Parsing dates in different formats from text

Question:

I have a DataFrame whose raw text column contains date ranges written in several different formats. I want to extract these date ranges into a separate column.

Sample raw text:

"Sales Assistant @ DFS Duration – June 2021 – 2023 Currently
working in XYZ Within the role I am expected to achieve sales targets
which I currently have no problems reaching. Job Role/Establishment –
Plasterer @ XX Plasterer’s Duration – September 2016 – Nov 2016
Job Role/Establishment – Customer Advisor @ AA Duration – (2015 –
2016)
Job Role/Establishment – Warehouse Operative @ xyz Duration –
03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment – Airport Terminal Assistant @ port Duration – 01/2012 – 06/2013
Working at the airport . Job Role/Establishment – Apprentice Floorer @
YY Floors Duration – DEC 2010 – APRIL 2012 "

Expected DataFrame:

id      Raw_text                   Dates
01     "sample_raw_text"         June 2021 - 2023 , September 2016 - Nov 2016,(2015 – 2016),03/2014 to 08/2015 , 01/2012 - 06/2013, DEC 2010 – APRIL 2012

I have tried the pattern below:

def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'

    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)

With the above I am unable to get the required output. Please advise what I am missing.

Asked By: Roshankumar

||

Answers:

Try this:

\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?|(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*)
  • \(? an optional (.

  • (?:\b[A-Za-z]{3,9}\s*)? non-capturing group.

    • \b a word boundary.
    • [A-Za-z]{3,9} between 3 and 9 letters.
    • \s* zero or more whitespace characters.
    • ? makes the whole group optional.
  • (?:\d\d?/){0,2} non-capturing group, repeated zero to two times.

    • \d a digit between 0-9.
    • \d? an optional second digit between 0-9.
    • / a literal forward slash /.
  • [12]\d{3}

    • [12] match one digit from the listed digits 1 or 2.
    • \d{3} three digits between 0-9.
  • \)? an optional ).

  • \s* zero or more whitespace characters.

  • (?:–|-|[Tt][Oo])\s*

    • (?:–|-|[Tt][Oo]) match –, -, TO, to, To or tO.
    • \s* zero or more whitespace characters.
  • \(? an optional (.

  • (?:[A-Za-z]{3,9}\s*)? explained above.

  • (?:\d\d?/){0,2} explained above.

  • [12]\d{3} explained above.

  • \)? an optional ).

  • |(...) a second alternative in a capturing group: 3 to 9 letters, a dash (– or -), 3 to 9 letters, then a four-digit year, with optional whitespace around each part.

See regex demo
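
A minimal sketch of applying the pattern in Python, assuming only the first alternative (the part before the |) is needed. That part has no capturing groups, so re.findall returns each whole matched range; with the full pattern, findall would return the contents of the single capturing group instead, and the second alternative appears to also match label text such as "Duration – June 2021" in this sample. The names DATE_RANGE and find_date_ranges and the shortened sample string are made up for illustration.

```python
import re

# First alternative of the pattern only (an assumption of this sketch).
# Every group is non-capturing, so findall() returns the full matched text.
DATE_RANGE = re.compile(
    r"\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?"  # start: optional word or NN/ prefix, then a year
    r"\s*(?:–|-|[Tt][Oo])\s*"                                 # separator: en dash, hyphen, or "to"
    r"\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?"    # end: same shape as the start
)


def find_date_ranges(text):
    """Return every date range found in text, in order of appearance."""
    return DATE_RANGE.findall(text)


# Shortened version of the sample text from the question (hypothetical input).
sample = (
    "Sales Assistant @ DFS Duration – June 2021 – 2023 Currently working in XYZ. "
    "Plasterer @ XX Duration – September 2016 – Nov 2016 "
    "Customer Advisor @ AA Duration – (2015 – 2016) "
    "Warehouse Operative @ xyz Duration – 03/2014 to 08/2015 "
    "Airport Terminal Assistant @ port Duration – 01/2012 – 06/2013 "
    "Apprentice Floorer @ YY Floors Duration – DEC 2010 – APRIL 2012"
)

for date_range in find_date_ranges(sample):
    print(date_range)
```

For a DataFrame, something like df['Dates'] = df[column].apply(lambda t: ', '.join(find_date_ranges(t))) should give one comma-separated string per row. Unlike str.extract, which keeps only the first match and returns one column per capturing group, this collects every range in the text.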

Answered By: SaSkY