Parsing dates in different formats from text
Question:
I have a DataFrame whose raw text column contains dates in several different formats. I would like to extract these dates into a separate column.
Sample raw text:
"Sales Assistant @ DFS Duration – June 2021 – 2023 Currently
working in XYZ Within the role I am expected to achieve sales targets
which I currently have no problems reaching. Job Role/Establishment –
Plasterer @ XX Plasterer’s Duration – September 2016 – Nov 2016
Job Role/Establishment – Customer Advisor @ AA Duration – (2015 –
2016) Job Role/Establishment – Warehouse Operative @ xyz Duration –
03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment – Airport Terminal Assistant @ port Duration – 01/2012 – 06/2013
Working at the airport . Job Role/Establishment – Apprentice Floorer @
YY Floors Duration – DEC 2010 – APRIL 2012 "
Expected DataFrame:
id Raw_text Dates
01 "sample_raw_text" June 2021 - 2023 , September 2016 - Nov 2016,(2015 – 2016),03/2014 to 08/2015 , 01/2012 - 06/2013, DEC 2010 – APRIL 2012
I have tried the pattern below:
def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'
    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)
With the above I am unable to get the required output. Please advise what I am missing.
Answers:
Try this:
\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?|(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*)
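- \(? matches an optional "(".
- (?:[A-Za-z]{3,9}\s*)? is an optional non-capturing group (its first occurrence starts with \b, a word boundary):
  - [A-Za-z]{3,9} matches between 3 and 9 letters.
  - \s* matches zero or more whitespace characters.
  - The trailing ? makes the whole group optional.
- (?:\d\d?/){0,2} is a non-capturing group repeated zero to two times:
  - \d matches a digit between 0 and 9.
  - \d? matches an optional second digit between 0 and 9.
  - / matches a literal forward slash.
- [12]\d{3}\s* matches a four-digit year:
  - [12] matches one digit from the listed digits, 1 or 2.
  - \d{3} matches three digits between 0 and 9.
  - \s* matches zero or more whitespace characters.
- (?:–|-|[Tt][Oo])\s* matches the separator:
  - (?:–|-|[Tt][Oo]) matches –, -, TO, to, To or tO.
  - \s* matches zero or more whitespace characters.
- (?:[A-Za-z]{3,9}\s*)? is explained above.
- (?:\d\d?/){0,2} is explained above.
- [12]\d{3} is explained above.
- \)? matches an optional ")".
- The alternative after | matches 3 to 9 letters, a dash, 3 to 9 letters and a four-digit year. It is the only capturing group in the pattern.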
See regex demo
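To make this concrete, here is a minimal Python sketch that applies the pattern to an abridged copy of the sample text. Several things here are assumptions rather than part of the answer: the names DATE_RANGE and extract_date_ranges are hypothetical, only the first alternative of the pattern is used (on this sample the trailing alternative also appears to match label text such as "Duration – June 2021"), and the dashes in the data are assumed to be en dashes or plain hyphens.

```python
import re

# Adapted from the pattern above (first alternative only). It contains no
# capturing groups, so findall() returns whole matches rather than groups.
DATE_RANGE = re.compile(
    r"\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?"  # start of range
    r"\s*(?:–|-|[Tt][Oo])\s*"                                  # separator
    r"\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?/){0,2}[12]\d{3}\)?"    # end of range
)


def extract_date_ranges(text):
    """Return every date range found in text, in order of appearance."""
    return DATE_RANGE.findall(text)


# Abridged sample text from the question, joined into a single line.
sample = (
    "Sales Assistant @ DFS Duration – June 2021 – 2023 Currently "
    "working in XYZ Job Role/Establishment – Plasterer @ XX "
    "Duration – September 2016 – Nov 2016 "
    "Job Role/Establishment – Customer Advisor @ AA Duration – (2015 – 2016) "
    "Job Role/Establishment – Warehouse Operative @ xyz Duration – "
    "03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment – "
    "Airport Terminal Assistant @ port Duration – 01/2012 – 06/2013 "
    "Job Role/Establishment – Apprentice Floorer @ YY Floors "
    "Duration – DEC 2010 – APRIL 2012 "
)

dates = extract_date_ranges(sample)
print(", ".join(dates))
```

For a DataFrame column, the same pattern string could be passed to pandas' Series.str.findall and the resulting lists joined with Series.str.join(", "). Unlike str.extract, which returns only the first match and one column per capturing group, str.findall collects every match in the row.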