Split period into two dates when the date has the same delimiter
Question:
Goal: derive period start and period end from the column period, in the form of
dd.mm.yyyy – dd.mm.yyyy
period
28-02-2022 - 30.09.2022
31.01.2022 - 31.12.2022
28.02.2019 - 30-04-2020
20.01.2019-22.02.2020
19.03.2020- 24.05.2021
13.09.2022-12-10.2022
df[['period_start,'period_end]]= df['period'].str.split('-',expand=True)
will not work.
Expected output
period_start period_end
31.02.2022 30.09.2022
31.01.2022 31.12.2022
28.02.2019 30.04.2020
20.01.2019 22.02.2020
19.03.2020 24.05.2021
13.09.2022 12.10.2022
Answers:
We can use str.extract
here for one option:
df[["period_start", "period_end"]] = df["period"].str.extract(r'(S+)s*-s*(S+)')
.str.replace(r'-', '.')
Use a regex to split on the dash with surrounding spaces:
out = (df['period'].str.split(r's+-s+',expand=True)
.set_axis(['period_start', 'period_end'], axis=1)
)
or to remove the column and create new ones:
df[['period_start', 'period_end']] = df.pop('period').str.split(r's+-s+',expand=True)
output:
period_start period_end
0 31-02-2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30-04-2020
the problem is you were trying to split on dash, and there’s many dashes in the one row, this work :
df[['period_start','period_end']]= df['period'].str.split(' - ',expand=True)
because we split on space + dash
You can use pandas.Series.str.extract
to capture the two dates.
Try this :
out = (
df['period'].str.replace('s*', '', regex=True)
.str.extract('(d{2}.d{2}.d{4}).(d{2}.d{2}.d{4})',expand=True)
.apply(pd.to_datetime, dayfirst=True)
.apply(lambda x: x.dt.strftime('%d.%m.%Y'))
.rename(columns={0: 'period_start', 1: 'period_end'})
)
# Output :
print(out)
period_start period_end
0 28.02.2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30.04.2020
3 20.01.2019 22.02.2020
4 19.03.2020 24.05.2021
5 13.09.2022 12.10.2022
Goal: derive period start and period end from the column period, in the form of
dd.mm.yyyy – dd.mm.yyyy
period
28-02-2022 - 30.09.2022
31.01.2022 - 31.12.2022
28.02.2019 - 30-04-2020
20.01.2019-22.02.2020
19.03.2020- 24.05.2021
13.09.2022-12-10.2022
df[['period_start,'period_end]]= df['period'].str.split('-',expand=True)
will not work.
Expected output
period_start period_end
31.02.2022 30.09.2022
31.01.2022 31.12.2022
28.02.2019 30.04.2020
20.01.2019 22.02.2020
19.03.2020 24.05.2021
13.09.2022 12.10.2022
We can use str.extract
here for one option:
df[["period_start", "period_end"]] = df["period"].str.extract(r'(S+)s*-s*(S+)')
.str.replace(r'-', '.')
Use a regex to split on the dash with surrounding spaces:
out = (df['period'].str.split(r's+-s+',expand=True)
.set_axis(['period_start', 'period_end'], axis=1)
)
or to remove the column and create new ones:
df[['period_start', 'period_end']] = df.pop('period').str.split(r's+-s+',expand=True)
output:
period_start period_end
0 31-02-2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30-04-2020
the problem is you were trying to split on dash, and there’s many dashes in the one row, this work :
df[['period_start','period_end']]= df['period'].str.split(' - ',expand=True)
because we split on space + dash
You can use pandas.Series.str.extract
to capture the two dates.
Try this :
out = (
df['period'].str.replace('s*', '', regex=True)
.str.extract('(d{2}.d{2}.d{4}).(d{2}.d{2}.d{4})',expand=True)
.apply(pd.to_datetime, dayfirst=True)
.apply(lambda x: x.dt.strftime('%d.%m.%Y'))
.rename(columns={0: 'period_start', 1: 'period_end'})
)
# Output :
print(out)
period_start period_end
0 28.02.2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30.04.2020
3 20.01.2019 22.02.2020
4 19.03.2020 24.05.2021
5 13.09.2022 12.10.2022