Adding column to a Pandas df checking whether date range ever falls on a given month in any year
Question:
We have a dataframe of entries, and we want to know which entries have ever existed within a given month of any year. A simplified example:
import pandas as pd
import datetime as dt
df = pd.DataFrame(
    {
        "start": [dt.datetime(2020,1,1), dt.datetime(2020,8,1), dt.datetime(2020,8,1)],
        "finish": [dt.datetime(2021,12,1), dt.datetime(2021,6,1), dt.datetime(2022,6,1)],
    })
How can we add a column indicating which entries existed during any July of any year? If we were only concerned with July 2020, we could write:
df['existed_in_july_2020'] = (df['start'] < dt.datetime(2020,7,1)) & (df['finish'] >= dt.datetime(2020,8,1))
but this doesn’t cover other years, and the third entry existed in July 2021.
For this example df, the existed_in_july column would be:
df = pd.DataFrame(
    {
        "start": [dt.datetime(2020,1,1), dt.datetime(2020,8,1), dt.datetime(2020,8,1)],
        "finish": [dt.datetime(2021,12,1), dt.datetime(2021,6,1), dt.datetime(2022,6,1)],
        "existed_in_july": [True, False, True]
    })
How can we create this column?
Answers:
One option that should work is to check whether the July of either the start year or the finish year falls between the two dates, or whether more than one year elapsed between them:
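# DateOffset(month=7) replaces the month with July, keeping the year and day
m1 = df['start'].add(pd.DateOffset(month=7)).between(df['start'], df['finish'])
m2 = df['finish'].add(pd.DateOffset(month=7)).between(df['start'], df['finish'])
# a span longer than a year always includes a July; recent pandas versions reject the ambiguous '1Y' unit
m3 = df['finish'].sub(df['start']).gt(pd.Timedelta(days=365))
df['existed_in_july'] = m1|m2|m3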
Output:
       start     finish  existed_in_july
0 2020-01-01 2021-12-01             True
1 2020-08-01 2021-06-01            False
2 2020-08-01 2022-06-01             True
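To make the pd.DateOffset(month=7) step concrete: a singular keyword such as month=7 replaces the month of each timestamp, whereas the plural months=7 would shift it by seven months. A minimal sketch on made-up dates (the series s is illustrative only and not part of the original answer):

```python
import pandas as pd

# Two sample timestamps, chosen only for illustration.
s = pd.Series(pd.to_datetime(["2020-01-15", "2020-08-01"]))

# month=7 (singular) sets the month to July and keeps the year and day.
july = s + pd.DateOffset(month=7)

print(july.dt.strftime("%Y-%m-%d").tolist())
```

Because the day of month is preserved, m1 and m2 test one specific July date per row rather than the whole month, which is why the answer checks both the start year and the finish year.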
Alternatively, you can build a range of monthly periods for each row and test whether any of them is July, using a list comprehension:
# freq='M' is the monthly period alias (the lowercase 'm' spelling is deprecated in recent pandas)
df['existed_in_july'] = [(pd.period_range(a, b, freq='M').month == 7).any()
                         for a, b in zip(df['start'], df['finish'])]
print(df)
       start     finish  existed_in_july
0 2020-01-01 2021-12-01             True
1 2020-08-01 2021-06-01            False
2 2020-08-01 2022-06-01             True
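Since the question asks about a given month in general, here is a rough vectorized sketch of the same month-level test that avoids building a period range per row. It is not taken from either answer: the helper name existed_in_month and its parameters are hypothetical, and it assumes that touching any day of the month counts, with both endpoints inclusive (the same semantics as the period range approach).

```python
import datetime as dt
import pandas as pd

def existed_in_month(start, finish, month):
    """Hypothetical helper: True where [start, finish] touches `month` (1-12) in any year."""
    # Absolute month index, so that consecutive calendar months differ by 1.
    s = start.dt.year * 12 + start.dt.month - 1
    f = finish.dt.year * 12 + finish.dt.month - 1
    # Number of months from the start month until the target month next occurs (0-11).
    wait = (month - 1 - s) % 12
    # The target month is reached if that occurrence is not past the finish month.
    return (s + wait) <= f

df = pd.DataFrame(
    {
        "start": [dt.datetime(2020, 1, 1), dt.datetime(2020, 8, 1), dt.datetime(2020, 8, 1)],
        "finish": [dt.datetime(2021, 12, 1), dt.datetime(2021, 6, 1), dt.datetime(2022, 6, 1)],
    }
)
df["existed_in_july"] = existed_in_month(df["start"], df["finish"], 7)
print(df)
```

On the example data this gives True, False, True, matching the expected column. Passing a different month number checks that month instead.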