Pandas – Number of Months Between Two Dates
Question:
I think this should be simple, but what I’ve seen are techniques that involve iterating over a dataframe’s date fields to determine the difference between two dates, and I’m having trouble with it. I’m familiar with MSSQL DATEDIFF, so I thought Pandas datetime would have something similar. Perhaps it does and I’m missing it.
Is there a Pandonic way of determining the number of months as an integer between two dates (datetime) without the need to iterate? Keep in mind that there are potentially millions of rows, so performance is a consideration.
The dates are datetime objects and the result would look like this, with Months being the new column:
Date1 Date2 Months
2016-04-07 2017-02-01 11
2017-02-01 2017-03-05 1
Answers:
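Here is a very simple answer, my friend:
df['nb_months'] = (df.Date2 - df.Date1) / np.timedelta64(1, 'M')
and now:
df['nb_months'] = df['nb_months'].astype(int)
Note that newer pandas versions may reject the 'M' unit here, because a month is not a fixed-length duration.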
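To make clearer what this kind of division computes, here is a minimal sketch that divides by an explicit average month length instead of the 'M' unit. The constant (365.2425 / 12 days) and the sample frame built from the question's data are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05"]),
})

# Assumption: one "month" is the average Gregorian month, 365.2425 / 12 days.
AVG_MONTH = pd.Timedelta(days=365.2425 / 12)

# Timedelta / Timedelta yields a float number of average months per row.
fractional = (df["Date2"] - df["Date1"]) / AVG_MONTH

# Truncate toward zero, as astype(int) does in the answer above.
df["nb_months"] = fractional.astype(int)
print(df)
```

The first row spans 300 days, which is roughly 9.86 average months, so truncation gives 9 rather than the 10 calendar months between April 2016 and February 2017.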
df.assign(
Months=
(df.Date2.dt.year - df.Date1.dt.year) * 12 +
(df.Date2.dt.month - df.Date1.dt.month)
)
Date1 Date2 Months
0 2016-04-07 2017-02-01 10
1 2017-02-01 2017-03-05 1
An alternative, possibly more elegant solution, which avoids rounding errors, is:
df.Date2.dt.to_period('M') - df.Date1.dt.to_period('M')
There are two notions of difference in time, which are both correct in a certain sense. Let us compare the difference in months between July 31 and September 01:
import numpy as np
import pandas as pd
dtr = pd.date_range(start="2016-07-31", end="2016-09-01", freq="D")
delta1 = int((dtr[-1] - dtr[0])/np.timedelta64(1,'M'))
delta2 = (dtr[-1].to_period('M') - dtr[0].to_period('M')).n
print(delta1,delta2)
Using numpy’s timedelta, delta1=1, which is correct given that there is only one month in between, but delta2=2, which is also correct given that September is still two months away in July. In most cases, both will give the same answer, but one might be more correct than the other given the context.
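The two notions can also be sketched with plain integer arithmetic, without relying on a month-length unit. The helper names below are hypothetical, and the "completed months" rule (subtract one when the end day-of-month falls before the start day-of-month) is one possible definition, assuming each end date is not earlier than its start date.

```python
import pandas as pd

def month_boundaries_crossed(start, end):
    """Calendar-month difference: counts month boundaries between the dates."""
    return (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)

def completed_months(start, end):
    """Whole months elapsed, under the assumed day-of-month rule."""
    crossed = month_boundaries_crossed(start, end)
    # The last month is incomplete if the end day precedes the start day.
    return crossed - (end.dt.day < start.dt.day).astype(int)

start = pd.Series(pd.to_datetime(["2016-07-31", "2016-04-07", "2017-02-01"]))
end = pd.Series(pd.to_datetime(["2016-09-01", "2017-02-01", "2017-03-05"]))

crossed = month_boundaries_crossed(start, end)
completed = completed_months(start, end)
print(crossed.tolist())    # [2, 10, 1]
print(completed.tolist())  # [1, 9, 1]
```

For July 31 to September 1 the first function gives 2 and the second gives 1, mirroring delta2 and delta1 above. Both stay vectorized, so no row-wise iteration is needed.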
Just a small addition to @pberkes's answer.
In case you want the answer as integer values and NOT as pandas._libs.tslibs.offsets.MonthEnd, just append .n to the above code:
(pd.to_datetime('today').to_period('M') - pd.to_datetime('2020-01-01').to_period('M')).n
# [Out]:
# 7
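The .n attribute belongs to a single offset object, so on a whole column it has to be read element by element. A minimal sketch of one way to do that, assuming the period subtraction returns a Series of offset objects and using a sample frame built from the question's data:

```python
from operator import attrgetter

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05"]),
})

# Subtracting monthly periods gives a Series of offset objects,
# e.g. <10 * MonthEnds>, rather than plain integers.
offsets = df["Date2"].dt.to_period("M") - df["Date1"].dt.to_period("M")

# Read the integer month count from each offset via its .n attribute.
df["Months"] = offsets.apply(attrgetter("n"))
print(df)
```

Because apply runs a Python-level call per row, this is likely to be slower on millions of rows than a purely arithmetic approach.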
This works with pandas 1.1.1:
df['Months'] = df['Date2'].dt.to_period('M').astype(int) - df['Date1'].dt.to_period('M').astype(int)
df
# Out[11]:
# Date1 Date2 Months
# 0 2016-04-07 2017-02-01 10
# 1 2017-02-01 2017-03-05 1
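As a hedged illustration of why the cast works: converting a monthly period to an integer appears to yield its ordinal, a running month count, so subtracting two ordinals gives the month difference directly. The sketch below spells the dtype out as "int64" and assumes there are no missing dates (NaT) in either column; the sample frame is built from the question's data.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "Date1": pd.to_datetime(["2016-04-07", "2017-02-01"]),
    "Date2": pd.to_datetime(["2017-02-01", "2017-03-05"]),
})

# Assumption: casting a monthly period Series to int64 returns each
# period's ordinal (a consecutive integer per calendar month).
start_ord = df["Date1"].dt.to_period("M").astype("int64")
end_ord = df["Date2"].dt.to_period("M").astype("int64")

# Ordinals of adjacent months differ by one, so this is the month count.
df["Months"] = end_ord - start_ord
print(df)
```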