Get the difference between two dates using np.busday_count in pandas
Question:
Let's say I have a dataframe like this:
       date_1      date_2
0  2022-08-01  2022-08-05
1  2022-08-20         NaN
2         NaN         NaN
I want to add another column that gives the difference in business days, so that the dataframe looks like this (if date_2 is empty, date_1 is compared to today's date, 2022-08-28):
       date_1      date_2   diff
0  2022-08-01  2022-08-05      4
1  2022-08-20         NaN      5
2         NaN         NaN  Empty
I tried to use this one:
df["diff"] = df.apply(
    lambda x: np.busday_count(x.date_1, x.date_2) if (x.date_1 != '' and x.date_2 != '') else (np.busday_count(x.date_1, np.datetime64('today')) if (x.date_1 != '' and x.date_2 == '') else ''), axis=1)
but I'm getting this error:
Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
Any idea how to get the desired dataframe?
Answers:
I want to have another column which tells the difference in days
If you just want days:
df.assign(
    diff=lambda df: (
        pd.to_datetime(df["date_2"]).fillna(pd.Timestamp.now())
        - pd.to_datetime(df["date_1"])
    ).dt.days
)
which outputs
       date_1      date_2  diff
0  2022-08-01  2022-08-05   4.0
1  2022-08-20         NaN   8.0
2         NaN         NaN   NaN
EDIT: it was later clarified that they want business days, so please refer to the other answer instead
I think you just need to coerce the types. Also, it is better to avoid lambdas when you have more than one condition to check. The code below is runnable as-is, though the second diff value will change if you run it tomorrow 🙂
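import numpy as np
import pandas as pd


def busday_diff(x):
    # No start date: nothing to count, so leave the cell empty
    if pd.isna(x.date_1):
        return ""
    # Fall back to today's date when date_2 is missing
    date2_to_use = pd.Timestamp("today") if pd.isna(x.date_2) else x.date_2
    # np.busday_count needs day ("D") resolution, so coerce both dates explicitly
    return np.busday_count(np.datetime64(x.date_1, "D"), np.datetime64(date2_to_use, "D"))


df = pd.DataFrame(
    {"date_1": ["2022-08-01", "2022-08-20", np.nan], "date_2": ["2022-08-05", np.nan, np.nan]}
)
df["diff"] = df.apply(busday_diff, axis=1)

print(df)
#        date_1      date_2 diff
# 0  2022-08-01  2022-08-05    4
# 1  2022-08-20         NaT    5
# 2         NaT         NaT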
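For what it's worth, the error from the question can be reproduced with plain NumPy scalars, which may make the type coercion above easier to see. This is only a minimal sketch: it assumes the values in the original frame carried microsecond resolution (as the '<M8[us]' in the message suggests), and the names start_us, end_day, cast_failed and fixed are made up for the illustration.

```python
import numpy as np

# Assumption: a timestamp with microsecond resolution, like the one in the error message
start_us = np.datetime64("2022-08-01T00:00:00.000000")  # dtype datetime64[us]
end_day = np.datetime64("2022-08-05")                   # dtype datetime64[D]

# busday_count only accepts inputs it can cast to datetime64[D] under the "safe" rule,
# and dropping the time of day is not considered safe, so this call fails
try:
    np.busday_count(start_us, end_day)
    cast_failed = False
except TypeError as err:
    cast_failed = True
    print(err)

# An explicit cast to day resolution is allowed, and then the count works
fixed = np.busday_count(start_us.astype("datetime64[D]"), end_day)
print(fixed)  # 4 (Mon 1 Aug through Thu 4 Aug; the end date is excluded)
```

The explicit cast is the same thing np.datetime64(value, "D") does inside busday_diff.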
If you have to do more than a couple of these, you will probably want to vectorize it. pandas and NumPy are much faster when they work on whole arrays than when they go row by row through apply:
df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan, np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan, "2022-08-10"],
    }
)
calcable = df[~df.date_1.isnull()].fillna(pd.Timestamp("today").date())[["date_1", "date_2"]]
df["diff"] = pd.Series(
    np.busday_count(
        calcable.date_1.values.astype("datetime64[D]"),
        calcable.date_2.values.astype("datetime64[D]"),
    ),
    index=calcable.index,
)
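If you want the same idea as a reusable helper with reproducible output, something along these lines might work. It is a sketch rather than a definitive implementation: busday_diff_column and its today parameter are hypothetical names, the columns are parsed with pd.to_datetime first (a small variation on the snippet above), and the result uses the nullable Int64 dtype so that rows without a date_1 stay missing instead of turning the column into floats.

```python
import numpy as np
import pandas as pd


def busday_diff_column(df, today):
    """Business days from date_1 to date_2, using `today` when date_2 is missing."""
    start = pd.to_datetime(df["date_1"])
    end = pd.to_datetime(df["date_2"]).fillna(pd.Timestamp(today))
    valid = start.notna()  # rows without a start date cannot be counted

    # Cast the underlying NumPy arrays to day resolution before calling busday_count
    counts = np.busday_count(
        start[valid].to_numpy().astype("datetime64[D]"),
        end[valid].to_numpy().astype("datetime64[D]"),
    )
    # Put the counts back on the original index; uncounted rows become <NA>
    return pd.Series(counts, index=start[valid].index, dtype="Int64").reindex(df.index)


df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan, np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan, "2022-08-10"],
    }
)
out = busday_diff_column(df, today="2022-08-28")
print(out)
```

Passing the reference date in explicitly, instead of calling pd.Timestamp("today") inside, keeps the result the same no matter when the code is run.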
Interestingly, the cast to "D" resolution must be called on the underlying NumPy array (.values). Called on the Series itself, the result reverts to "ns" resolution, at least in the pandas version used here. This is probably the origin of the confusion behind this question, and it looks like a strange design decision on the part of pandas:
calcable.date_1.values.astype("datetime64[D]")
# array(['2022-08-01', '2022-08-20'], dtype='datetime64[D]')
calcable.date_1.astype("datetime64[D]").values
# array(['2022-08-01T00:00:00.000000000', '2022-08-20T00:00:00.000000000'],
#       dtype='datetime64[ns]')
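As far as I can tell, the Series-level cast does not behave the same across pandas versions, so a more portable way to check this is to go through the NumPy array only. A minimal sketch, with dates and as_days as made-up names:

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-08-01", "2022-08-20"]))

# The Series keeps a sub-day resolution; the NumPy array can be cast to days
as_days = dates.to_numpy().astype("datetime64[D]")
print(dates.dtype)
print(as_days.dtype)  # datetime64[D]

# With day resolution, busday_count accepts the values directly
n_days = np.busday_count(as_days[0], as_days[1])
print(n_days)  # 15 (three full working weeks between 1 Aug and 20 Aug 2022)
```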