Get the difference between two dates using np.busday_count in pandas

Question:

Let’s say I have a dataframe like this:

     date_1         date_2
0  2022-08-01     2022-08-05
1  2022-08-20        NaN
2    NaN             NaN

I want to add another column that gives the difference in business days, so that the dataframe looks like this (if date_2 is empty, date_1 is compared to today’s date, 2022-08-28):

     date_1         date_2          diff
0  2022-08-01     2022-08-05          4
1  2022-08-20        NaN              5
2    NaN             NaN              Empty

I tried to use this one:

df["diff"] = df.apply(
lambda x: np.busday_count(x.date_1, x.date_2) if (x.date_1 != '' and x.date_2 != '') else (np.busday_count(x.date_1, np.datetime64('today')) if (x.date_1 != '' and x.date_2 == '') else ''), axis=1)

but I’m getting this error:

Iterator operand 0 dtype could not be cast from dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'

Any idea how to get the desired dataframe?

Asked By: user14073111

||

Answers:

I want to have another column which tells the difference in days

If you just want days:

df.assign(
    diff=lambda df: (
        pd.to_datetime(df["date_2"]).fillna(pd.Timestamp.now())
        - pd.to_datetime(df["date_1"])
    ).dt.days
)

which outputs

       date_1      date_2  diff
0  2022-08-01  2022-08-05   4.0
1  2022-08-20         NaN   8.0
2         NaN         NaN   NaN

EDIT: it was later clarified that they want business days, so please refer to the other answer instead.

Answered By: ignoring_gravity

I think you just need to coerce the types. It is also better to avoid lambdas if you have more than one condition to check. The code below is runnable as-is, though the second diff value will change if you run it tomorrow 🙂

import numpy as np
import pandas as pd

def busday_diff(x):
    if pd.isna(x.date_1):
        return ""
    date2_to_use = pd.Timestamp("today") if pd.isna(x.date_2) else x.date_2
    return np.busday_count(np.datetime64(x.date_1, "D"), np.datetime64(date2_to_use, "D"))

df = pd.DataFrame(
    {"date_1": ["2022-08-01", "2022-08-20", np.nan], "date_2": ["2022-08-05", np.nan, np.nan]}
)

df["diff"] = df.apply(busday_diff, axis=1)

print(df)

#      date_1     date_2 diff
#0 2022-08-01 2022-08-05    4
#1 2022-08-20        NaT    5 
#2        NaT        NaT     
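
Because the second value depends on the day the code runs, a reproducible variant can take the reference date as an argument. This is only a sketch: the `today` parameter, the pinned date, and returning None for the empty case are my assumptions for illustration, not part of the function above.

```python
import numpy as np
import pandas as pd


def to_day(value):
    """Convert a date-like value to numpy datetime64 at day resolution."""
    return pd.Timestamp(value).to_datetime64().astype("datetime64[D]")


def busday_diff(row, today):
    """Business days from date_1 to date_2, falling back to `today`."""
    if pd.isna(row["date_1"]):
        return None  # no start date, so there is nothing to count
    end = today if pd.isna(row["date_2"]) else row["date_2"]
    # np.busday_count expects day resolution on both ends
    return int(np.busday_count(to_day(row["date_1"]), to_day(end)))


df = pd.DataFrame(
    {"date_1": ["2022-08-01", "2022-08-20", np.nan], "date_2": ["2022-08-05", np.nan, np.nan]}
)

# `today` is pinned to the date used in the question so the result is repeatable
df["diff"] = df.apply(busday_diff, axis=1, today=pd.Timestamp("2022-08-28"))
```

With the date pinned to 2022-08-28, the first two rows should match the 4 and 5 from the question, and the last row stays missing.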

If you have to do this for more than a handful of rows, you will probably want to vectorize it. pandas and NumPy are much faster on whole arrays than on row-by-row apply calls:

df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan, np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan, "2022-08-10"],
    }
)
calcable = df[~df.date_1.isnull()].fillna(pd.Timestamp("today").date())[["date_1", "date_2"]]

df["diff"] = pd.Series(
    np.busday_count(
        calcable.date_1.values.astype("datetime64[D]"),
        calcable.date_2.values.astype("datetime64[D]"),
    ),
    index=calcable.index,
)
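
A variation on the same vectorized idea is to parse both columns with pd.to_datetime first, so that the fill value and the columns share one dtype. Treat this as a sketch under assumptions: the pinned `today` value and the names `start`, `end`, and `valid` are mine, chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date_1": ["2022-08-01", "2022-08-20", np.nan, np.nan],
        "date_2": ["2022-08-05", np.nan, np.nan, "2022-08-10"],
    }
)

# Pinned for repeatability (assumption); pd.Timestamp("today") would be used in practice
today = pd.Timestamp("2022-08-28")

start = pd.to_datetime(df["date_1"])
end = pd.to_datetime(df["date_2"]).fillna(today)

# Only rows that have a start date can be counted
valid = start.notna()

# Cast the underlying NumPy arrays to day resolution before calling busday_count
counts = np.busday_count(
    start[valid].to_numpy().astype("datetime64[D]"),
    end[valid].to_numpy().astype("datetime64[D]"),
)

# Index alignment leaves rows without a start date as missing values
df["diff"] = pd.Series(counts, index=start[valid].index)
```

Rows without a date_1 end up as NaN in the diff column, because the assigned Series only carries the index labels of the valid rows.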

Interestingly, the cast to "D" resolution has to be applied to the underlying NumPy array (.values). Applied to the Series itself, the result reverted to "ns" resolution when this was written. That is probably the source of the confusion behind this question, and it seems a strange design decision on the part of pandas:

calcable.date_1.values.astype("datetime64[D]")

# array(['2022-08-01', '2022-08-20'], dtype='datetime64[D]')

calcable.date_1.astype("datetime64[D]").values

# array(['2022-08-01T00:00:00.000000000', '2022-08-20T00:00:00.000000000'],
#       dtype='datetime64[ns]')
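
To see why the day-resolution cast matters, the sketch below calls np.busday_count once on nanosecond-resolution arrays and once after casting them. As far as I can tell, the first call is rejected with the same kind of casting error as in the question, but the exact behaviour may depend on the NumPy version, so the code only records whether a TypeError occurred. The variable names are illustrative.

```python
import numpy as np

# Nanosecond-resolution arrays, like the ones pandas hands back
start_ns = np.array(["2022-08-01"], dtype="datetime64[ns]")
end_ns = np.array(["2022-08-05"], dtype="datetime64[ns]")

try:
    np.busday_count(start_ns, end_ns)
    raised = False
except TypeError:
    # busday_count only accepts inputs it can safely cast to datetime64[D]
    raised = True

# After an explicit cast to day resolution the call goes through
counts = np.busday_count(
    start_ns.astype("datetime64[D]"),
    end_ns.astype("datetime64[D]"),
)
```
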
Answered By: mmdanziger