PerformanceWarning when adding months to datetime values in Pandas DataFrame using apply() and DateOffset

Question:

I have a Pandas DataFrame with a "date" column containing datetime values, and a "months" column containing integer values representing the number of months to add to each date. I want to calculate a new date for each row that is the original "date" plus the specified number of months in the months column.

I’m using the following code:

df['date'] + df["months"].apply(lambda y: pd.DateOffset(months=y))

This code seems to work correctly, but I’m getting a PerformanceWarning that says:

PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized
  warnings.warn

I need to do this process with multiple date and months columns in my dataset and is taking a long time to perform this task.

Sample of input:

date months
0 2020-01-01 3
1 2021-06-15 6
2 2022-02-28 1

Sample of ouput:

date months new_date
0 2020-01-01 3 2020-04-01
1 2021-06-15 6 2021-12-15
2 2022-02-28 1 2022-03-28

What does this warning mean, and how can I improve my code to avoid it? I’m concerned that this may be affecting the performance of my code, and I want to make sure that it’s as efficient as possible.

Asked By: Cristian Jara

||

Answers:

You can remove the performance warning by using .dt.date on the date column.

For faster performance you can use dateutil.relativedelta:

from dateutil.relativedelta import relativedelta

df['new_date'] = df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))
print(df)

Prints (without performance warning):

        date  months    new_date
0 2020-01-01       3  2020-04-01
1 2021-06-15       6  2021-12-15
2 2022-02-28       1  2022-03-28

Quick benchmark:

from timeit import timeit
from dateutil.relativedelta import relativedelta


def fn1(df):
    return df['date'].dt.date + df["months"].apply(lambda y: pd.DateOffset(months=y))

def fn2(df):
    return df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))

# create bigger dataframe
df = pd.concat([df]*100).reset_index(drop=True)

t1 = timeit('fn1(x)', setup='x = df.copy()', number=100, globals=globals())
t2 = timeit('fn2(x)', setup='x = df.copy()', number=100, globals=globals())

print(t1)
print(t2)

Prints:

0.6759202829562128
0.2161861159838736
Answered By: Andrej Kesely