How to make a retention calculation in pandas more efficient?
Question:
I am trying to calculate 7-day retention (did the user come back WITHIN 7 days?) on a per-user basis. Currently, I am using this code:
df_retention['seven_day_retention'] = df_retention.groupby('user_id')['date'].transform(lambda x: ((x.shift(-1) - x).dt.days < 8).astype(int))
This procedure takes hours across 10M rows, which is not feasible. Is there a better way that works within Databricks?
Answers:
Your code is very slow. I think you need to change your approach. First, sort your dataframe by user id and date. Then you can use a for loop to compare each row with the next row; the loop itself is O(n). If you want something faster, you can reuse the second part of your sample code (the shift and subtraction) without groupby and transform, and just calculate the difference between each row and the next row. Because there is no groupby any more, you also have to check that the next row belongs to the same user.
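A minimal sketch of the idea described above, assuming the columns are named user_id and date as in the question and that date is already a datetime column. The toy data and the helper names (next_date, same_user, gap_days) are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data; column names follow the question.
df = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-05", "2022-02-01", "2022-01-01", "2022-01-20"]
        ),
    }
)

# Sort once so that each user's visits are adjacent and in time order.
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# Compare every row with the next row in one vectorized pass.
next_date = df["date"].shift(-1)
same_user = df["user_id"].shift(-1) == df["user_id"]
gap_days = (next_date - df["date"]).dt.days

# Same threshold as the question: the next visit is fewer than 8 days later.
# The same_user mask keeps the last visit of one user from being compared
# with the first visit of the next user.
df["seven_day_retention"] = (same_user & (gap_days < 8)).astype(int)
print(df)
```

Only the first row is flagged here: user 1 returns after 4 days, while every other gap is longer than a week or crosses into a different user.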
I tested this and it seems way faster than your approach. Your approach scales really terribly with the number of users. I guess the groupby + the lambda is a particularly bad combo here.
Like @Confused Learner said, you need to use built-in pandas methods, since they are implemented in compiled code, and avoid lambdas, which run in the Python interpreter once per group.
import datetime
import random
import pandas as pd
# some synthetic data
k = int(1e3)
user_ids = random.choices(population=range(k), k=k)
months = random.choices(population=range(1, 13), k=k)  # range() excludes the stop value, so 13 includes December
days = random.choices(population=range(1, 29), k=k)  # days 1..28 are valid in every month
# our synthetic dataframe
df_retention = pd.DataFrame(
[
[user_id, datetime.datetime(2022, month, day)]
for user_id, month, day in zip(user_ids, months, days)
],
columns=["user_id", "date"]
)
df_retention.sort_values(by=["user_id", "date"], inplace=True) # sort by user, then date
df_diff = df_retention[["user_id", "date"]].diff() # take the difference of all the rows
retained = (df_diff["date"] <= datetime.timedelta(days=7)) & (df_diff["user_id"] == 0) # True if diff is <= 7 days & it is the same user
# shift the results up one row so each flag sits on the earlier visit;
# the last entry is padded with False, since we don't know if they ever returned
retained = retained.shift(-1, fill_value=False)
df_retention['seven_day_retention'] = retained
Here’s a sample of the output if you force user_id=0 and k=10:
   user_id       date  seven_day_retention
4        0 2022-01-02                 True
2        0 2022-01-08                False
9        0 2022-02-14                False
0        0 2022-03-06                False
1        0 2022-04-21                False
6        0 2022-05-23                 True
3        0 2022-05-25                False
5        0 2022-07-21                False
7        0 2022-08-06                False
8        0 2022-10-12                False
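One way to gain confidence in a rewrite like this is to compare it with the original groupby/transform on a small random sample before running it on the full 10M rows. The following is only a sketch: the function names retention_groupby and retention_vectorized, the seed, and the sample sizes are my own choices, not something from the thread.

```python
import numpy as np
import pandas as pd


def retention_groupby(df):
    # The question's per-user lambda, kept as the (slow) reference result.
    return df.groupby("user_id")["date"].transform(
        lambda x: ((x.shift(-1) - x).dt.days < 8).astype(int)
    )


def retention_vectorized(df):
    # Expects df to be sorted by user_id, then date.
    same_user = df["user_id"].shift(-1) == df["user_id"]
    gap = df["date"].shift(-1) - df["date"]
    # "fewer than 8 whole days" matches the question's `.dt.days < 8`.
    return (same_user & (gap < pd.Timedelta(days=8))).astype(int)


# Small synthetic sample: 2,000 visits spread over 200 users and one year.
rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame(
    {
        "user_id": rng.integers(0, 200, size=n),
        "date": pd.Timestamp("2022-01-01")
        + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    }
).sort_values(["user_id", "date"]).reset_index(drop=True)

expected = retention_groupby(df)
actual = retention_vectorized(df)
print("results match:", bool((expected.to_numpy() == actual.to_numpy()).all()))
```

If the date column carries times of day rather than whole dates, note that `<= 7 days` and `.dt.days < 8` are not the same test: a gap of 7 days and 3 hours passes the second but not the first, so pick whichever matches your definition of "within 7 days".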