How to make a retention calculation in pandas more efficient?

Question

I am trying to calculate 7-day retention (did the user come back WITHIN 7 days?) per user_id. Currently, I am using this code:

df_retention['seven_day_retention']=df_retention.groupby('user_id')['date'].transform(lambda x: ((x.shift(-1) - x).dt.days< 8).astype(int))

On 10M rows this takes hours, which is not feasible. Is there a faster way that works within Databricks?

Asked By: titutubs

Answer 1

Your code is very slow, and I think you need to change your approach. First, sort your dataframe by user id and date. Then each row only has to be compared with the row after it, which a simple for loop can do in a single O(n) pass. If you want something faster, keep the expression from your sample code for the second step but drop the groupby and transform: compute the difference between each row and the next row over the whole sorted column, and count it only when both rows belong to the same user.
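
A minimal sketch of that vectorized variant, assuming the columns are named user_id and date as in the question and that date holds calendar dates in a datetime64 column. The helper name add_seven_day_retention and the example rows are made up for illustration.

```python
import pandas as pd


def add_seven_day_retention(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows whose user has another row within the following 7 days."""
    # Sort once so each user's rows are contiguous and in date order.
    out = df.sort_values(["user_id", "date"]).copy()
    # Shift both columns up by one row; these are whole-column operations,
    # so no Python-level loop or per-group lambda is involved.
    next_user = out["user_id"].shift(-1)
    next_date = out["date"].shift(-1)
    # The next row only counts if it belongs to the same user.
    same_user = next_user == out["user_id"]
    within_week = (next_date - out["date"]) <= pd.Timedelta(days=7)
    out["seven_day_retention"] = (same_user & within_week).astype(int)
    return out


# Hypothetical example rows: user 1 returns after 4 days, then after 27 days.
example = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-05", "2022-02-01", "2022-01-01", "2022-01-20"]
        ),
    }
)
result = add_seven_day_retention(example)
print(result)
```

Only the first row of user 1 is flagged here; the last row of each user is never flagged, because there is no later visit to compare against.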

Answered By: Alireza

Answer 2

I tested this and it seems way faster than your approach. Your approach scales really terribly with the number of users. I guess the groupby + the lambda is a particularly bad combo here.

As @Confused Learner said, you should use built-in pandas methods, since they are implemented in C, and avoid lambdas, which run as ordinary Python.

import datetime
import random

import pandas as pd


# some synthetic data
k = int(1e3)
user_ids = random.choices(population=range(k), k=k)
months = random.choices(population=range(1, 13), k=k)  # months 1..12
days = random.choices(population=range(1, 29), k=k)  # days 1..28, valid in every month

# our synthetic dataframe
df_retention = pd.DataFrame(
    [   
        [user_id, datetime.datetime(2022, month, day)]
        for user_id, month, day in zip(user_ids, months, days)
    ],
    columns=["user_id", "date"]
)

df_retention.sort_values(by=["user_id", "date"], inplace=True)  # sort by user, then date

df_diff = df_retention[["user_id", "date"]].diff()  # take the difference of all the rows
retained = (df_diff["date"] <= datetime.timedelta(days=7)) & (df_diff["user_id"] == 0)  # True if diff is <= 7 days & it is the same user

# shift the results up one row so each row reflects the gap to the user's next visit;
# the last row is padded with False, since we don't know if they ever returned
retained = retained.shift(-1, fill_value=False)
df_retention['seven_day_retention'] = retained

Here’s a sample of the output if you force user_id=0 and k=10:

   user_id       date  seven_day_retention
4        0 2022-01-02                 True
2        0 2022-01-08                False
9        0 2022-02-14                False
0        0 2022-03-06                False
1        0 2022-04-21                False
6        0 2022-05-23                 True
3        0 2022-05-25                False
5        0 2022-07-21                False
7        0 2022-08-06                False
8        0 2022-10-12                False
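
As a sanity check, the diff-based result can be compared with the groupby + lambda version from the question on a small random sample. This is only a sketch: the seed, the sample size and the number of users are arbitrary, and the dates carry no time of day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 2_000
df = pd.DataFrame(
    {
        "user_id": rng.integers(0, 200, size=n),
        "date": pd.Timestamp("2022-01-01")
        + pd.to_timedelta(rng.integers(0, 330, size=n), unit="D"),
    }
)
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# Approach from the question: one Python lambda call per user.
slow = df.groupby("user_id")["date"].transform(
    lambda x: ((x.shift(-1) - x).dt.days < 8).astype(int)
)

# Approach from this answer: whole-column diff, then shift up by one row.
diff = df[["user_id", "date"]].diff()
retained = (diff["date"] <= pd.Timedelta(days=7)) & (diff["user_id"] == 0)
fast = retained.shift(-1, fill_value=False).astype(int)

matches = bool((slow == fast).all())
print(matches)
```

The astype(int) at the end gives the same 0/1 column as the code in the question instead of booleans.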

Answered By: ringo
