Make a list of active subscriptions from a pandas database

Question:

I’m trying to create a list of active subscriptions in pandas, but all of my solutions seem too convoluted and most likely very slow, so I was wondering if there is an easier and/or better way to make it do what I want it to do.

My data looks something like this:

User Subscription Start Subscription Length Subscription End
John 2023-01-01 30 2023-01-31
John 2023-01-15 30 2023-02-14
Jack 2023-02-01 120 2023-06-01
Jack 2022-08-01 30 2023-08-31
Jeff 2023-01-01 30 2023-01-31

Users can buy multiple subscriptions and they add together, and users can also let their subscriptions stop and buy it at a later date. Ideally what I would like the output to be is to have one line for each ‘subscription instant’ for each user, so it would look something like this:

User Subscription Start Subscription Length Subscription End
John 2023-01-01 60 2023-03-02
Jack 2023-02-01 120 2023-06-01
Jack 2022-08-01 30 2023-08-31
Jeff 2023-01-01 30 2023-01-31

I tried to this with a for loop, where if a user has not had a subscription then they would be added to the list and if they did have one before it would check if the new subscription’s start date falls within any of the subscription ranges already in the list and then adds on top of one of them, but I had way too many for and if statements, and I’m sure there is a more elegant way to do it.

Asked By: Zera

||

Answers:

So after some thinking, I managed to come up with this, it works good enough:

li = df.iloc[0:0]
for x in df.index:
    if df.iloc[x,0] not in li.User.unique():
        li = li.append(df.iloc[x])
    else:
        if len(li[(li.User == df.iloc[x,0]) & (li.end_date >= df.iloc[x,2]) & (li.start_date <= df.iloc[x,2])]) > 0:
            li.loc[(li.clientid == df.iloc[x,0]) & (li.end_date >= df.iloc[x,2]) & (li.start_date <= df.iloc[x,2]),'sub_length'] += df.iloc[x,3]
        else:
            li = li.append(df.iloc[x])

I’ve found that this only works perfectly if the data is in chronological order. I’ve also made a for loop to check through the list for any errors:

li = li.reset_index(drop=True)
log2 = pd.DataFrame(columns=['Outcome','value'])
log2 = log2.iloc[0:0]
li2 = li.iloc[0:0]
for x in li.index:
        if len(li[(li.User== li.iloc[x,0]) & (li.end_date >= li.iloc[x,2]) & (li.start_date <= li.iloc[x,2])]) > 1:
            li2 = li2.append(li.iloc[x])
            log2 = log2.append({'Outcome':'Bad','value':1},ignore_index=True)
        else:
            log2 = log2.append({'Outcome':'Good','value':1},ignore_index=True)
log2.groupby('Outcome').count()

For this exact code to work the columns should be (in order): User platform start_date sub_length end_date

Answered By: Zera
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.