How to count the number of unique values per group over the last n days

Question:

I have the pandas dataframe below:

groupId        date value
      1  2023-01-01     A
      1  2023-01-05     B
      1  2023-01-17     C
      2  2023-01-01     A
      2  2023-01-20     B
      3  2023-01-01     A
      3  2023-01-10     B
      3  2023-01-12     C

I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to each row's date.

What I would like as a result is something like this:

groupId        date value  newColumn
      1  2023-01-01     A          1
      1  2023-01-05     B          2
      1  2023-01-17     C          2
      2  2023-01-01     A          1
      2  2023-01-20     B          1
      3  2023-01-01     A          1
      3  2023-01-10     B          2
      3  2023-01-12     C          3

I did try groupby(...).rolling('14d').nunique(): the rolling function works on numeric fields (count, mean, etc.), but it doesn't work with nunique on string/object fields, so I can't count unique string values that way.
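
A minimal sketch of the failing attempts (assuming date has already been converted to datetime):

df.groupby('groupId').rolling('14d', on='date')['value'].nunique()
# fails: Rolling objects do not implement nunique

df.groupby('groupId').rolling('14d', on='date')['value'].apply(lambda x: x.nunique())
# also fails: rolling .apply only aggregates numeric dtypes, and value holds strings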

You can use the code below to generate the dataframe.

import pandas as pd

df = pd.DataFrame(
    {
        'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
        'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'],  # YYYY-MM-DD
        'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
        'newColumn': [1, 2, 2, 1, 1, 1, 2, 3],  # the expected result
    }
)
df['date'] = pd.to_datetime(df['date'])  # rolling('14D', on='date') needs a datetime column

Do you have an idea of how to solve this, even without using the rolling function? That would be much appreciated!

Asked By: confused_pandas


Answers:

Instead of nunique, you can also use count. It works here because each value occurs at most once per group, so the number of rows in a 14-day window equals the number of unique values:

>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
       .astype(int).rename('newColumn').reset_index())

   groupId       date  newColumn
0        1 2023-01-01          1
1        1 2023-01-05          2
2        1 2023-01-17          2
3        2 2023-01-01          1
4        2 2023-01-20          1
5        3 2023-01-01          1
6        3 2023-01-10          2
7        3 2023-01-12          3

Caveat: merging this output back into your original dataframe can be tricky unless (groupId, date) is a unique combination.
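
For that unique case, a minimal merge-back sketch (out and rollingCount are illustrative names, not part of the answer above):

out = (df.groupby('groupId').rolling('14D', on='date')['value'].count()
         .astype(int).rename('rollingCount').reset_index())
df = df.merge(out, on=['groupId', 'date'], how='left')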

Update

If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:

sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
        .agg({'value': 'count',   # rows (= unique values) in each window
              'index': 'max'})    # index of the window's last row, i.e. the current row
        .astype(int)
        .set_index('index')['value'])
df['newColumn'] = sr  # aligns on the original index
print(df)

# Output
   groupId       date value  newColumn
0        1 2023-01-01     A          1
1        1 2023-01-05     B          2
2        1 2023-01-17     C          2
3        2 2023-01-01     A          1
4        2 2023-01-20     B          1
5        3 2023-01-01     A          1
6        3 2023-01-10     B          2
7        3 2023-01-12     C          3
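
If the existing index is not numeric, the dummy-column variant mentioned above could look like this (row is an illustrative helper name; a sketch rather than tested code):

df['row'] = range(len(df))  # monotonically increasing helper column
sr = (df.groupby('groupId').rolling('14D', on='date')
        .agg({'value': 'count', 'row': 'max'}).astype(int)
        .set_index('row')['value'])
df['newColumn'] = sr.sort_index().to_numpy()  # back in original row order
df = df.drop(columns='row')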

Update 2

You can use pd.factorize to convert the value column to a numeric column, after which nunique via rolling apply works:

>>> (df.assign(value=pd.factorize(df['value'])[0])
       .groupby('groupId').rolling('14D', on='date')['value']
       .apply(lambda x: x.nunique())
       .astype(int).rename('newColumn').reset_index())

   groupId       date  newColumn
0        1 2023-01-01          1
1        1 2023-01-05          2
2        1 2023-01-17          2
3        2 2023-01-01          1
4        2 2023-01-20          1
5        3 2023-01-01          1
6        3 2023-01-10          2
7        3 2023-01-12          3
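
For clarity, pd.factorize maps each distinct string to an integer code, so counting unique codes is equivalent to counting unique strings. On the sample data:

codes, uniques = pd.factorize(df['value'])
print(codes)    # [0 1 2 0 1 0 1 2]
print(uniques)  # Index(['A', 'B', 'C'], dtype='object')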
Answered By: Corralien

Another possible solution, which does not use rolling:

df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
    lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum()+1)

Output:

   groupId       date value  new2
0        1 2023-01-01     A     1
1        1 2023-01-05     B     2
2        1 2023-01-17     C     2
3        2 2023-01-01     A     1
4        2 2023-01-20     B     1
5        3 2023-01-01     A     1
6        3 2023-01-10     B     2
7        3 2023-01-12     C     3
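
To unpack the chained expression, here is each step evaluated for groupId 3 (values follow from the sample dates; note the cumulative sum measures days elapsed since the group's first row):

g = df.loc[df['groupId'] == 3, 'date']  # 2023-01-01, 2023-01-10, 2023-01-12
s1 = g.diff().dt.days                   # NaN, 9.0, 2.0   (gap to the previous row)
s2 = s1.cumsum()                        # NaN, 9.0, 11.0  (days since the first row)
s3 = s2.le(14)                          # False, True, True
s4 = s3.mul(~g.duplicated())            # unchanged here: no duplicate dates
s5 = s4.cumsum() + 1                    # 1, 2, 3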
Answered By: PaulS