How to count the number of unique values per group over the last n days
Question:
I have the pandas dataframe below:
groupId | date | value
---|---|---
1 | 2023-01-01 | A
1 | 2023-01-05 | B
1 | 2023-01-17 | C
2 | 2023-01-01 | A
2 | 2023-01-20 | B
3 | 2023-01-01 | A
3 | 2023-01-10 | B
3 | 2023-01-12 | C
I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to the date of each row.
What I would like as a result is something like this:
groupId | date | value | newColumn
---|---|---|---
1 | 2023-01-01 | A | 1
1 | 2023-01-05 | B | 2
1 | 2023-01-17 | C | 2
2 | 2023-01-01 | A | 1
2 | 2023-01-20 | B | 1
3 | 2023-01-01 | A | 1
3 | 2023-01-10 | B | 2
3 | 2023-01-12 | C | 3
I did try groupby(...).rolling('14d').nunique(), and while rolling works on numeric fields (count, mean, etc.), it doesn't work with nunique on string/object fields, so I can't count the number of unique string values this way.
You can use the code below to generate the dataframe.
import pandas as pd

df = pd.DataFrame(
    {
        'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
        'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'],  # YYYY-MM-DD
        'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
        'newColumn': [1, 2, 2, 1, 1, 1, 2, 3],  # the expected result
    }
)
Do you have an idea of how to solve this, even without using the rolling function? That'd be much appreciated!
Answers:
Instead of nunique, you can use count here; it gives the same result on this data because no value repeats within a group's 14-day window:
>>> df['date'] = pd.to_datetime(df['date'])  # rolling('14D') requires a datetime column
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
     .astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveat: it can be tricky to merge this output back into your original dataframe unless (groupId, date) is a unique combination.
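When (groupId, date) is unique, as in the example, merging the rolling output back is an ordinary two-key merge. A minimal sketch using the question's data (the merge step is the only addition to the answer's code):

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17',
                            '2023-01-01', '2023-01-20',
                            '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
})

# Per-row count over a trailing 14-day window, as above
out = (df.groupby('groupId').rolling('14D', on='date')['value'].count()
         .astype(int).rename('newColumn').reset_index())

# Safe only because (groupId, date) uniquely identifies a row here
merged = df.merge(out, on=['groupId', 'date'], how='left')
```

With duplicate (groupId, date) pairs this merge would multiply rows, which is exactly why the index-based variant is worth having.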
Update
If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert the value column to a numeric column, so that nunique can be applied inside the rolling window:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Another possible solution, which does not use rolling. Note that it counts non-duplicated dates relative to the group's first date rather than a true per-row sliding window, so while it matches the example data, verify it on yours:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum()+1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
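If you need a true per-row nunique (e.g. when a value can repeat inside a window, so count would over-count), a straightforward if quadratic fallback is to evaluate each row's window explicitly. A minimal sketch, assuming the dataframe from the question; this is not from the answers above, just a plain masked lookup per row:

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17',
                            '2023-01-01', '2023-01-20',
                            '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
})

counts = {}
for _, g in df.groupby('groupId'):
    for idx, d in g['date'].items():
        # window (d - 14 days, d], matching rolling's default closed='right'
        in_window = (g['date'] > d - pd.Timedelta(days=14)) & (g['date'] <= d)
        counts[idx] = g.loc[in_window, 'value'].nunique()

df['newColumn'] = pd.Series(counts)  # dict keys are the original index labels
```

This is O(n²) per group, but it works directly on string values, handles duplicates correctly, and needs no factorize or index tricks.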