How to count the number of unique values per group over the last n days
Question:
I have the pandas dataframe below:
groupId | date | value
---|---|---
1 | 2023-01-01 | A
1 | 2023-01-05 | B
1 | 2023-01-17 | C
2 | 2023-01-01 | A
2 | 2023-01-20 | B
3 | 2023-01-01 | A
3 | 2023-01-10 | B
3 | 2023-01-12 | C
I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to the date of each row.
What I would like as a result is something like this:
groupId | date | value | newColumn
---|---|---|---
1 | 2023-01-01 | A | 1
1 | 2023-01-05 | B | 2
1 | 2023-01-17 | C | 2
2 | 2023-01-01 | A | 1
2 | 2023-01-20 | B | 1
3 | 2023-01-01 | A | 1
3 | 2023-01-10 | B | 2
3 | 2023-01-12 | C | 3
I did try groupby(...).rolling('14d').nunique(), and while rolling works on numeric fields (count, mean, etc.), it doesn't work with nunique on string/object fields, so I can't count the number of unique string values this way.
You can use the code below to generate the dataframe.
import pandas as pd

df = pd.DataFrame(
    {
        'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
        'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'],  # YYYY-MM-DD
        'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
        'newColumn': [1, 2, 2, 1, 1, 1, 2, 3],  # the expected result
    }
)
Do you have an idea of how to solve this, even without using the rolling function? That'd be much appreciated!
Answers:
Instead of nunique, you can use count here; it gives the same result on this data because no value repeats within a group's 14-day window:
>>> df['date'] = pd.to_datetime(df['date'])  # rolling('14D') requires a datetime column
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
     .astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveat: it can be tricky to merge this output back into your original dataframe unless (groupId, date) is a unique combination.
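When (groupId, date) is unique, as in the example, merging the rolling output back is an ordinary two-key merge. A minimal sketch using the question's data (the merge step is the only addition to the answer's code):

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17',
                            '2023-01-01', '2023-01-20',
                            '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
})

# Per-row count over a trailing 14-day window, as above
out = (df.groupby('groupId').rolling('14D', on='date')['value'].count()
         .astype(int).rename('newColumn').reset_index())

# Safe only because (groupId, date) uniquely identifies a row here
merged = df.merge(out, on=['groupId', 'date'], how='left')
```

With duplicate (groupId, date) pairs this merge would multiply rows, which is exactly why the index-based variant is worth having.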
Update
If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert the value column to a numeric column, so that nunique can be applied inside the rolling window:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Another possible solution, which does not use rolling. Note that it counts non-duplicated dates relative to the group's first date rather than a true per-row sliding window, so while it matches the example data, verify it on yours:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum()+1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
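If you need a true per-row nunique (e.g. when a value can repeat inside a window, so count would over-count), a straightforward if quadratic fallback is to evaluate each row's window explicitly. A minimal sketch, assuming the dataframe from the question; this is not from the answers above, just a plain masked lookup per row:

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17',
                            '2023-01-01', '2023-01-20',
                            '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
})

counts = {}
for _, g in df.groupby('groupId'):
    for idx, d in g['date'].items():
        # window (d - 14 days, d], matching rolling's default closed='right'
        in_window = (g['date'] > d - pd.Timedelta(days=14)) & (g['date'] <= d)
        counts[idx] = g.loc[in_window, 'value'].nunique()

df['newColumn'] = pd.Series(counts)  # dict keys are the original index labels
```

This is O(n²) per group, but it works directly on string values, handles duplicates correctly, and needs no factorize or index tricks.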