Running Distinct Count by Group in Pandas
Question:
I have a dataframe 'df' with the following structure:
Input:
ID | Product | Price |
---|---|---|
1 | P1 | 10 |
2 | P1 | 11 |
3 | P2 | 12 |
4 | P2 | 12 |
5 | P2 | 15 |
Expected Output:
ID | Product | Price | Distinct_Running_Count |
---|---|---|---|
1 | P1 | 10 | 1 |
2 | P1 | 11 | 2 |
3 | P2 | 12 | 1 |
4 | P2 | 12 | 1 |
5 | P2 | 15 | 2 |
Problem:
I want to create a new column called 'Distinct_Running_Count' with the following logic:
- Perform a running distinct count of prices within each 'Product'
- Some products don't have any price change, so 'Distinct_Running_Count' stays at 1
- Every subsequent price change increments 'Distinct_Running_Count'
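For reference, the example frame above can be built directly (a minimal sketch using pandas):

```python
import pandas as pd

# Reconstruct the example input from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Product": ["P1", "P1", "P2", "P2", "P2"],
    "Price": [10, 11, 12, 12, 15],
})
print(df)
```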
Solutions Tried:
df['Distinct_Running_Count'] = df.groupby(['Product', 'Price']).cumcount() + 1
df['Distinct_Running_Count'] = df.groupby(['Product', 'Price']).transform('nunique')
Issue:
The solutions above produce either a running count within each (Product, Price) group or the total unique count per group, but not the running distinct count I expect.
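To illustrate the gap: on the sample data, `cumcount` numbers rows within each identical (Product, Price) pair, while `nunique` reports a static total per group (shown here on the `ID` column to keep the result one-dimensional):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Product": ["P1", "P1", "P2", "P2", "P2"],
    "Price": [10, 11, 12, 12, 15],
})

# Running count within each (Product, Price) pair -- restarts at every new price
cumcount = df.groupby(["Product", "Price"]).cumcount() + 1
print(cumcount.tolist())  # [1, 1, 1, 2, 1]

# Distinct IDs per (Product, Price) pair -- a static total, not a running count
nunique = df.groupby(["Product", "Price"])["ID"].transform("nunique")
print(nunique.tolist())  # [1, 1, 2, 2, 1]
```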
Answers:
You can compare each row with the previous row in the Price column (via `shift()`) and take the cumulative sum of the changes:
df['Distinct_Running_Count'] = (df.groupby(['Product'])['Price']
.transform(lambda col: col.ne(col.shift().fillna(col)).cumsum().add(1)))
print(df)
ID Product Price Distinct_Running_Count
0 1 P1 10 1
1 2 P1 11 2
2 3 P2 12 1
3 4 P2 12 1
4 5 P2 15 2
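The transform can be traced step by step for the P2 group to see why it works:

```python
import pandas as pd

prices = pd.Series([12, 12, 15])      # P2 prices in order
prev = prices.shift().fillna(prices)  # [12, 12, 12] -- first row compares to itself
changed = prices.ne(prev)             # [False, False, True]
running = changed.cumsum().add(1)     # cumulative count of changes, starting at 1
print(running.tolist())  # [1, 1, 2]
```

Filling the first `shift()` value with the row's own price guarantees the first comparison is `False`, so every group starts at 1 after the `.add(1)`.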
My answer uses a few steps.
First, get the unique rows (based on Product and Price).
Then, use cumcount() to create your desired column.
Finally, merge this dataframe with your original dataframe.
df_without_dup = df[~df[['Product', 'Price']].duplicated()][['Product', 'Price']]
df_without_dup['Distinct_Running_Count'] = df_without_dup.groupby(['Product']).cumcount() + 1
df = df.merge(df_without_dup, on=['Product', 'Price'], how='left')
df_without_dup =
Product Price Distinct_Running_Count
0 P1 10 1
1 P1 11 2
2 P2 12 1
4 P2 15 2
Output:
ID Product Price Distinct_Running_Count
0 1 P1 10 1
1 2 P1 11 2
2 3 P2 12 1
3 4 P2 12 1
4 5 P2 15 2
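One caveat worth adding (my observation, not part of the original answer): the merge keys on (Product, Price), so this approach assumes a product never returns to an earlier price. If it does, the merge reuses the first occurrence's count instead of incrementing:

```python
import pandas as pd

# Hypothetical input where P2 returns to a previous price
df = pd.DataFrame({
    "Product": ["P2", "P2", "P2"],
    "Price": [12, 15, 12],
})

df_without_dup = df[~df[["Product", "Price"]].duplicated()][["Product", "Price"]]
df_without_dup["Distinct_Running_Count"] = df_without_dup.groupby(["Product"]).cumcount() + 1
merged = df.merge(df_without_dup, on=["Product", "Price"], how="left")
print(merged["Distinct_Running_Count"].tolist())  # [1, 2, 1] -- a true running count would end in 3
```

The cumsum-based answer above does not have this limitation, since it counts every change rather than every distinct value.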