How can I count the number of times an entry is a repeat of the previous entry within a column while grouping by another column in Python?
Question:
Consider the following table for example:
import pandas as pd
data = {'Group':["AGroup", "AGroup", "AGroup", "AGroup", "BGroup", "BGroup", "BGroup", "BGroup", "CGroup", "CGroup", "CGroup", "CGroup"],
'Status':["Low", "Low", "High", "High", "High", "Low", "High", "Low", "Low", "Low", "High", "High"],
'CountByGroup':[1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2]}
pd.DataFrame(data)
This creates the following table:
Group Status CountByGroup
AGroup Low 1
AGroup Low 2
AGroup High 1
AGroup High 2
BGroup High 1
BGroup Low 1
BGroup High 1
BGroup Low 1
CGroup Low 1
CGroup Low 2
CGroup High 1
CGroup High 2
The CountByGroup column is what I am trying to create. Here you can see that "Low" appeared once so far for the "AGroup" in the first row, so it has an entry of 1. "Low" directly follows the same entry "Low" in the second row, so it has an entry of 2. If it were to appear a third time in a row in the third row, CountByGroup would display an entry of 3.
We’re also grouping by "Group", so the first entry for a new group is always 1, since it is the first time any entry has appeared in that group.
This was solved in a previous question I had using R which is available here, but I’m not sure how to solve this using Python.
Answers:
Assuming your dataframe is correctly sorted (or run df.sort_values('Group', kind='stable') first):
cols = ['Group', 'Status']
# start a new block whenever Group or Status changes, then count within blocks
new_block = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()
df['CountByGroup2'] = df.groupby(new_block).cumcount() + 1
print(df)
# Output
Group Status CountByGroup CountByGroup2
0 AGroup Low 1 1
1 AGroup Low 2 2
2 AGroup High 1 1
3 AGroup High 2 2
4 BGroup High 1 1
5 BGroup Low 1 1
6 BGroup High 1 1
7 BGroup Low 1 1
8 CGroup Low 1 1
9 CGroup Low 2 2
10 CGroup High 1 1
11 CGroup High 2 2
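As a quick sanity check on longer runs (the question asks for a 3 on a third consecutive repeat), here is a minimal sketch with hypothetical data containing a three-in-a-row run; the block-id variant keeps incrementing past 2:

```python
import pandas as pd

# hypothetical data with "Low" repeated three times in a row in AGroup
df = pd.DataFrame({'Group': ['AGroup', 'AGroup', 'AGroup', 'BGroup'],
                   'Status': ['Low', 'Low', 'Low', 'Low']})

cols = ['Group', 'Status']
# start a new block whenever Group or Status changes, then count within blocks
new_block = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()
df['CountByGroup2'] = df.groupby(new_block).cumcount() + 1
print(df['CountByGroup2'].tolist())  # [1, 2, 3, 1]
```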
Usually, you would use cumsum on the row-to-row differences to identify continuous blocks. Then you can group by Group and the block, and use cumcount:
# assuming data is sorted by Group as in the example
blocks = df['Status'].ne(df['Status'].shift()).cumsum()
df['CountByGroup'] = df.groupby(['Group', blocks]).cumcount() + 1
Note: if the data is not sorted by Group, you would need to sort before creating the blocks:
blocks = df['Status'].ne(df.sort_values('Group', kind='stable')['Status'].shift()).cumsum()
or a groupby:
blocks = df['Status'].ne(df.groupby('Group')['Status'].shift()).cumsum()
Output:
Group Status CountByGroup
0 AGroup Low 1
1 AGroup Low 2
2 AGroup High 1
3 AGroup High 2
4 BGroup High 1
5 BGroup Low 1
6 BGroup High 1
7 BGroup Low 1
8 CGroup Low 1
9 CGroup Low 2
10 CGroup High 1
11 CGroup High 2
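To see what the intermediate blocks series looks like, here is a trimmed, self-contained run over the first eight example rows. Note that block 2 spans the AGroup/BGroup boundary, which is exactly why Group must also be part of the groupby:

```python
import pandas as pd

# first eight rows of the question's example data
df = pd.DataFrame({'Group': ['AGroup'] * 4 + ['BGroup'] * 4,
                   'Status': ['Low', 'Low', 'High', 'High',
                              'High', 'Low', 'High', 'Low']})

# each run of identical consecutive Status values gets its own block id
blocks = df['Status'].ne(df['Status'].shift()).cumsum()
print(blocks.tolist())  # [1, 1, 2, 2, 2, 3, 4, 5]

# count rows within each (Group, block) pair, starting at 1
df['CountByGroup'] = df.groupby(['Group', blocks]).cumcount() + 1
print(df['CountByGroup'].tolist())  # [1, 2, 1, 2, 1, 1, 1, 1]
```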
import pandas as pd
data = {'Group':["AGroup", "AGroup", "AGroup", "AGroup", "BGroup", "BGroup", "BGroup", "BGroup", "CGroup", "CGroup", "CGroup", "CGroup"],
'Status':["Low", "Low", "High", "High", "High", "Low", "High", "Low", "Low", "Low", "High", "High"],
'CountByGroup':[1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2]}
df = pd.DataFrame(data)
# label each unique (Group, Status) pair with an integer id
s = df[['Group', 'Status']].groupby(['Group', 'Status']).ngroup()
# cumulative count per label: this counts every occurrence of a pair, not
# only consecutive repeats, so rows 6-7 below differ from the desired output
s = s.groupby(lambda x: s[x]).cumcount() + 1
print(pd.concat([df[['Group', 'Status']], s.rename("CountByGroup")], axis=1))
Group Status CountByGroup
0 AGroup Low 1
1 AGroup Low 2
2 AGroup High 1
3 AGroup High 2
4 BGroup High 1
5 BGroup Low 1
6 BGroup High 2
7 BGroup Low 2
8 CGroup Low 1
9 CGroup Low 2
10 CGroup High 1
11 CGroup High 2
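For reference, the two ngroup lines appear equivalent to a plain groupby-cumcount over the pair, which counts every occurrence rather than only consecutive repeats. A minimal sketch on just the BGroup rows shows where this diverges from the question's desired output:

```python
import pandas as pd

# the four BGroup rows from the question, where High/Low alternate
df = pd.DataFrame({'Group': ['BGroup'] * 4,
                   'Status': ['High', 'Low', 'High', 'Low']})

# cumulative count of each (Group, Status) pair over the whole frame
counts = df.groupby(['Group', 'Status']).cumcount() + 1
print(counts.tolist())  # [1, 1, 2, 2] -- the question wants [1, 1, 1, 1]
```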