Identify non-unique rows, grouping any pairs
Question:
I am trying to figure out a non-looping way to identify (auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain 1 or more rows) within each TDateID, GroupID combination.
Here is an example DataFrame that looks like
Index
Cents
SD_YF
TDateID
GroupID
10
182.5
2.1
0
0
11
182.5
2.1
0
0
12
153.5
1.05
0
1
13
153.5
1.05
0
1
14
43
11
1
2
15
43
11
1
2
4
152
21
1
2
5
152
21
1
2
My ideal output would be:
Index
Cents
SD_YF
TDateID
GroupID
UniID
10
182.5
2.1
0
0
1
11
182.5
2.1
0
0
2
12
153.5
1.05
0
1
3
13
153.5
1.05
0
1
4
14
43
11
1
2
5
15
43
11
1
2
6
4
152
21
1
2
5
5
152
21
1
2
6
I have bolded #5 to draw attention to how index 14, 4 are paired together. Similar with #6. I hope that makes sense!
Answers:
IIUC you need to add the group number + the cumcount per duplicate + 1:
df['UniID'] = (df['GroupID']
+df.groupby('GroupID').ngroup().add(1)
+df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
)
output:
Index Cents SD_YF GroupID UniID
0 10 182.5 2.10 0 1
1 11 182.5 2.10 0 2
2 12 153.5 1.05 1 3
3 13 153.5 1.05 1 4
4 14 43.0 11.00 2 5
5 15 43.0 11.00 2 6
6 4 152.0 21.00 2 5
7 5 152.0 21.00 2 6
I am trying to figure out a non-looping way to identify (auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain 1 or more rows) within each TDateID, GroupID combination.
Here is an example DataFrame that looks like
Index | Cents | SD_YF | TDateID | GroupID |
---|---|---|---|---|
10 | 182.5 | 2.1 | 0 | 0 |
11 | 182.5 | 2.1 | 0 | 0 |
12 | 153.5 | 1.05 | 0 | 1 |
13 | 153.5 | 1.05 | 0 | 1 |
14 | 43 | 11 | 1 | 2 |
15 | 43 | 11 | 1 | 2 |
4 | 152 | 21 | 1 | 2 |
5 | 152 | 21 | 1 | 2 |
My ideal output would be:
Index | Cents | SD_YF | TDateID | GroupID | UniID |
---|---|---|---|---|---|
10 | 182.5 | 2.1 | 0 | 0 | 1 |
11 | 182.5 | 2.1 | 0 | 0 | 2 |
12 | 153.5 | 1.05 | 0 | 1 | 3 |
13 | 153.5 | 1.05 | 0 | 1 | 4 |
14 | 43 | 11 | 1 | 2 | 5 |
15 | 43 | 11 | 1 | 2 | 6 |
4 | 152 | 21 | 1 | 2 | 5 |
5 | 152 | 21 | 1 | 2 | 6 |
I have bolded #5 to draw attention to how index 14, 4 are paired together. Similar with #6. I hope that makes sense!
IIUC you need to add the group number + the cumcount per duplicate + 1:
df['UniID'] = (df['GroupID']
+df.groupby('GroupID').ngroup().add(1)
+df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
)
output:
Index Cents SD_YF GroupID UniID
0 10 182.5 2.10 0 1
1 11 182.5 2.10 0 2
2 12 153.5 1.05 1 3
3 13 153.5 1.05 1 4
4 14 43.0 11.00 2 5
5 15 43.0 11.00 2 6
6 4 152.0 21.00 2 5
7 5 152.0 21.00 2 6