Keep last set of obs within a group with the same (most recent) date
Question:
Is there a one-step way to keep only the latest observations within a "group"?
For example, I want to keep only the most recent observations for each PrimaryID-SecondaryID pair.
PrimaryID SecondaryID SubAccount Value ReportDate
0 1 A 123 5618.48 2022-01-01
1 1 A 456 8206.23 2022-01-01
2 1 A 123 6722.05 2022-07-01
3 1 A 456 5500.53 2022-07-01
4 1 B 789 8990.75 2022-02-01
5 1 B 987 6294.63 2022-02-01
6 1 B 789 8389.60 2022-03-01
7 1 B 246 343.02 2022-03-01
8 2 X 234 4157.57 2022-02-01
9 2 X 752 8218.00 2022-02-01
10 2 X 234 6430.68 2022-03-01
11 2 X 755 7148.57 2022-03-01
12 2 Y 731 5406.63 2022-05-02
13 2 Y 480 2429.83 2022-05-02
14 2 Y 731 6251.38 2022-06-01
15 2 Y 841 8256.93 2022-06-01
This is one way to accomplish this, but it seems sloppy.
df['lastRptDt'] = df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate'].transform('max')
df1 = df[df['ReportDate'] == df['lastRptDt']]
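For reference, the same transform-based logic can be collapsed into a single expression that skips the helper column. This is a sketch using a trimmed-down version of the sample data (two groups only, for brevity):

```python
import pandas as pd

# Trimmed sample mirroring the question's data (two groups shown)
df = pd.DataFrame({
    'PrimaryID':   [1, 1, 1, 1, 2, 2],
    'SecondaryID': ['A', 'A', 'A', 'A', 'X', 'X'],
    'SubAccount':  [123, 456, 123, 456, 234, 752],
    'Value':       [5618.48, 8206.23, 6722.05, 5500.53, 4157.57, 8218.00],
    'ReportDate':  pd.to_datetime(['2022-01-01', '2022-01-01',
                                   '2022-07-01', '2022-07-01',
                                   '2022-02-01', '2022-02-01']),
})

# One-step filter: keep every row whose ReportDate equals its group's maximum.
# Ties within a group are all kept, and the original index is preserved.
latest = df[df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate']
              .transform('max') == df['ReportDate']]
```

This avoids materializing `lastRptDt` on the frame; if you still want that column in the result, the two-step version above is the simpler route.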
This is the desired output:
PrimaryID SecondaryID SubAccount Value ReportDate lastRptDt
2 1 A 123 6722.05 2022-07-01 2022-07-01
3 1 A 456 5500.53 2022-07-01 2022-07-01
6 1 B 789 8389.60 2022-03-01 2022-03-01
7 1 B 246 343.02 2022-03-01 2022-03-01
10 2 X 234 6430.68 2022-03-01 2022-03-01
11 2 X 755 7148.57 2022-03-01 2022-03-01
14 2 Y 731 6251.38 2022-06-01 2022-06-01
15 2 Y 841 8256.93 2022-06-01 2022-06-01
Edit/Update: In a comment to this related post (Improving time on one line of python script), @rpanai asked for the df.to_dict() output for both the original and desired DataFrames.
Original Data
>>> df.to_dict()
{'PrimaryID': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 2,
9: 2,
10: 2,
11: 2,
12: 2,
13: 2,
14: 2,
15: 2},
'SecondaryID': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'B',
5: 'B',
6: 'B',
7: 'B',
8: 'X',
9: 'X',
10: 'X',
11: 'X',
12: 'Y',
13: 'Y',
14: 'Y',
15: 'Y'},
'SubAccount': {0: 123,
1: 456,
2: 123,
3: 456,
4: 789,
5: 987,
6: 789,
7: 246,
8: 234,
9: 752,
10: 234,
11: 755,
12: 731,
13: 480,
14: 731,
15: 841},
'Value': {0: 5618.48,
1: 8206.23,
2: 6722.05,
3: 5500.53,
4: 8990.75,
5: 6294.63,
6: 8389.6,
7: 343.02,
8: 4157.57,
9: 8218.0,
10: 6430.68,
11: 7148.57,
12: 5406.63,
13: 2429.83,
14: 6251.38,
15: 8256.93},
'ReportDate': {0: Timestamp('2022-01-01 00:00:00'),
1: Timestamp('2022-01-01 00:00:00'),
2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
4: Timestamp('2022-02-01 00:00:00'),
5: Timestamp('2022-02-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
8: Timestamp('2022-02-01 00:00:00'),
9: Timestamp('2022-02-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
12: Timestamp('2022-05-02 00:00:00'),
13: Timestamp('2022-05-02 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
Desired Dataset
>>> df1.to_dict()
{'PrimaryID': {2: 1, 3: 1, 6: 1, 7: 1, 10: 2, 11: 2, 14: 2, 15: 2},
'SecondaryID': {2: 'A',
3: 'A',
6: 'B',
7: 'B',
10: 'X',
11: 'X',
14: 'Y',
15: 'Y'},
'SubAccount': {2: 123,
3: 456,
6: 789,
7: 246,
10: 234,
11: 755,
14: 731,
15: 841},
'Value': {2: 6722.05,
3: 5500.53,
6: 8389.6,
7: 343.02,
10: 6430.68,
11: 7148.57,
14: 6251.38,
15: 8256.93},
'ReportDate': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')},
'lastRptDt': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
Answers:
How about this?
df.set_index(['PrimaryID', 'SecondaryID', 'ReportDate']).loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
Out[54]:
SubAccount Value lastRptDt
PrimaryID SecondaryID ReportDate
1 A 2022-07-01 123 6722.05 2022-07-01
2022-07-01 456 5500.53 2022-07-01
B 2022-03-01 789 8389.60 2022-03-01
2022-03-01 246 343.02 2022-03-01
2 X 2022-03-01 234 6430.68 2022-03-01
2022-03-01 755 7148.57 2022-03-01
Y 2022-06-01 731 6251.38 2022-06-01
2022-06-01 841 8256.93 2022-06-01
To keep your index you can use:
(df.reset_index()
   .set_index(['PrimaryID', 'SecondaryID', 'ReportDate'], drop=False)
   .loc[:, :, df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
   .set_index('index'))
Out[49]:
PrimaryID SecondaryID ReportDate SubAccount Value
index
2 1 A 2022-07-01 123 6722.05
3 1 A 2022-07-01 456 5500.53
6 1 B 2022-03-01 789 8389.60
7 1 B 2022-03-01 246 343.02
10 2 X 2022-03-01 234 6430.68
11 2 X 2022-03-01 755 7148.57
14 2 Y 2022-06-01 731 6251.38
15 2 Y 2022-06-01 841 8256.93
The reset_index() call appends the original index as a column, which you can restore as the index once you are done. Note that set_index(..., drop=False) is required in the first call so that the key columns still appear as regular columns in the result.
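Another common idiom, not from the original answer, is an inner merge against the per-group maximum dates. Unlike an index-based date lookup, the merge matches dates per group rather than globally, and it keeps all tied rows; the trade-off is that it resets the original row index. A minimal sketch on a trimmed sample:

```python
import pandas as pd

# Trimmed sample: group (1, A) has distinct dates, group (1, B) has a tie
df = pd.DataFrame({
    'PrimaryID':   [1, 1, 1, 1],
    'SecondaryID': ['A', 'A', 'B', 'B'],
    'SubAccount':  [123, 456, 789, 246],
    'Value':       [5618.48, 6722.05, 8389.60, 343.02],
    'ReportDate':  pd.to_datetime(['2022-01-01', '2022-07-01',
                                   '2022-03-01', '2022-03-01']),
})

# Per-group maximum dates as a small key table
last = df.groupby(['PrimaryID', 'SecondaryID'], as_index=False)['ReportDate'].max()

# Inner merge keeps exactly the rows whose date equals their own group's max,
# including ties within a group; the merge produces a fresh 0..n-1 index.
df1 = df.merge(last, on=['PrimaryID', 'SecondaryID', 'ReportDate'])
```

If preserving the original index matters, apply the reset_index()/set_index('index') pattern described above around the merge.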