Keep last set of obs within a group with the same (most recent) date
Question:
Is there a one-step way to keep only the latest observations within a "group"?
For example, I want to keep only the most recent observations for each PrimaryID-SecondaryID pair.
PrimaryID SecondaryID SubAccount Value ReportDate
0 1 A 123 5618.48 2022-01-01
1 1 A 456 8206.23 2022-01-01
2 1 A 123 6722.05 2022-07-01
3 1 A 456 5500.53 2022-07-01
4 1 B 789 8990.75 2022-02-01
5 1 B 987 6294.63 2022-02-01
6 1 B 789 8389.60 2022-03-01
7 1 B 246 343.02 2022-03-01
8 2 X 234 4157.57 2022-02-01
9 2 X 752 8218.00 2022-02-01
10 2 X 234 6430.68 2022-03-01
11 2 X 755 7148.57 2022-03-01
12 2 Y 731 5406.63 2022-05-02
13 2 Y 480 2429.83 2022-05-02
14 2 Y 731 6251.38 2022-06-01
15 2 Y 841 8256.93 2022-06-01
This is one way to accomplish this, but it seems sloppy.
df['lastRptDt'] = df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate'].transform('max')
df1 = df[df['ReportDate'] == df['lastRptDt']]
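For reference, the same transform-based logic can be collapsed into a single expression that skips the helper column. This is a sketch using a trimmed-down version of the sample data (two groups only, for brevity):

```python
import pandas as pd

# Trimmed sample mirroring the question's data (two groups shown)
df = pd.DataFrame({
    'PrimaryID':   [1, 1, 1, 1, 2, 2],
    'SecondaryID': ['A', 'A', 'A', 'A', 'X', 'X'],
    'SubAccount':  [123, 456, 123, 456, 234, 752],
    'Value':       [5618.48, 8206.23, 6722.05, 5500.53, 4157.57, 8218.00],
    'ReportDate':  pd.to_datetime(['2022-01-01', '2022-01-01',
                                   '2022-07-01', '2022-07-01',
                                   '2022-02-01', '2022-02-01']),
})

# One-step filter: keep every row whose ReportDate equals its group's maximum.
# Ties within a group are all kept, and the original index is preserved.
latest = df[df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate']
              .transform('max') == df['ReportDate']]
```

This avoids materializing `lastRptDt` on the frame; if you still want that column in the result, the two-step version above is the simpler route.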
This is the desired output:
PrimaryID SecondaryID SubAccount Value ReportDate lastRptDt
2 1 A 123 6722.05 2022-07-01 2022-07-01
3 1 A 456 5500.53 2022-07-01 2022-07-01
6 1 B 789 8389.60 2022-03-01 2022-03-01
7 1 B 246 343.02 2022-03-01 2022-03-01
10 2 X 234 6430.68 2022-03-01 2022-03-01
11 2 X 755 7148.57 2022-03-01 2022-03-01
14 2 Y 731 6251.38 2022-06-01 2022-06-01
15 2 Y 841 8256.93 2022-06-01 2022-06-01
Edit/Update: In a comment to this related post (Improving time on one line of python script), @rpanai asked for the df.to_dict() output for both the original and desired DataFrames.
Original Data
>>> df.to_dict()
{'PrimaryID': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 2,
9: 2,
10: 2,
11: 2,
12: 2,
13: 2,
14: 2,
15: 2},
'SecondaryID': {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'B',
5: 'B',
6: 'B',
7: 'B',
8: 'X',
9: 'X',
10: 'X',
11: 'X',
12: 'Y',
13: 'Y',
14: 'Y',
15: 'Y'},
'SubAccount': {0: 123,
1: 456,
2: 123,
3: 456,
4: 789,
5: 987,
6: 789,
7: 246,
8: 234,
9: 752,
10: 234,
11: 755,
12: 731,
13: 480,
14: 731,
15: 841},
'Value': {0: 5618.48,
1: 8206.23,
2: 6722.05,
3: 5500.53,
4: 8990.75,
5: 6294.63,
6: 8389.6,
7: 343.02,
8: 4157.57,
9: 8218.0,
10: 6430.68,
11: 7148.57,
12: 5406.63,
13: 2429.83,
14: 6251.38,
15: 8256.93},
'ReportDate': {0: Timestamp('2022-01-01 00:00:00'),
1: Timestamp('2022-01-01 00:00:00'),
2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
4: Timestamp('2022-02-01 00:00:00'),
5: Timestamp('2022-02-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
8: Timestamp('2022-02-01 00:00:00'),
9: Timestamp('2022-02-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
12: Timestamp('2022-05-02 00:00:00'),
13: Timestamp('2022-05-02 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
Desired Dataset
>>> df1.to_dict()
{'PrimaryID': {2: 1, 3: 1, 6: 1, 7: 1, 10: 2, 11: 2, 14: 2, 15: 2},
'SecondaryID': {2: 'A',
3: 'A',
6: 'B',
7: 'B',
10: 'X',
11: 'X',
14: 'Y',
15: 'Y'},
'SubAccount': {2: 123,
3: 456,
6: 789,
7: 246,
10: 234,
11: 755,
14: 731,
15: 841},
'Value': {2: 6722.05,
3: 5500.53,
6: 8389.6,
7: 343.02,
10: 6430.68,
11: 7148.57,
14: 6251.38,
15: 8256.93},
'ReportDate': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')},
'lastRptDt': {2: Timestamp('2022-07-01 00:00:00'),
3: Timestamp('2022-07-01 00:00:00'),
6: Timestamp('2022-03-01 00:00:00'),
7: Timestamp('2022-03-01 00:00:00'),
10: Timestamp('2022-03-01 00:00:00'),
11: Timestamp('2022-03-01 00:00:00'),
14: Timestamp('2022-06-01 00:00:00'),
15: Timestamp('2022-06-01 00:00:00')}}
Answers:
How about this?
df.set_index(['PrimaryID', 'SecondaryID', 'ReportDate']).loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
Out[54]:
SubAccount Value lastRptDt
PrimaryID SecondaryID ReportDate
1 A 2022-07-01 123 6722.05 2022-07-01
2022-07-01 456 5500.53 2022-07-01
B 2022-03-01 789 8389.60 2022-03-01
2022-03-01 246 343.02 2022-03-01
2 X 2022-03-01 234 6430.68 2022-03-01
2022-03-01 755 7148.57 2022-03-01
Y 2022-06-01 731 6251.38 2022-06-01
2022-06-01 841 8256.93 2022-06-01
To keep your index you can use:
(df.reset_index()
   .set_index(['PrimaryID', 'SecondaryID', 'ReportDate'], drop=False)
   .loc[:, :, df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
   .set_index('index'))
Out[49]:
PrimaryID SecondaryID ReportDate SubAccount Value
index
2 1 A 2022-07-01 123 6722.05
3 1 A 2022-07-01 456 5500.53
6 1 B 2022-03-01 789 8389.60
7 1 B 2022-03-01 246 343.02
10 2 X 2022-03-01 234 6430.68
11 2 X 2022-03-01 755 7148.57
14 2 Y 2022-06-01 731 6251.38
15 2 Y 2022-06-01 841 8256.93
The reset_index() call appends the original index as a column, which you can restore as the index once you are done. Note that set_index(..., drop=False) is required in the first call so that the key columns still appear as regular columns in the result.
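Another common idiom, not from the original answer, is an inner merge against the per-group maximum dates. Unlike an index-based date lookup, the merge matches dates per group rather than globally, and it keeps all tied rows; the trade-off is that it resets the original row index. A minimal sketch on a trimmed sample:

```python
import pandas as pd

# Trimmed sample: group (1, A) has distinct dates, group (1, B) has a tie
df = pd.DataFrame({
    'PrimaryID':   [1, 1, 1, 1],
    'SecondaryID': ['A', 'A', 'B', 'B'],
    'SubAccount':  [123, 456, 789, 246],
    'Value':       [5618.48, 6722.05, 8389.60, 343.02],
    'ReportDate':  pd.to_datetime(['2022-01-01', '2022-07-01',
                                   '2022-03-01', '2022-03-01']),
})

# Per-group maximum dates as a small key table
last = df.groupby(['PrimaryID', 'SecondaryID'], as_index=False)['ReportDate'].max()

# Inner merge keeps exactly the rows whose date equals their own group's max,
# including ties within a group; the merge produces a fresh 0..n-1 index.
df1 = df.merge(last, on=['PrimaryID', 'SecondaryID', 'ReportDate'])
```

If preserving the original index matters, apply the reset_index()/set_index('index') pattern described above around the merge.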