New column based on last time row value equals some numbers in Pandas dataframe

Question:

I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.

Date        Student_ID  Rank    Predicted_Score
4/7/2021    33          2       87
13/6/2021   33          4       88
31/3/2021   33          7       88
28/2/2021   33          2       86
14/2/2021   33          10      86
31/1/2021   33          8       86
23/12/2020  33          1       81
8/11/2020   33          3       80
21/10/2020  33          3       80
23/9/2020   33          4       80
20/5/2020   33          3       80
29/4/2020   33          4       80
15/4/2020   33          2       79
26/2/2020   33          3       79
12/2/2020   33          5       79
29/1/2020   33          1       70

I want to create a column called Recent_Predicted_Score that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like

Date        Student_ID  Rank    Predicted_Score    Recent_Predicted_Score
4/7/2021    33          2       87                 86
13/6/2021   33          4       88                 86
31/3/2021   33          7       88                 86
28/2/2021   33          2       86                 81
14/2/2021   33          10      86                 81
31/1/2021   33          8       86                 81
23/12/2020  33          1       81                 80
8/11/2020   33          3       80                 80
21/10/2020  33          3       80                 80
23/9/2020   33          4       80                 80
20/5/2020   33          3       80                 79
29/4/2020   33          4       80                 79
15/4/2020   33          2       79                 79
26/2/2020   33          3       79                 70
12/2/2020   33          5       79                 70
29/1/2020   33          1       70

Here’s what I have tried but it doesn’t quite work, not sure if I am on the right track:

df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))

Thanks in advance.

Asked By: Nayr borcherds

||

Answers:

Try:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = df.sort_values(['Student_ID', 'Date'])

df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))

df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)

Prints:

         Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
0  2021-07-04          33     2               87                   86.0
1  2021-06-13          33     4               88                   86.0
2  2021-03-31          33     7               88                   86.0
3  2021-02-28          33     2               86                   81.0
4  2021-02-14          33    10               86                   81.0
5  2021-01-31          33     8               86                   81.0
6  2020-12-23          33     1               81                   80.0
7  2020-11-08          33     3               80                   80.0
8  2020-10-21          33     3               80                   80.0
9  2020-09-23          33     4               80                   80.0
10 2020-05-20          33     3               80                   79.0
11 2020-04-29          33     4               80                   79.0
12 2020-04-15          33     2               79                   79.0
13 2020-02-26          33     3               79                   70.0
14 2020-02-12          33     5               79                   70.0
15 2020-01-29          33     1               70                       
Answered By: Andrej Kesely

Let’s assume:

  • there may be more than one unique Student_ID
  • the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
  • we want to preserve the index of the original dataframe

Subject to these assumptions, here’s a way to do what your question asks:

df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = ( df
    .groupby('Student_ID', sort=False)
    .apply(lambda group: group.shift(-1).bfill())
    ['Recent_Predicted_Score'] )

Explanation:

  • create a new column Recent_Predicted_Score containing the PredictedScore where Rank is in the top 3 and NaN otherwise
  • use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, specifically, not influencing the existing descending order by Date)
  • within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.

Sample input (with two distinct Student_ID values):

         Date  Student_ID  Rank  Predicted_Score
0  2021-07-04          33     2               87
1  2021-07-04          66     2               87
2  2021-06-13          33     4               88
3  2021-06-13          66     4               88
4  2021-03-31          33     7               88
5  2021-03-31          66     7               88
6  2021-02-28          33     2               86
7  2021-02-28          66     2               86
8  2021-02-14          33    10               86
9  2021-02-14          66    10               86
10 2021-01-31          33     8               86
11 2021-01-31          66     8               86
12 2020-12-23          33     1               81
13 2020-12-23          66     1               81
14 2020-11-08          33     3               80
15 2020-11-08          66     3               80
16 2020-10-21          33     3               80
17 2020-10-21          66     3               80
18 2020-09-23          33     4               80
19 2020-09-23          66     4               80
20 2020-05-20          33     3               80
21 2020-05-20          66     3               80
22 2020-04-29          33     4               80
23 2020-04-29          66     4               80
24 2020-04-15          33     2               79
25 2020-04-15          66     2               79
26 2020-02-26          33     3               79
27 2020-02-26          66     3               79
28 2020-02-12          33     5               79
29 2020-02-12          66     5               79
30 2020-01-29          33     1               70
31 2020-01-29          66     1               70

Output:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
1  2021-07-04          66     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
3  2021-06-13          66     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
5  2021-03-31          66     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
7  2021-02-28          66     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
9  2021-02-14          66    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
11 2021-01-31          66     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
13 2020-12-23          66     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
15 2020-11-08          66     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
19 2020-09-23          66     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
21 2020-05-20          66     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
23 2020-04-29          66     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
25 2020-04-15          66     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
27 2020-02-26          66     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
29 2020-02-12          66     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
31 2020-01-29          66     1               70                     NaN

Output sorted by Student_ID, Date for easier inspection:

         Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0  2021-07-04          33     2               87                    86.0
2  2021-06-13          33     4               88                    86.0
4  2021-03-31          33     7               88                    86.0
6  2021-02-28          33     2               86                    81.0
8  2021-02-14          33    10               86                    81.0
10 2021-01-31          33     8               86                    81.0
12 2020-12-23          33     1               81                    80.0
14 2020-11-08          33     3               80                    80.0
16 2020-10-21          33     3               80                    80.0
18 2020-09-23          33     4               80                    80.0
20 2020-05-20          33     3               80                    79.0
22 2020-04-29          33     4               80                    79.0
24 2020-04-15          33     2               79                    79.0
26 2020-02-26          33     3               79                    70.0
28 2020-02-12          33     5               79                    70.0
30 2020-01-29          33     1               70                     NaN
1  2021-07-04          66     2               87                    86.0
3  2021-06-13          66     4               88                    86.0
5  2021-03-31          66     7               88                    86.0
7  2021-02-28          66     2               86                    81.0
9  2021-02-14          66    10               86                    81.0
11 2021-01-31          66     8               86                    81.0
13 2020-12-23          66     1               81                    80.0
15 2020-11-08          66     3               80                    80.0
17 2020-10-21          66     3               80                    80.0
19 2020-09-23          66     4               80                    80.0
21 2020-05-20          66     3               80                    79.0
23 2020-04-29          66     4               80                    79.0
25 2020-04-15          66     2               79                    79.0
27 2020-02-26          66     3               79                    70.0
29 2020-02-12          66     5               79                    70.0
31 2020-01-29          66     1               70                     NaN
Answered By: constantstranger

Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score

c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())

Result

          Date  Student_ID  Rank  Predicted_Score  Recent_Predicted_Score
0     4/7/2021          33     2               87                    86.0
1    13/6/2021          33     4               88                    86.0
2    31/3/2021          33     7               88                    86.0
3    28/2/2021          33     2               86                    81.0
4    14/2/2021          33    10               86                    81.0
5    31/1/2021          33     8               86                    81.0
6   23/12/2020          33     1               81                    80.0
7    8/11/2020          33     3               80                    80.0
8   21/10/2020          33     3               80                    80.0
9    23/9/2020          33     4               80                    80.0
10   20/5/2020          33     3               80                    79.0
11   29/4/2020          33     4               80                    79.0
12   15/4/2020          33     2               79                    79.0
13   26/2/2020          33     3               79                    70.0
14   12/2/2020          33     5               79                    70.0
15   29/1/2020          33     1               70                     NaN

Note: Make sure your dataframe is sorted on Date in descending order.

Answered By: Shubham Sharma