New column based on last time row value equals some numbers in Pandas dataframe
Question:
I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score
that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here’s what I have tried but it doesn’t quite work, not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
Answers:
Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
Let’s assume:
- there may be more than one unique
Student_ID
- the rows are ordered by descending
Date
as indicated by OP, but may not be ordered by Student_ID
- we want to preserve the index of the original dataframe
Subject to these assumptions, here’s a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = ( df
.groupby('Student_ID', sort=False)
.apply(lambda group: group.shift(-1).bfill())
['Recent_Predicted_Score'] )
Explanation:
- create a new column
Recent_Predicted_Score
containing the PredictedScore
where Rank
is in the top 3 and NaN otherwise
- use
groupby()
on Student_ID
with the sort
argument set to False for better performance (note that groupby()
preserves the order of rows within each group, specifically, not influencing the existing descending order by Date
)
- within each group, do
shift(-1)
and bfill()
to get the desired result for Recent_Predicted_Score
.
Sample input (with two distinct Student_ID
values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date
for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN
Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.
I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score
that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here’s what I have tried but it doesn’t quite work, not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
Let’s assume:
- there may be more than one unique
Student_ID
- the rows are ordered by descending
Date
as indicated by OP, but may not be ordered byStudent_ID
- we want to preserve the index of the original dataframe
Subject to these assumptions, here’s a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = ( df
.groupby('Student_ID', sort=False)
.apply(lambda group: group.shift(-1).bfill())
['Recent_Predicted_Score'] )
Explanation:
- create a new column
Recent_Predicted_Score
containing thePredictedScore
whereRank
is in the top 3 and NaN otherwise - use
groupby()
onStudent_ID
with thesort
argument set to False for better performance (note thatgroupby()
preserves the order of rows within each group, specifically, not influencing the existing descending order byDate
) - within each group, do
shift(-1)
andbfill()
to get the desired result forRecent_Predicted_Score
.
Sample input (with two distinct Student_ID
values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date
for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN
Mask the scores where rank is greater than 3 then group the masked column by Student_ID and backward fill to propagate the last predicted score
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID')[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.