How to fill last non-null value for each user in pandas?
Question:
I have a df of user journeys showing product purchase amounts. Users do not buy every day, so I want to fill each missing value with the last non-null value for that user. Currently, I have:
date | user_id | purchase_value
2020-01-01 | 1 | null
2020-01-02 | 1 | 1
2020-01-03 | 1 | null
2020-01-04 | 1 | 4
2020-01-01 | 2 | 55
2020-01-02 | 2 | null
I want it to look like this:
date | user_id | purchase_value
2020-01-01 | 1 | null
2020-01-02 | 1 | 1
2020-01-03 | 1 | 1
2020-01-04 | 1 | 4
2020-01-01 | 2 | 55
2020-01-02 | 2 | 55
Explanation: For user 1, we fill in 1 on 2020-01-03 since the last non-null value (on 2020-01-02) was 1. For user 2, we fill in 55 on 2020-01-02 since the last non-null value (on 2020-01-01) was 55.
How would I do this in pandas for each user_id and date? Also, the dates do not have to be sequential, i.e. there can be gaps in the dates. In that case, always fill in the last non-null value (whenever that was).
Update:
I tried this solution, but the forward fill does not behave as expected: it takes the value from the next (future) date instead of the last non-null value (see image).
df.groupby(['user_id'], sort=True)['purchase_amount'].apply(lambda x: x.ffill())
The first column is the actual normalized purchase amount for user=1 and the next column is the result of the formula. The first NaN should be replaced with 0.72, not 0.06.
Answers:
If you really want to ffill only the last NaN per group, you need to identify it, then replace it with its ffill:
# is the value NaN?
m1 = df['purchase_value'].isna()
# is this the last NaN of the group?
# here: is this the first NaN of the group in reverse?
m2 = m1[::-1].groupby(df['user_id']).cumsum().eq(1)
# then replace with the ffill per group
df.loc[m1&m2, 'purchase_value'] = df.groupby(['user_id'])['purchase_value'].ffill()
Output:
         date  user_id  purchase_value
0  2020-01-01        1             NaN
1  2020-01-02        1             1.0
2  2020-01-03        1             1.0
3  2020-01-04        1             4.0
4  2020-01-01        2            55.0
5  2020-01-02        2            55.0
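To reproduce the output above end to end, here is a minimal sketch that builds the question's sample frame and wraps the same masking idea in a function. The helper name fill_last_nan_per_group and its parameters are hypothetical, not part of pandas, and the sketch assumes the index is unique.

```python
import numpy as np
import pandas as pd

# sample data from the question (null written as np.nan)
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
             "2020-01-01", "2020-01-02"],
    "user_id": [1, 1, 1, 1, 2, 2],
    "purchase_value": [np.nan, 1, np.nan, 4, 55, np.nan],
})


def fill_last_nan_per_group(frame, key="user_id", col="purchase_value"):
    """Hypothetical helper: forward-fill only the last NaN of each group."""
    out = frame.copy()
    is_nan = out[col].isna()
    # count NaNs from the bottom of each group, then restore the original row order
    nan_rank_from_end = is_nan[::-1].astype(int).groupby(out[key]).cumsum()[::-1]
    # the last NaN of a group is the NaN row whose count from the bottom is 1
    is_last_nan = is_nan & nan_rank_from_end.eq(1)
    # per-group forward fill supplies the replacement values
    filled = out.groupby(key)[col].ffill()
    out.loc[is_last_nan, col] = filled[is_last_nan]
    return out


result = fill_last_nan_per_group(df)
print(result)
```

On this sample a plain per-group ffill would give the same result, because the only other NaN (the first row of user 1) has no earlier value to copy.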
Another possible solution:
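import numpy as np
import pandas as pd

# running count of NaNs within each user
df['aux'] = (
    df.assign(aux=pd.isna(df.purchase_value))
      .groupby('user_id')['aux'].cumsum())
# fill the NaN where the count hits the user's maximum (the last NaN); a per-user
# ffill replaces shift(1) so the fill value cannot leak across users
(df.assign(
    purchase_value=np.where(
        pd.isna(df.purchase_value)
        & (df.aux == df.groupby('user_id')['aux'].transform('max')),
        df.groupby('user_id')['purchase_value'].ffill(),
        df.purchase_value))
   .drop('aux', axis=1))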
Output:
         date  user_id  purchase_value
0  2020-01-01        1             NaN
1  2020-01-02        1             1.0
2  2020-01-03        1             1.0
3  2020-01-04        1             4.0
4  2020-01-01        2            55.0
5  2020-01-02        2            55.0
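Regarding the Update in the question: one possible cause (an assumption, since the image is not available) is that the rows are not in date order within each user. groupby(sort=True) only sorts the group keys, not the rows inside a group, so ffill follows row order rather than date order. A minimal sketch with made-up, shuffled rows, which fills every NaN and not just the last one per user:

```python
import numpy as np
import pandas as pd

# hypothetical data: same values as the question, but rows arrive out of date order
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-04", "2020-01-01", "2020-01-03",
                            "2020-01-02", "2020-01-02", "2020-01-01"]),
    "user_id": [1, 1, 1, 1, 2, 2],
    "purchase_value": [4, np.nan, np.nan, 1, np.nan, 55],
})

# order rows chronologically within each user before filling
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# forward-fill every NaN with the most recent earlier value of the same user;
# a leading NaN stays NaN because there is nothing earlier to copy
df["purchase_value"] = df.groupby("user_id")["purchase_value"].ffill()
print(df)
```

Gaps in the dates do not matter here, since the fill depends only on the row order after sorting.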