Fill NA values over varied data frame column slices in Pandas
Question:
I have a Pandas data frame similar to the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'End' : ['2022-03','2022-05','2022-06'],
'2022-01' : [1,2,np.nan],
'2022-02' : [np.nan,3,4],
'2022-03' : [np.nan,1,3],
'2022-04' : [np.nan,np.nan,2],
'2022-05' : [np.nan,np.nan,np.nan],
'2022-06' : [np.nan,np.nan,np.nan]
})
I would like to fill the NaN values in each row so that every column up to and including the month listed in End
is replaced with 0, while the columns after it remain NaN.
The desired output would be:
pd.DataFrame({
'End' : ['2022-03','2022-05','2022-06'],
'2022-01' : [1,2,0],
'2022-02' : [0,3,4],
'2022-03' : [0,1,3],
'2022-04' : [np.nan,0,2],
'2022-05' : [np.nan,0,0],
'2022-06' : [np.nan,np.nan,0]
})
Answers:
Use broadcasting to compare the months, then mask with where:
df.iloc[:,1:] = df.iloc[:,1:].fillna(0).where(df['End'].to_numpy()[:,None] >= [df.columns[1:]])
Or, safer when the cells after End are not all NaN and their values must be kept:
df.iloc[:,1:] = np.where(df['End'].to_numpy()[:,None] >= [df.columns[1:]],
df.iloc[:,1:].fillna(0), df.iloc[:,1:])
Output:
       End  2022-01  2022-02  2022-03  2022-04  2022-05  2022-06
0  2022-03      1.0      0.0      0.0      NaN      NaN      NaN
1  2022-05      2.0      3.0      1.0      0.0      0.0      NaN
2  2022-06      0.0      4.0      3.0      2.0      0.0      0.0
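To see why the second variant is described as safer, here is a minimal sketch on a made-up frame (not the asker's data) in which one cell after End holds a real value. The names months, vals, keep, v1 and v2 are illustrative only, and the results are built as new objects rather than assigned back through iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has a real value (7) in a month after its End
df = pd.DataFrame({
    'End': ['2022-03', '2022-05'],
    '2022-01': [1, 2],
    '2022-02': [np.nan, 3],
    '2022-03': [np.nan, np.nan],
    '2022-04': [7, np.nan],
})

months = df.columns[1:]
vals = df[months]

# True where the column's month is on or before the row's End month
# ('YYYY-MM' strings compare correctly as plain strings)
keep = df['End'].to_numpy()[:, None] >= months.to_numpy()

# Variant 1: fill everything, then blank out every cell after End
v1 = vals.fillna(0).where(keep)

# Variant 2: fill only the cells up to End, leave the others untouched
v2 = pd.DataFrame(np.where(keep, vals.fillna(0), vals),
                  index=vals.index, columns=months)

print(v1)
print(v2)
```

In this sketch the first variant turns the 7 into NaN, while the second keeps it. On the data in the question both give the same result, because every cell after End is already NaN.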
Note: It might be better to set End as the index.
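A rough sketch of what that could look like, assuming the frame from the question. With End moved into the index, every remaining column is a month, so no iloc slicing is needed. The names wide, keep and out are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'End': ['2022-03', '2022-05', '2022-06'],
    '2022-01': [1, 2, np.nan],
    '2022-02': [np.nan, 3, 4],
    '2022-03': [np.nan, 1, 3],
    '2022-04': [np.nan, np.nan, 2],
    '2022-05': [np.nan, np.nan, np.nan],
    '2022-06': [np.nan, np.nan, np.nan],
})

# Move End into the index so that all columns are months
wide = df.set_index('End')

# True where the column's month is on or before the row's End month
keep = wide.index.to_numpy()[:, None] >= wide.columns.to_numpy()

# Replace with 0 only the cells that are both NaN and within the End range
out = wide.mask(wide.isna().to_numpy() & keep, 0)
print(out)
```

Calling out.reset_index() afterwards would bring End back as a regular column if the original layout is needed.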
Use NumPy broadcasting to compare the End values with the column names, then combine mask and fillna:
mask = df['End'].to_numpy()[:, None] >= df.columns.to_numpy()
out = df.fillna(df.mask(mask, 0))
print(out)
Output:
       End  2022-01  2022-02  2022-03  2022-04  2022-05  2022-06
0  2022-03      1.0      0.0      0.0      NaN      NaN      NaN
1  2022-05      2.0      3.0      1.0      0.0      0.0      NaN
2  2022-06      0.0      4.0      3.0      2.0      0.0      0.0
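Intermediate mask (the first column corresponds to End itself, which compares False against the date strings here):
array([[False,  True,  True,  True, False, False, False],
       [False,  True,  True,  True,  True,  True, False],
       [False,  True,  True,  True,  True,  True,  True]])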
Probably not the most elegant solution, but it can be done using pd.melt and pd.pivot:
melt_df = df.melt(id_vars=["End"])
melt_df.loc[(melt_df["End"] >= melt_df["variable"]) & (melt_df["value"].isnull()), "value"] = 0
The long format makes the condition easier to check. Then pivot back to get the original df format:
final_df = melt_df.pivot(index="End", columns="variable", values="value").reset_index()
final_df.columns.name = None
Output:
       End  2022-01  2022-02  2022-03  2022-04  2022-05  2022-06
0  2022-03      1.0      0.0      0.0      NaN      NaN      NaN
1  2022-05      2.0      3.0      1.0      0.0      0.0      NaN
2  2022-06      0.0      4.0      3.0      2.0      0.0      0.0
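One caveat with pivoting on End alone: pivot needs each index/column pair to be unique, so it raises an error if two rows share the same End. The sketch below is one possible workaround, not part of the answer above. It carries the original row label through the melt, uses made-up data with a duplicated End, and assumes pandas 1.1 or later (for a list passed as the pivot index).

```python
import numpy as np
import pandas as pd

# Hypothetical data with a duplicated End value
df = pd.DataFrame({
    'End': ['2022-03', '2022-03'],
    '2022-01': [1, np.nan],
    '2022-02': [np.nan, 3],
    '2022-03': [np.nan, np.nan],
    '2022-04': [np.nan, np.nan],
})

# Keep the original row label as a column so each row stays distinct
melt_df = df.reset_index().melt(id_vars=['index', 'End'])

# Fill NaN with 0 only where the month is on or before End
fill = (melt_df['End'] >= melt_df['variable']) & melt_df['value'].isna()
melt_df.loc[fill, 'value'] = 0

# Pivot on (row label, End), then move End back to a regular column
final_df = melt_df.pivot(index=['index', 'End'], columns='variable',
                         values='value').reset_index(level='End')
final_df.columns.name = None
final_df.index.name = None
print(final_df)
```

Because the row label is part of the pivot index, the rows also come back in their original order rather than sorted by End.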