Pandas Fill in Missing Row in Group with multiple keys
Question:
I’m looking to fill in a dataframe with a missing row based on a few criteria.
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 keyA 39686 non-null object
1 keyB 39686 non-null int64
2 keyC 39686 non-null object
3 keyD 39686 non-null object
4 snapshot_week 39686 non-null datetime64[ns]
5 metric1 39686 non-null int64
6 metric2 39686 non-null int64
dtypes: datetime64[ns](1), int64(1), object(5)
A1/B1/C1/D1 has data missing for 2022-08-20, and A3/B3/C3/D3 is missing 2022-08-27.
[['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
I’d like to update the data frame with the following extra rows, using 0 as the default value for metric1/metric2.
[['A1','B1','C1','D1','2022-08-20',0,0], # <---
['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587],
['A3','B3','C3','D3','2022-08-27',0,0]] # <--
I’ve tried a few different techniques such as reindex
, asfreq
, groupby
but I have yet to achieve the desired results. The snapshot_week values will only be on a Saturday, and only two dates will ever be present at a given time. Not all key permutations are needed, and some keys are numeric identifiers. Essentially, I just need to have the week over week record for the key combination with metrics defaulting to 0.
Thanks in advance for the support!
Answers:
Try as follows. First, we use df.pivot
to "group" all values (per each week) for each key*
-sequence (as index). Next, we use df.stack
with dropna
parameter set to False
to get rows for all key*
-sequences per week. Finally, we reset the index and replace the NaNs with zeros.
import pandas as pd
data = [['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
cols = ['keyA','keyB','keyC','keyD','snapshot_week', 'metric1', 'metric2']
df = pd.DataFrame(data, columns=cols)
df_new = df.pivot(index=['keyA','keyB','keyC','keyD'],
columns=['snapshot_week'],
values=['metric1','metric2'])
.stack(dropna=False).reset_index(drop=False).fillna(0)
print(df_new)
keyA keyB keyC keyD snapshot_week metric1 metric2
0 A1 B1 C1 D1 2022-08-20 0.0 0.0
1 A1 B1 C1 D1 2022-08-27 5000.0 5000.0
2 A2 B2 C2 D2 2022-08-20 4278.0 4278.0
3 A2 B2 C2 D2 2022-08-27 6852.0 6852.0
4 A3 B3 C3 D3 2022-08-20 9587.0 9587.0
5 A3 B3 C3 D3 2022-08-27 0.0 0.0
df1.snapshot_week.drop_duplicates().to_frame('snapshot_week')
.merge(df1.loc[:, ['keyA', 'keyB', 'keyC', 'keyD']].drop_duplicates(), how="cross")
.merge(df1, how='left', on=['keyA', 'keyB', 'keyC', 'keyD', 'snapshot_week'], suffixes=("", "_y"))
.fillna(0).sort_values(['keyA', 'keyB', 'keyC', 'keyD','snapshot_week'])
I’m looking to fill in a dataframe with a missing row based on a few criteria.
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 keyA 39686 non-null object
1 keyB 39686 non-null int64
2 keyC 39686 non-null object
3 keyD 39686 non-null object
4 snapshot_week 39686 non-null datetime64[ns]
5 metric1 39686 non-null int64
6 metric2 39686 non-null int64
dtypes: datetime64[ns](1), int64(1), object(5)
A1/B1/C1/D1 has data missing for 2022-08-20, and A3/B3/C3/D3 is missing 2022-08-27.
[['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
I’d like to update the data frame with the following extra rows, using 0 as the default value for metric1/metric2.
[['A1','B1','C1','D1','2022-08-20',0,0], # <---
['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587],
['A3','B3','C3','D3','2022-08-27',0,0]] # <--
I’ve tried a few different techniques such as reindex
, asfreq
, groupby
but I have yet to achieve the desired results. The snapshot_week values will only be on a Saturday, and only two dates will ever be present at a given time. Not all key permutations are needed, and some keys are numeric identifiers. Essentially, I just need to have the week over week record for the key combination with metrics defaulting to 0.
Thanks in advance for the support!
Try as follows. First, we use df.pivot
to "group" all values (per each week) for each key*
-sequence (as index). Next, we use df.stack
with dropna
parameter set to False
to get rows for all key*
-sequences per week. Finally, we reset the index and replace the NaNs with zeros.
import pandas as pd
data = [['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]
cols = ['keyA','keyB','keyC','keyD','snapshot_week', 'metric1', 'metric2']
df = pd.DataFrame(data, columns=cols)
df_new = df.pivot(index=['keyA','keyB','keyC','keyD'],
columns=['snapshot_week'],
values=['metric1','metric2'])
.stack(dropna=False).reset_index(drop=False).fillna(0)
print(df_new)
keyA keyB keyC keyD snapshot_week metric1 metric2
0 A1 B1 C1 D1 2022-08-20 0.0 0.0
1 A1 B1 C1 D1 2022-08-27 5000.0 5000.0
2 A2 B2 C2 D2 2022-08-20 4278.0 4278.0
3 A2 B2 C2 D2 2022-08-27 6852.0 6852.0
4 A3 B3 C3 D3 2022-08-20 9587.0 9587.0
5 A3 B3 C3 D3 2022-08-27 0.0 0.0
df1.snapshot_week.drop_duplicates().to_frame('snapshot_week')
.merge(df1.loc[:, ['keyA', 'keyB', 'keyC', 'keyD']].drop_duplicates(), how="cross")
.merge(df1, how='left', on=['keyA', 'keyB', 'keyC', 'keyD', 'snapshot_week'], suffixes=("", "_y"))
.fillna(0).sort_values(['keyA', 'keyB', 'keyC', 'keyD','snapshot_week'])