Pandas Fill in Missing Row in Group with multiple keys

Question:

I’m looking to fill in a dataframe with a missing row based on a few criteria.

Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   keyA            39686 non-null  object        
 1   keyB            39686 non-null  int64         
 2   keyC            39686 non-null  object        
 3   keyD            39686 non-null  object        
 4   snapshot_week   39686 non-null  datetime64[ns]
 5   metric1         39686 non-null  int64        
 6   metric2         39686 non-null  int64        
dtypes: datetime64[ns](1), int64(3), object(3)

A1/B1/C1/D1 has data missing for 2022-08-20, and A3/B3/C3/D3 is missing 2022-08-27.

[['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]

I’d like the data frame to end up as follows, with the extra rows (marked with arrows) using 0 as the default value for metric1/metric2.

[['A1','B1','C1','D1','2022-08-20',0,0],  # <--- 
['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587],
['A3','B3','C3','D3','2022-08-27',0,0]] # <--

I’ve tried a few techniques, such as reindex, asfreq, and groupby, but have yet to achieve the desired result. The snapshot_week values always fall on a Saturday, and only two dates are ever present at a given time. Not all key permutations are needed, and some keys are numeric identifiers. Essentially, I just need a week-over-week record for each key combination, with the metrics defaulting to 0.

Thanks in advance for the support!

Asked By: Brent

Answers:

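Try as follows. First, we use df.pivot with the key* columns as the index, so that each key combination gets one column per week for each metric; a combination with no data for a given week gets NaN there. Next, we use df.stack with the dropna parameter set to False to turn the weeks back into rows while keeping the NaN entries, which gives one row per key combination per week. Finally, we reset the index and replace the NaNs with zeros.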

import pandas as pd

data = [['A1','B1','C1','D1','2022-08-27',5000,5000],
['A2','B2','C2','D2','2022-08-20',4278,4278],
['A2','B2','C2','D2','2022-08-27',6852,6852],
['A3','B3','C3','D3','2022-08-20',9587,9587]]

cols = ['keyA','keyB','keyC','keyD','snapshot_week', 'metric1', 'metric2']
df = pd.DataFrame(data, columns=cols)

# The outer parentheses let the method chain span several lines.
df_new = (
    df.pivot(index=['keyA','keyB','keyC','keyD'],
             columns=['snapshot_week'],
             values=['metric1','metric2'])
      .stack(dropna=False).reset_index(drop=False).fillna(0)
)
    
print(df_new)

  keyA keyB keyC keyD snapshot_week  metric1  metric2
0   A1   B1   C1   D1    2022-08-20      0.0      0.0
1   A1   B1   C1   D1    2022-08-27   5000.0   5000.0
2   A2   B2   C2   D2    2022-08-20   4278.0   4278.0
3   A2   B2   C2   D2    2022-08-27   6852.0   6852.0
4   A3   B3   C3   D3    2022-08-20   9587.0   9587.0
5   A3   B3   C3   D3    2022-08-27      0.0      0.0
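The metrics in the output above are floats because the missing rows pass through NaN before being filled. If integer metrics matter, one possible variation (a sketch, not part of the original answer) is to let unstack fill the gaps directly through its fill_value parameter, so NaN never appears. The names keys and df_int are made up for this illustration, and recent pandas 2.x releases may print a FutureWarning about stack.

```python
import pandas as pd

data = [['A1', 'B1', 'C1', 'D1', '2022-08-27', 5000, 5000],
        ['A2', 'B2', 'C2', 'D2', '2022-08-20', 4278, 4278],
        ['A2', 'B2', 'C2', 'D2', '2022-08-27', 6852, 6852],
        ['A3', 'B3', 'C3', 'D3', '2022-08-20', 9587, 9587]]
cols = ['keyA', 'keyB', 'keyC', 'keyD', 'snapshot_week', 'metric1', 'metric2']
df = pd.DataFrame(data, columns=cols)

keys = ['keyA', 'keyB', 'keyC', 'keyD']  # hypothetical helper name

df_int = (
    df.set_index(keys + ['snapshot_week'])
      .unstack(fill_value=0)   # weeks become columns; gaps are filled with 0, not NaN
      .stack()                 # weeks go back to rows, one per key combination per week
      .sort_index()            # order by keys, then week
      .reset_index()
)

print(df_int)
print(df_int.dtypes)
```

Only key combinations that already occur in the data are expanded, which matches the requirement that not all key permutations are needed.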
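Answered By: ouroboros1

# df1 is the question's DataFrame; how="cross" needs pandas 1.2 or later.
(
    df1.snapshot_week.drop_duplicates().to_frame('snapshot_week')
       # pair every distinct week with every key combination that occurs
       .merge(df1.loc[:, ['keyA', 'keyB', 'keyC', 'keyD']].drop_duplicates(), how="cross")
       # bring in the real metrics; pairs with no data get NaN
       .merge(df1, how='left', on=['keyA', 'keyB', 'keyC', 'keyD', 'snapshot_week'], suffixes=("", "_y"))
       .fillna(0).sort_values(['keyA', 'keyB', 'keyC', 'keyD','snapshot_week'])
)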
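The chain above can be unpacked into named steps and run against the question's sample data. This is a sketch under a few assumptions: the names keys, weeks, combos, grid, filled and rows_per_key are made up for illustration, and fillna is limited to the two metric columns rather than applied to the whole frame.

```python
import pandas as pd

data = [['A1', 'B1', 'C1', 'D1', '2022-08-27', 5000, 5000],
        ['A2', 'B2', 'C2', 'D2', '2022-08-20', 4278, 4278],
        ['A2', 'B2', 'C2', 'D2', '2022-08-27', 6852, 6852],
        ['A3', 'B3', 'C3', 'D3', '2022-08-20', 9587, 9587]]
cols = ['keyA', 'keyB', 'keyC', 'keyD', 'snapshot_week', 'metric1', 'metric2']
df1 = pd.DataFrame(data, columns=cols)

keys = ['keyA', 'keyB', 'keyC', 'keyD']  # hypothetical helper name

# Step 1: the distinct weeks (two in this sample).
weeks = df1[['snapshot_week']].drop_duplicates()
# Step 2: the key combinations that actually occur in the data.
combos = df1[keys].drop_duplicates()
# Step 3: pair every week with every observed combination (pandas >= 1.2).
grid = weeks.merge(combos, how='cross')
# Step 4: left-join the real metrics; pairs without data get NaN, then 0.
filled = (
    grid.merge(df1, how='left', on=keys + ['snapshot_week'])
        .fillna({'metric1': 0, 'metric2': 0})
        .sort_values(keys + ['snapshot_week'])
        .reset_index(drop=True)
)

# Sanity check: every key combination now has one row per week.
rows_per_key = filled.groupby(keys).size()
assert (rows_per_key == len(weeks)).all()

print(filled)
```

As with the pivot approach, the metrics come back as floats here because the left join introduces NaN before the fill.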

Answered By: G.G