Groupby and generate a column saying how many values are imputed

Question:

I have a dataframe that looks like this:

Region  Country       Imputed  Year  Price
Africa  South Africa  No       2016  500
Africa  South Africa  No       2017  400
Africa  South Africa  Yes      2018  432
Africa  South Africa  No       2019  450
Africa  Nigeria       Yes      2016  750
Africa  Nigeria       Yes      2017  780
Africa  Nigeria       No       2018  816
Africa  Nigeria       No       2019  890
Africa  Kenya         Yes      2016  212
Africa  Kenya         No       2017  376
Africa  Kenya         No       2018  415
Africa  Kenya         No       2019  430

Here is the sample data:

import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','Nigeria','Nigeria','Nigeria','Nigeria','Kenya','Kenya','Kenya','Kenya'],
         'Imputed': ['No','No','Yes','No','Yes','Yes','No','No','Yes','No','No','No'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 432,450,750,780,816,890,212,376,415,430]}
df = pd.DataFrame(data1)

I have to do a groupby using Region and Year to calculate the regional price for each year, which is straightforward to do. However, I would like to add a new column which says how many values have been imputed when doing the groupby.

The output should look like this:

Region  Imputed                 Year  Price
Africa  2/3 Components Imputed  2016  487.3
Africa  1/3 Components Imputed  2017  518.7
Africa  1/3 Components Imputed  2018  554.3
Africa  0/3 Components Imputed  2019  590

Below is my code so far:

df = df.groupby(['Region','Year'])['Price'].mean()

Is there any way of adding the additional column as per my desired output example?

Asked By: A.N.

||

Answers:

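Create a Boolean helper column by comparing Imputed with 'Yes', then aggregate it with sum to count the True values and with GroupBy.size to get the total number of rows per group (and, if necessary, with mean to get the imputed share as Count_Imputed). Finally, join the two counts into the Imputed string and select the columns with a list to get the expected order: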

df1 = (df.assign(Imputed = df['Imputed'].eq('Yes'))
       .groupby(['Region','Year'], as_index=False)
       .agg(Price=('Price','mean'),
            Imputed=('Imputed','sum'),
            new=('Imputed','size'),
            Count_Imputed=('Imputed','mean')))

df1['Imputed'] = (df1['Imputed'].astype(str) + '/' +
                 df1['new'].astype(str) + ' Components Imputed')

df1 = df1[['Region','Imputed','Count_Imputed','Year','Price']]
print (df1)
   Region                 Imputed  Count_Imputed  Year       Price
0  Africa  2/3 Components Imputed       0.666667  2016  487.333333
1  Africa  1/3 Components Imputed       0.333333  2017  518.666667
2  Africa  1/3 Components Imputed       0.333333  2018  554.333333
3  Africa  0/3 Components Imputed       0.000000  2019  590.000000
Answered By: jezrael

A single-pass aggregation that maps Imputed with {'Yes':1, 'No':0} and formats the counts as a string:

df = (df.groupby(['Region','Year'])
        .agg({'Price': 'mean',
              'Imputed': lambda x: f"{x.map({'Yes':1, 'No':0}).sum()}/{x.size}"})
        .reset_index())

   Region  Year       Price Imputed
0  Africa  2016  487.333333     2/3
1  Africa  2017  518.666667     1/3
2  Africa  2018  554.333333     1/3
3  Africa  2019  590.000000     0/3
Answered By: RomanPerekhrest
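
For reference, the two ideas above can be combined to get the exact column layout from the question. This is only a sketch, not either answerer's code: the helper name imputed_label is made up for illustration, and rounding Price to one decimal is an assumption based on the desired output shown in the question.

```python
import pandas as pd

# Sample data from the question, written more compactly.
df = pd.DataFrame({
    'Region': ['Africa'] * 12,
    'Country': ['South Africa'] * 4 + ['Nigeria'] * 4 + ['Kenya'] * 4,
    'Imputed': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No',
                'Yes', 'No', 'No', 'No'],
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 432, 450, 750, 780, 816, 890, 212, 376, 415, 430],
})


def imputed_label(flags: pd.Series) -> str:
    """Hypothetical helper: build 'k/n Components Imputed' for one group."""
    imputed = int(flags.eq('Yes').sum())  # rows flagged as imputed
    total = len(flags)                    # all rows in the group
    return f"{imputed}/{total} Components Imputed"


result = (
    df.groupby(['Region', 'Year'], as_index=False)
      .agg(Imputed=('Imputed', imputed_label), Price=('Price', 'mean'))
      .round({'Price': 1})  # assumption: one decimal, as in the question
)

# Reorder the columns to match the layout requested in the question.
result = result[['Region', 'Imputed', 'Year', 'Price']]
print(result)
```

Named aggregation accepts a callable as well as a string, so the label can be built inside the groupby without a helper column or a second step. A custom Python function is slower than the built-in sum and size on large frames, so the helper-column approach from the first answer is probably the better choice when performance matters.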