Groupby and generate a column saying how many values are imputed

Question

I have a dataframe that looks like this:

Region	Country	Imputed	Year	Price
Africa	South Africa	No	2016	500
Africa	South Africa	No	2017	400
Africa	South Africa	Yes	2018	432
Africa	South Africa	No	2019	450
Africa	Nigeria	Yes	2016	750
Africa	Nigeria	Yes	2017	780
Africa	Nigeria	No	2018	816
Africa	Nigeria	No	2019	890
Africa	Kenya	Yes	2016	212
Africa	Kenya	No	2017	376
Africa	Kenya	No	2018	415
Africa	Kenya	No	2019	430

Here is the sample data:

import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','Nigeria','Nigeria','Nigeria','Nigeria','Kenya','Kenya','Kenya','Kenya'],
         'Imputed': ['No','No','Yes','No','Yes','Yes','No','No','Yes','No','No','No'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 432,450,750,780,816,890,212,376,415,430]}
df = pd.DataFrame(data1)

I have to do a groupby using Region and Year to calculate the regional price for each year, which is straightforward to do. However, I would like to add a new column which says how many values have been imputed when doing the groupby.

The output should look like this:

Region	Imputed	Year	Price
Africa	2/3 Components Imputed	2016	487.3
Africa	1/3 Components Imputed	2017	518.7
Africa	1/3 Components Imputed	2018	554.3
Africa	0/3 Components Imputed	2019	590

Below is my code so far:

df = df.groupby(['Region','Year'])['Price'].mean()

Is there any way of adding the additional column as per my desired output example?

Asked By: A.N.

Answer 1

You can create a helper column by comparing Imputed to 'Yes', then count the True values by aggregating with sum. For the total count use GroupBy.size (and, if necessary, use mean for Count_Imputed, the fraction of imputed values). Finally, join the two counts into the Imputed column and select the columns with a list to get the expected order:

df1 = (df.assign(Imputed = df['Imputed'].eq('Yes'))
       .groupby(['Region','Year'], as_index=False)
       .agg(Price=('Price','mean'),
            Imputed=('Imputed','sum'),
            new=('Imputed','size'),
            Count_Imputed=('Imputed','mean')))

df1['Imputed'] = (df1['Imputed'].astype(str) + '/' +
                 df1['new'].astype(str) + ' Components Imputed')

df1 = df1[['Region','Imputed','Count_Imputed','Year','Price']]
print (df1)
   Region                 Imputed  Count_Imputed  Year       Price
0  Africa  2/3 Components Imputed       0.666667  2016  487.333333
1  Africa  1/3 Components Imputed       0.333333  2017  518.666667
2  Africa  1/3 Components Imputed       0.333333  2018  554.333333
3  Africa  0/3 Components Imputed       0.000000  2019  590.000000

Answered By: jezrael
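
The counting step in Answer 1 relies on booleans behaving like 0 and 1 under sum and mean. Here is a minimal sketch of that idea on the 2016 rows of the sample data alone, without the groupby; the names is_imputed, n_imputed, n_total, share and label are illustrative and do not appear in the answer:

```python
import pandas as pd

# The three 2016 rows from the question's sample data.
df = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Africa'],
    'Country': ['South Africa', 'Nigeria', 'Kenya'],
    'Imputed': ['No', 'Yes', 'Yes'],
    'Year': [2016, 2016, 2016],
    'Price': [500, 750, 212],
})

# Comparing to 'Yes' gives a boolean Series; True counts as 1 in sum and mean.
is_imputed = df['Imputed'].eq('Yes')

n_imputed = int(is_imputed.sum())   # number of imputed rows
n_total = int(is_imputed.size)      # number of rows, imputed or not
share = float(is_imputed.mean())    # fraction of rows that are imputed

label = f"{n_imputed}/{n_total} Components Imputed"
print(label)  # 2/3 Components Imputed
```

Inside the groupby, sum, size and mean do the same thing once per (Region, Year) group.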

Answer 2

With single-pass aggregation using {'Yes':1, 'No':0} mapping and string formatting:

df = (df.groupby(['Region','Year'])
        .agg({'Price': 'mean',
              'Imputed': lambda x: f"{x.map({'Yes':1, 'No':0}).sum()}/{x.size}"})
        .reset_index())

   Region  Year       Price Imputed
0  Africa  2016  487.333333     2/3
1  Africa  2017  518.666667     1/3
2  Africa  2018  554.333333     1/3
3  Africa  2019  590.000000     0/3

Answered By: RomanPerekhrest
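
The single-pass idea can probably be pushed a little further to match the question's desired output exactly (full label text, column order, and price rounded to one decimal). This is only a sketch, not either answerer's code: the helper imputed_label and the variable result are names made up for illustration, and rounding to one decimal is an assumption based on the expected table in the question.

```python
import pandas as pd

data1 = {'Region': ['Africa'] * 12,
         'Country': ['South Africa'] * 4 + ['Nigeria'] * 4 + ['Kenya'] * 4,
         'Imputed': ['No','No','Yes','No','Yes','Yes','No','No','Yes','No','No','No'],
         'Year': [2016, 2017, 2018, 2019] * 3,
         'Price': [500, 400, 432, 450, 750, 780, 816, 890, 212, 376, 415, 430]}
df = pd.DataFrame(data1)

def imputed_label(s):
    """Build the 'k/n Components Imputed' text for one group (hypothetical helper)."""
    n_yes = int(s.eq('Yes').sum())
    return f"{n_yes}/{s.size} Components Imputed"

# Named aggregation: one output column per keyword argument.
result = (df.groupby(['Region', 'Year'], as_index=False)
            .agg(Price=('Price', 'mean'),
                 Imputed=('Imputed', imputed_label)))

# Assumption: one decimal place, as in the question's expected output.
result['Price'] = result['Price'].round(1)

# Reorder columns to match the desired layout.
result = result[['Region', 'Imputed', 'Year', 'Price']]
print(result)
```

Writing to a new variable rather than reassigning df keeps the original rows available for further checks.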
