Groupby and generate a column saying how many values are imputed
Question:
I have a dataframe that looks like this:
Region | Country | Imputed | Year | Price |
---|---|---|---|---|
Africa | South Africa | No | 2016 | 500 |
Africa | South Africa | No | 2017 | 400 |
Africa | South Africa | Yes | 2018 | 432 |
Africa | South Africa | No | 2019 | 450 |
Africa | Nigeria | Yes | 2016 | 750 |
Africa | Nigeria | Yes | 2017 | 780 |
Africa | Nigeria | No | 2018 | 816 |
Africa | Nigeria | No | 2019 | 890 |
Africa | Kenya | Yes | 2016 | 212 |
Africa | Kenya | No | 2017 | 376 |
Africa | Kenya | No | 2018 | 415 |
Africa | Kenya | No | 2019 | 430 |
Here is the sample data:
import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','Nigeria','Nigeria','Nigeria','Nigeria','Kenya','Kenya','Kenya','Kenya'],
         'Imputed': ['No','No','Yes','No','Yes','Yes','No','No','Yes','No','No','No'],
         'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019],
         'Price': [500, 400, 432, 450, 750, 780, 816, 890, 212, 376, 415, 430]}
df = pd.DataFrame(data1)
I have to do a `groupby` using `Region` and `Year` to calculate the regional price for each year, which is straightforward to do. However, I would like to add a new column which says how many values have been imputed when doing the `groupby`.
The output should look like this:
Region | Imputed | Year | Price |
---|---|---|---|
Africa | 2/3 Components Imputed | 2016 | 487.3 |
Africa | 1/3 Components Imputed | 2017 | 518.7 |
Africa | 1/3 Components Imputed | 2018 | 554.3 |
Africa | 0/3 Components Imputed | 2019 | 590 |
Below is my code so far:
df = df.groupby(['Region','Year'])['Price'].mean()
Is there any way of adding the additional column as per my desired output example?
Answers:
You can create a helper column by comparing `Imputed` with 'Yes', then aggregate it with `sum` to count the `True` values and with `size` to get the total per group (and, if necessary, with `mean` to get the imputed fraction as `Count_Imputed`). Finally, join the two counts into the `Imputed` string and select the columns in the expected order with a list:
df1 = (df.assign(Imputed = df['Imputed'].eq('Yes'))
         .groupby(['Region','Year'], as_index=False)
         .agg(Price=('Price','mean'),
              Imputed=('Imputed','sum'),
              new=('Imputed','size'),
              Count_Imputed=('Imputed','mean')))
df1['Imputed'] = (df1['Imputed'].astype(str) + '/' +
                  df1['new'].astype(str) + ' Components Imputed')
df1 = df1[['Region','Imputed','Count_Imputed','Year','Price']]
print (df1)
Region Imputed Count_Imputed Year Price
0 Africa 2/3 Components Imputed 0.666667 2016 487.333333
1 Africa 1/3 Components Imputed 0.333333 2017 518.666667
2 Africa 1/3 Components Imputed 0.333333 2018 554.333333
3 Africa 0/3 Components Imputed 0.000000 2019 590.000000
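If the output should match the requested layout exactly (no `Count_Imputed` column and the price shown with one decimal), a rounding step and a narrower column selection can follow the aggregation. The snippet below is only a minimal sketch of that variation: the names `result`, `is_imputed`, `n_imputed` and `n_total` are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Sample data from the question, written compactly.
data1 = {
    'Region': ['Africa'] * 12,
    'Country': ['South Africa'] * 4 + ['Nigeria'] * 4 + ['Kenya'] * 4,
    'Imputed': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No',
                'Yes', 'No', 'No', 'No'],
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 432, 450, 750, 780, 816, 890, 212, 376, 415, 430],
}
df = pd.DataFrame(data1)

# Boolean helper column (hypothetical name), then one named aggregation per output.
result = (
    df.assign(is_imputed=df['Imputed'].eq('Yes'))
      .groupby(['Region', 'Year'], as_index=False)
      .agg(Price=('Price', 'mean'),
           n_imputed=('is_imputed', 'sum'),
           n_total=('is_imputed', 'size'))
)

# Build the label; the int cast guards against the count coming back as a float.
result['Imputed'] = (result['n_imputed'].astype(int).astype(str) + '/'
                     + result['n_total'].astype(str) + ' Components Imputed')

# Round for display and keep only the requested columns, in the requested order.
result['Price'] = result['Price'].round(1)
result = result[['Region', 'Imputed', 'Year', 'Price']]
print(result)
```

Rounding changes the stored values, so it is probably best applied only at the end, after any further calculations on `Price`.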
Alternatively, with a single-pass aggregation that uses a {'Yes':1, 'No':0} mapping and string formatting:
# Wrap the chain in parentheses so it can span several lines
df = (df.groupby(['Region','Year'])
        .agg({'Price': 'mean',
              'Imputed': lambda x: f"{x.map({'Yes':1, 'No':0}).sum()}/{x.size}"})
        .reset_index())
print(df)
Region Year Price Imputed
0 Africa 2016 487.333333 2/3
1 Africa 2017 518.666667 1/3
2 Africa 2018 554.333333 1/3
3 Africa 2019 590.000000 0/3
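The result above holds only the fraction ("2/3") and places `Imputed` last. To get closer to the requested output, the label text can be added inside the f-string and the columns reordered afterwards. The following is a sketch under those assumptions; `imputed_label` and `out` are hypothetical names, and `out` is used so that the original `df` is not overwritten.

```python
import pandas as pd

# Sample data from the question, written compactly.
data1 = {
    'Region': ['Africa'] * 12,
    'Country': ['South Africa'] * 4 + ['Nigeria'] * 4 + ['Kenya'] * 4,
    'Imputed': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No',
                'Yes', 'No', 'No', 'No'],
    'Year': [2016, 2017, 2018, 2019] * 3,
    'Price': [500, 400, 432, 450, 750, 780, 816, 890, 212, 376, 415, 430],
}
df = pd.DataFrame(data1)


def imputed_label(s):
    """Return e.g. '2/3 Components Imputed' for one group's Imputed values."""
    n_yes = s.eq('Yes').sum()  # number of 'Yes' entries in the group
    return f"{n_yes}/{s.size} Components Imputed"


out = (
    df.groupby(['Region', 'Year'])
      .agg({'Price': 'mean', 'Imputed': imputed_label})
      .reset_index()
)

# Reorder the columns to match the requested layout.
out = out[['Region', 'Imputed', 'Year', 'Price']]
print(out)
```

A Python function is called once per group here, so on large data the boolean helper column with built-in `sum` and `size` aggregations is likely to be faster.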