Pandas impute values from grouped dataframe to another dataframe with an object
Question:
For the dataset I’m working with, I want to impute null values with medians grouped by country. I have created a grouped table (median_data in the code below), which contains all the median values by country.
I need to do this median calculation and imputation in separate steps since the end goal is to create an object with ‘fit’ and ‘transform’ methods, so that I can calculate medians based only on the train data, and impute to the test data.
Here’s dummy data I’m working with:
data = [['A', 10, 20, np.nan, np.nan, 50, 30], ['A', 2, 1, 5, np.nan, 34, 35], ['A', 13, 212, 3, 6, np.nan, 37],
['B', 120, 230, 53, np.nan, 63, 23], ['B', 22, 115, 15, 61, 4, 15], ['B', np.nan, 22, 12, np.nan, np.nan, 31],
['C', 105, 120, np.nan, 22, 520, 3], ['C', 26, 11, 15, np.nan, 34, 3], ['C', 13, np.nan, 13, 234, np.nan, 10],
['D', 101, 220, 654, 143, 634, 123], ['D', 32, 21, 61, 24, np.nan, 32], ['D', 11, 72, 23, np.nan, 534, 30]
]
df = pd.DataFrame(data, columns=['Country','col1','col2','col3','col4','col5','col6'])
median_data = df.groupby('Country').median().reset_index()
Currently not using an object, just trying to figure out how to do it. Using loops isn’t working. I tried different iterations, here’s where I’m at right now:
df_new = df.copy()
for country in median_data.Country:
country_data = median_data[median_data.Country == country].copy()
for col in median_data.columns[2:]:
df_new[col] = df_new[col].fillna(country_data[col])
The dataset, df:
The table with medians grouped by ‘Country’:
Result from the code above (clearly incorrect).
As an example, col4 for Country A should be [6, 6, 6], but the values I get are [6, 61, 6]:
Is there a way to separately calculate median values and impute them? Efficiency is not my primary concern, but an efficient solution would obviously be preferred.
Update: While this method seems to work initially, it doesn’t achieve what I need it to. I specifically need to be able to store the median values somehow and then use them to impute to the test set as well. Since the train and test sets are by definition of unequal sizes, using update or combine_first wouldn’t work, since the row numbers wouldn’t correspond and the imputed values would be wrong, even if somehow the number of rows was the same.
Update 2: The solution provided by @jezrael works, but with a catch.
median_data = df1.groupby('Country').median()
df2.update(df2[['Country']].merge(median_data, on='Country', how='left'), overwrite=False)
This code will work, but first the datasets need to have their indices reset with reset_index(drop=True, inplace=True). This works for now, but maybe resetting indices will mess with model training etc. To be seen.
Answers:
Use DataFrame.update
with GroupBy.transform
with median
:
df.update(df.groupby('Country').transform('median'), overwrite=False)
print (df)
Country col1 col2 col3 col4 col5 col6
0 A 10.0 20.0 4.0 6.0 50.0 30
1 A 2.0 1.0 5.0 6.0 34.0 35
2 A 13.0 212.0 3.0 6.0 42.0 37
3 B 120.0 230.0 53.0 61.0 63.0 23
4 B 22.0 115.0 15.0 61.0 4.0 15
5 B 71.0 22.0 12.0 61.0 33.5 31
6 C 105.0 120.0 14.0 22.0 520.0 3
7 C 26.0 11.0 15.0 128.0 34.0 3
8 C 13.0 65.5 13.0 234.0 277.0 10
9 D 101.0 220.0 654.0 143.0 634.0 123
10 D 32.0 21.0 61.0 24.0 584.0 32
11 D 11.0 72.0 23.0 83.5 534.0 30
Details:
print (df.groupby('Country').transform('median'))
col1 col2 col3 col4 col5 col6
0 10.0 20.0 4.0 6.0 42.0 35
1 10.0 20.0 4.0 6.0 42.0 35
2 10.0 20.0 4.0 6.0 42.0 35
3 71.0 115.0 15.0 61.0 33.5 23
4 71.0 115.0 15.0 61.0 33.5 23
5 71.0 115.0 15.0 61.0 33.5 23
6 26.0 65.5 14.0 128.0 277.0 3
7 26.0 65.5 14.0 128.0 277.0 3
8 26.0 65.5 14.0 128.0 277.0 3
9 32.0 72.0 61.0 83.5 584.0 32
10 32.0 72.0 61.0 83.5 584.0 32
11 32.0 72.0 61.0 83.5 584.0 32
Alternative solution with DataFrame.combine_first
:
df1 = df.combine_first(df.groupby('Country').transform('median'))
print (df1)
Country col1 col2 col3 col4 col5 col6
0 A 10.0 20.0 4.0 6.0 50.0 30
1 A 2.0 1.0 5.0 6.0 34.0 35
2 A 13.0 212.0 3.0 6.0 42.0 37
3 B 120.0 230.0 53.0 61.0 63.0 23
4 B 22.0 115.0 15.0 61.0 4.0 15
5 B 71.0 22.0 12.0 61.0 33.5 31
6 C 105.0 120.0 14.0 22.0 520.0 3
7 C 26.0 11.0 15.0 128.0 34.0 3
8 C 13.0 65.5 13.0 234.0 277.0 10
9 D 101.0 220.0 654.0 143.0 634.0 123
10 D 32.0 21.0 61.0 24.0 584.0 32
11 D 11.0 72.0 23.0 83.5 534.0 30
For the dataset I’m working with, I want to impute null values with medians grouped by country. I have created a grouped table (median_data in the code below), which contains all the median values by country.
I need to do this median calculation and imputation in separate steps since the end goal is to create an object with ‘fit’ and ‘transform’ methods, so that I can calculate medians based only on the train data, and impute to the test data.
Here’s dummy data I’m working with:
data = [['A', 10, 20, np.nan, np.nan, 50, 30], ['A', 2, 1, 5, np.nan, 34, 35], ['A', 13, 212, 3, 6, np.nan, 37],
['B', 120, 230, 53, np.nan, 63, 23], ['B', 22, 115, 15, 61, 4, 15], ['B', np.nan, 22, 12, np.nan, np.nan, 31],
['C', 105, 120, np.nan, 22, 520, 3], ['C', 26, 11, 15, np.nan, 34, 3], ['C', 13, np.nan, 13, 234, np.nan, 10],
['D', 101, 220, 654, 143, 634, 123], ['D', 32, 21, 61, 24, np.nan, 32], ['D', 11, 72, 23, np.nan, 534, 30]
]
df = pd.DataFrame(data, columns=['Country','col1','col2','col3','col4','col5','col6'])
median_data = df.groupby('Country').median().reset_index()
Currently not using an object, just trying to figure out how to do it. Using loops isn’t working. I tried different iterations, here’s where I’m at right now:
df_new = df.copy()
for country in median_data.Country:
country_data = median_data[median_data.Country == country].copy()
for col in median_data.columns[2:]:
df_new[col] = df_new[col].fillna(country_data[col])
The dataset, df:
The table with medians grouped by ‘Country’:
Result from the code above (clearly incorrect).
As an example, col4 for Country A should be [6, 6, 6], but the values I get are [6, 61, 6]:
Is there a way to separately calculate median values and impute them? Efficiency is not my primary concern, but an efficient solution would obviously be preferred.
Update: While this method seems to work initially, it doesn’t achieve what I need it to. I specifically need to be able to store the median values somehow and then use them to impute to the test set as well. Since the train and test sets are by definition of unequal sizes, using update or combine_first wouldn’t work, since the row numbers wouldn’t correspond and the imputed values would be wrong, even if somehow the number of rows was the same.
Update 2: The solution provided by @jezrael works, but with a catch.
median_data = df1.groupby('Country').median()
df2.update(df2[['Country']].merge(median_data, on='Country', how='left'), overwrite=False)
This code will work, but first the datasets need to have their indices reset with reset_index(drop=True, inplace=True). This works for now, but maybe resetting indices will mess with model training etc. To be seen.
Use DataFrame.update
with GroupBy.transform
with median
:
df.update(df.groupby('Country').transform('median'), overwrite=False)
print (df)
Country col1 col2 col3 col4 col5 col6
0 A 10.0 20.0 4.0 6.0 50.0 30
1 A 2.0 1.0 5.0 6.0 34.0 35
2 A 13.0 212.0 3.0 6.0 42.0 37
3 B 120.0 230.0 53.0 61.0 63.0 23
4 B 22.0 115.0 15.0 61.0 4.0 15
5 B 71.0 22.0 12.0 61.0 33.5 31
6 C 105.0 120.0 14.0 22.0 520.0 3
7 C 26.0 11.0 15.0 128.0 34.0 3
8 C 13.0 65.5 13.0 234.0 277.0 10
9 D 101.0 220.0 654.0 143.0 634.0 123
10 D 32.0 21.0 61.0 24.0 584.0 32
11 D 11.0 72.0 23.0 83.5 534.0 30
Details:
print (df.groupby('Country').transform('median'))
col1 col2 col3 col4 col5 col6
0 10.0 20.0 4.0 6.0 42.0 35
1 10.0 20.0 4.0 6.0 42.0 35
2 10.0 20.0 4.0 6.0 42.0 35
3 71.0 115.0 15.0 61.0 33.5 23
4 71.0 115.0 15.0 61.0 33.5 23
5 71.0 115.0 15.0 61.0 33.5 23
6 26.0 65.5 14.0 128.0 277.0 3
7 26.0 65.5 14.0 128.0 277.0 3
8 26.0 65.5 14.0 128.0 277.0 3
9 32.0 72.0 61.0 83.5 584.0 32
10 32.0 72.0 61.0 83.5 584.0 32
11 32.0 72.0 61.0 83.5 584.0 32
Alternative solution with DataFrame.combine_first
:
df1 = df.combine_first(df.groupby('Country').transform('median'))
print (df1)
Country col1 col2 col3 col4 col5 col6
0 A 10.0 20.0 4.0 6.0 50.0 30
1 A 2.0 1.0 5.0 6.0 34.0 35
2 A 13.0 212.0 3.0 6.0 42.0 37
3 B 120.0 230.0 53.0 61.0 63.0 23
4 B 22.0 115.0 15.0 61.0 4.0 15
5 B 71.0 22.0 12.0 61.0 33.5 31
6 C 105.0 120.0 14.0 22.0 520.0 3
7 C 26.0 11.0 15.0 128.0 34.0 3
8 C 13.0 65.5 13.0 234.0 277.0 10
9 D 101.0 220.0 654.0 143.0 634.0 123
10 D 32.0 21.0 61.0 24.0 584.0 32
11 D 11.0 72.0 23.0 83.5 534.0 30