Pandas impute values from grouped dataframe to another dataframe with an object

Question:

For the dataset I’m working with, I want to impute null values with medians grouped by country. I have created a grouped table (median_data in the code below), which contains all the median values by country.

I need to do this median calculation and imputation in separate steps since the end goal is to create an object with ‘fit’ and ‘transform’ methods, so that I can calculate medians based only on the train data, and impute to the test data.

Here’s dummy data I’m working with:

import numpy as np
import pandas as pd

data = [['A', 10, 20, np.nan, np.nan, 50, 30], ['A', 2, 1, 5, np.nan, 34, 35], ['A', 13, 212, 3, 6, np.nan, 37],
        ['B', 120, 230, 53, np.nan, 63, 23], ['B', 22, 115, 15, 61, 4, 15], ['B', np.nan, 22, 12, np.nan, np.nan, 31],
        ['C', 105, 120, np.nan, 22, 520, 3], ['C', 26, 11, 15, np.nan, 34, 3], ['C', 13, np.nan, 13, 234, np.nan, 10],
        ['D', 101, 220, 654, 143, 634, 123], ['D', 32, 21, 61, 24, np.nan, 32], ['D', 11, 72, 23, np.nan, 534, 30]
       ]
df = pd.DataFrame(data, columns=['Country','col1','col2','col3','col4','col5','col6'])


median_data = df.groupby('Country').median().reset_index()

I’m not using an object yet; for now I’m just trying to work out the imputation itself. Loops aren’t working. I’ve tried several variations, and this is where I am right now:

df_new = df.copy()
for country in median_data.Country:
    country_data = median_data[median_data.Country == country].copy()
    for col in median_data.columns[2:]:
        df_new[col] = df_new[col].fillna(country_data[col])

The dataset, df:

[image: the dataset df]

The table with medians grouped by ‘Country’:

[image: median_data, the medians grouped by Country]

Result from the code above (clearly incorrect).
As an example, col4 for Country A should be [6, 6, 6], but the values I get are [6, 61, 6]:

[image: result of the loop above]
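
The [6, 61, 6] result is consistent with how fillna treats a Series argument: it matches on index labels, not on Country. In median_data the row for B sits at index 1, so its col4 median (61) lands in row 1 of df_new, which belongs to country A. A minimal sketch of that behaviour, using made-up variable names (col4_a, b_median) rather than anything from the original code:

```python
import numpy as np
import pandas as pd

# col4 for the three country A rows (index 0, 1, 2), as in the question.
col4_a = pd.Series([np.nan, np.nan, 6.0], index=[0, 1, 2])

# B's col4 median as the loop sees it: a one-element Series at index 1,
# because B is the second row of median_data.
b_median = pd.Series([61.0], index=[1])

# fillna aligns on index labels: row 0 stays NaN, row 1 becomes 61.0.
result = col4_a.fillna(b_median)
print(result.tolist())
```

One way around this, as an assumption about what the loop was meant to do, is to restrict each assignment to the current country and fill with a scalar: with mask = df_new.Country == country, use df_new.loc[mask, col] = df_new.loc[mask, col].fillna(country_data[col].iloc[0]).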

Is there a way to separately calculate median values and impute them? Efficiency is not my primary concern, but an efficient solution would obviously be preferred.

Update: While this method seems to work at first, it doesn’t achieve what I need. I specifically need to store the median values and then use them to impute the test set as well. Since the train and test sets differ in size, update or combine_first with transform wouldn’t work: the rows wouldn’t correspond, and the imputed values would be wrong even if the number of rows happened to be the same.
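
The requirement above (store medians from the train set, then apply them to a test set of a different size) can be sketched as a small class. This is only an illustration under assumptions: GroupMedianImputer is a hypothetical name, and the lookup uses Series.map so that it does not depend on row positions or on how either frame is indexed.

```python
import numpy as np
import pandas as pd


class GroupMedianImputer:
    """Hypothetical helper: learn per-group medians in fit, apply them in transform."""

    def __init__(self, group_col):
        self.group_col = group_col

    def fit(self, X):
        # One row of medians per group, indexed by the group key.
        self.medians_ = X.groupby(self.group_col).median(numeric_only=True)
        return self

    def transform(self, X):
        out = X.copy()
        for col in self.medians_.columns:
            # Look up each row's stored group median. The result carries X's
            # own index, so fillna lines up row by row.
            fill = out[self.group_col].map(self.medians_[col])
            out[col] = out[col].fillna(fill)
        return out


train = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "col1": [1.0, 3.0, np.nan, 10.0, 30.0],
})
test = pd.DataFrame(
    {"Country": ["B", "A"], "col1": [np.nan, np.nan]},
    index=[101, 205],  # deliberately not 0..n-1
)

imputer = GroupMedianImputer("Country").fit(train)
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)
print(test_filled)
```

Groups that appear only in the test set have no stored median, so their missing values stay NaN in this sketch; a global fallback would need to be added separately.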

Update 2: The solution provided by @jezrael works, but with a catch.

median_data = df1.groupby('Country').median()
df2.update(df2[['Country']].merge(median_data, on='Country',  how='left'), overwrite=False)

This code works, but only after both datasets have had their indices reset with reset_index(drop=True, inplace=True): merge returns a frame with a fresh 0..n-1 index, and update aligns on index labels. This works for now, but resetting indices might interfere with model training etc. To be seen.
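
A possible way to avoid resetting indices, sketched here as an assumption rather than as part of the accepted solution, is to replace merge with DataFrame.join(on='Country'). join keeps the calling frame's index, so the later update still lines up with df2's original row labels. The small frames below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train (df1) and test (df2) sets.
df1 = pd.DataFrame({"Country": ["A", "A", "B", "B"],
                    "col1": [1.0, 3.0, 10.0, 30.0]})
df2 = pd.DataFrame({"Country": ["B", "A"], "col1": [np.nan, 5.0]},
                   index=[7, 9])  # non-default index on purpose

# Medians from the train set only, indexed by Country.
median_data = df1.groupby("Country").median()

# join looks up each row's Country in median_data's index and keeps
# df2's own row labels (7 and 9).
joined = df2[["Country"]].join(median_data, on="Country")

# overwrite=False fills only the values that are missing in df2.
df2.update(joined, overwrite=False)
print(df2)
```

Here only the missing col1 value for country B is filled from the stored medians; the existing value 5.0 is left untouched.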

Asked By: MushyMush

||

Answers:

Use DataFrame.update with GroupBy.transform('median'); overwrite=False makes update fill only the missing values:

df.update(df.groupby('Country').transform('median'), overwrite=False)
print(df)
   Country   col1   col2   col3   col4   col5  col6
0        A   10.0   20.0    4.0    6.0   50.0    30
1        A    2.0    1.0    5.0    6.0   34.0    35
2        A   13.0  212.0    3.0    6.0   42.0    37
3        B  120.0  230.0   53.0   61.0   63.0    23
4        B   22.0  115.0   15.0   61.0    4.0    15
5        B   71.0   22.0   12.0   61.0   33.5    31
6        C  105.0  120.0   14.0   22.0  520.0     3
7        C   26.0   11.0   15.0  128.0   34.0     3
8        C   13.0   65.5   13.0  234.0  277.0    10
9        D  101.0  220.0  654.0  143.0  634.0   123
10       D   32.0   21.0   61.0   24.0  584.0    32
11       D   11.0   72.0   23.0   83.5  534.0    30

Details:

print(df.groupby('Country').transform('median'))
    col1   col2  col3   col4   col5  col6
0   10.0   20.0   4.0    6.0   42.0    35
1   10.0   20.0   4.0    6.0   42.0    35
2   10.0   20.0   4.0    6.0   42.0    35
3   71.0  115.0  15.0   61.0   33.5    23
4   71.0  115.0  15.0   61.0   33.5    23
5   71.0  115.0  15.0   61.0   33.5    23
6   26.0   65.5  14.0  128.0  277.0     3
7   26.0   65.5  14.0  128.0  277.0     3
8   26.0   65.5  14.0  128.0  277.0     3
9   32.0   72.0  61.0   83.5  584.0    32
10  32.0   72.0  61.0   83.5  584.0    32
11  32.0   72.0  61.0   83.5  584.0    32

Alternative solution with DataFrame.combine_first:

df1 = df.combine_first(df.groupby('Country').transform('median'))
print(df1)
   Country   col1   col2   col3   col4   col5  col6
0        A   10.0   20.0    4.0    6.0   50.0    30
1        A    2.0    1.0    5.0    6.0   34.0    35
2        A   13.0  212.0    3.0    6.0   42.0    37
3        B  120.0  230.0   53.0   61.0   63.0    23
4        B   22.0  115.0   15.0   61.0    4.0    15
5        B   71.0   22.0   12.0   61.0   33.5    31
6        C  105.0  120.0   14.0   22.0  520.0     3
7        C   26.0   11.0   15.0  128.0   34.0     3
8        C   13.0   65.5   13.0  234.0  277.0    10
9        D  101.0  220.0  654.0  143.0  634.0   123
10       D   32.0   21.0   61.0   24.0  584.0    32
11       D   11.0   72.0   23.0   83.5  534.0    30
Answered By: jezrael