Replace missing values with the values from the row with the minimum sum of differences

Question:

I have the dataframe below.

import numpy as np
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

These are the columns whose missing values I want to replace:

case_columns = ['Age', 'Weight']

I want an algorithm in Python that replaces each missing value with the value from another row: the row with the minimum sum of differences, over the remaining columns, from the row that contains the missing value.

In my example, in row 0 the age should be 31 and the weight 100, because row 1 has the minimum difference, (1.65 - 1.64) + (19 - 15). In row 4 the age should be 43 and the weight 75.
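
The expected matches can be checked with a short sketch. It assumes absolute differences are intended, that only Height and BMI are compared, and that only rows without missing values can act as donors; compare_columns, donors and distances_to are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

# Assumption: rows are compared on the columns that have no missing values
compare_columns = ['Height', 'BMI']

# Assumption: only complete rows (1, 2 and 3 here) can supply values
donors = df.dropna()

def distances_to(row_label):
    """Sum of absolute differences between one row and each complete row."""
    diffs = donors[compare_columns] - df.loc[row_label, compare_columns]
    return diffs.abs().sum(axis=1)

nearest_to_0 = distances_to(0).idxmin()
nearest_to_4 = distances_to(4).idxmin()
print(nearest_to_0, nearest_to_4)  # prints: 1 3
```

Row 1 is closest to row 0 and row 3 is closest to row 4, which agrees with the values expected above.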

How can I do this in Python?

Asked By: OlgaE

Answers:

You can try writing a function and applying it to each row with df.apply():

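def fill_missing(x):
    # work on a copy so the row passed in by apply() is not modified in place
    x = x.copy()
    # if any value other than Height is missing in this row
    if x.drop('Height').isna().any():
        # candidate rows: every other row that has no missing values
        candidates = df.drop(x.name).dropna()
        # absolute Height difference to each candidate (only Height is compared here)
        height_diff = (candidates['Height'] - x['Height']).abs()
        # index of the row with the smallest absolute difference
        row_idx = height_diff.idxmin()
        # substitute whatever is missing
        for feature in x.index:
            if pd.isna(x[feature]):
                x[feature] = df.loc[row_idx, feature]
    return x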

# returns a new DataFrame; df itself is left unchanged
df.apply(fill_missing, axis=1)

# assign the result back if you want to update df
df = df.apply(fill_missing, axis=1)
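
The function above matches rows on Height only. A minimal sketch of a variant that sums the absolute differences over every column outside case_columns (Height and BMI in this example) could look like the following. It assumes the comparison columns contain no missing values and that at least one row has all of case_columns filled in; fill_from_nearest_row is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_from_nearest_row(df, case_columns):
    """Fill NaNs in case_columns from the row with the smallest sum of
    absolute differences over the remaining columns."""
    compare_columns = [c for c in df.columns if c not in case_columns]
    # only rows whose case_columns are all known can donate values
    donors = df.dropna(subset=case_columns)
    result = df.copy()
    incomplete = df[df[case_columns].isna().any(axis=1)]
    for label, row in incomplete.iterrows():
        # sum of absolute differences between this row and each donor row
        distances = (donors[compare_columns] - row[compare_columns]).abs().sum(axis=1)
        nearest = distances.idxmin()
        # keep values that are already present; copy only the missing ones
        filled_values = row[case_columns].fillna(donors.loc[nearest, case_columns])
        result.loc[label, case_columns] = filled_values.to_numpy()
    return result

df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

filled = fill_from_nearest_row(df, ['Age', 'Weight'])
print(filled)
```

With the sample data, row 0 takes Age 31 and Weight 100 from row 1, and row 4 takes Age 43 and Weight 75 from row 3. Because BMI spans a much larger range than Height, it dominates the sum; scaling the columns first may be worth considering if that is not intended.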


Answered By: Lucas Moura Gomes