Replacing missing values using pandas

Question:

{'Country': 'USA', 'Age': '52', 'Sal': '12345', 'OnWork': 'No'}
{'Country': 'UK', 'Age': '23', 'Sal': '1142', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '4456', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '25', 'Sal': '4456', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '?', 'Sal': '2345', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '3342', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '3452', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '?', 'Sal': '3562', 'OnWork': 'No'}

I need to replace each missing Age value with the mean Age of its “OnWork” group: the mean of the Yes group should go into the Age of the fifth row, and the mean of the No group should go into the last row.

df = pd.read_csv("Mycal.csv", na_values = missing_values, nrows=50)

Finding the overall mean and filling the missing values with it works:

df["F8"].fillna(df['F8'].mean(), inplace=True)

Here I am able to find the mean for one group, but I am not able to fill the missing values with it:

df[df["Class"]=="Yes"]["F8"].mean()

I expect the missing values in the Yes group to be filled with the Yes mean, and likewise for No. Kindly help me with this.

Asked By: Mariappan M

Answers:

Use mask and fillna as:

df['Age'] = df['Age'].mask(df['Age'].eq('?'), np.nan).astype(float)
df['Age'] = (df['Age'].fillna(df.groupby('OnWork')['Age'].transform(np.nanmean))
                      .astype(int))

print(df)
  Country  Age    Sal OnWork
0     USA   52  12345     No
1      UK   23   1142    Yes
2     MAL   25   4456     No
3     MAL   25   4456     No
4     MAL   24   2345    Yes
5     MAL   25   3342    Yes
6     MAL   25   3452     No
7     MAL   31   3562     No
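
For reference, the same idea can be written as a self-contained sketch. It assumes the sample data from the question and swaps in pd.to_numeric with errors="coerce" in place of mask, so it is a variation on the code above rather than a copy of it:

```python
import pandas as pd

# Sample data from the question; every value starts out as a string.
df = pd.DataFrame({
    "Country": ["USA", "UK", "MAL", "MAL", "MAL", "MAL", "MAL", "MAL"],
    "Age": ["52", "23", "25", "25", "?", "25", "25", "?"],
    "Sal": ["12345", "1142", "4456", "4456", "2345", "3342", "3452", "3562"],
    "OnWork": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
})

# Anything that cannot be parsed as a number (here '?') becomes NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# For every row, the mean Age of that row's OnWork group (NaN is skipped).
group_mean = df.groupby("OnWork")["Age"].transform("mean")

# Fill only the missing ages, then truncate to int as in the code above.
df["Age"] = df["Age"].fillna(group_mean).astype(int)
```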

If you want to fill the missing values in several columns at once, use:

df = df.fillna(df.groupby('OnWork').transform('mean'))
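
On recent pandas versions, taking the mean of non-numeric columns such as Country may raise an error, so this one-liner may need to be limited to numeric columns. A minimal sketch, assuming Age and Sal have already been converted to numbers (the num_cols list and the toy values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration) with gaps in two numeric columns.
df = pd.DataFrame({
    "OnWork": ["No", "Yes", "No", "Yes", "No"],
    "Age": [52.0, 23.0, np.nan, 25.0, 26.0],
    "Sal": [1000.0, np.nan, 3000.0, 2000.0, 2000.0],
})

num_cols = ["Age", "Sal"]  # hypothetical name: the numeric columns to fill

# Group means broadcast back to the original row order, numeric columns only.
group_means = df.groupby("OnWork")[num_cols].transform("mean")

# fillna with a DataFrame aligns on both index and column labels.
df[num_cols] = df[num_cols].fillna(group_means)
```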
Answered By: Space Impact

If you mean to replace missing values with the average of each group, here is one solution:

df_mean = df.groupby('Class')['F8'].mean().reset_index()
df_mean.columns = ['Class','F8_mean']
df = pd.merge(df, df_mean, on='Class', how='left')
df.loc[df['F8'].isnull(), 'F8'] = df['F8_mean']
df.drop('F8_mean', axis=1, inplace=True)
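
A shorter variant of the same idea avoids the temporary column by mapping each row's group label to its group mean. This is only a sketch with made-up values; the column names Class and F8 are the ones from the question:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Class": ["Yes", "No", "Yes", "No", "Yes"],
    "F8": [10.0, 4.0, np.nan, np.nan, 20.0],
})

# One mean per group, indexed by the group label.
group_means = df.groupby("Class")["F8"].mean()

# Map each row's Class to its group mean, then fill only the gaps.
df["F8"] = df["F8"].fillna(df["Class"].map(group_means))
```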
Answered By: Dmitry Efimov

# Import libraries
import pandas as pd
import numpy as np

# Data dictionary
data_dict = {
    'Country': ['USA','UK','MAL','MAL','MAL','MAL','MAL','MAL'],
    'Age': ['52','23','25','25','?','25','25','?'],
    'Sal': ['12345','1142','4456','4456','2345','3342','3452','3562'],
    'OnWork': ['No','Yes','No','No','Yes','Yes','No','No'],
}

# Convert dictionary to dataframe
df = pd.DataFrame(data_dict)

# print input df
print(df)

  Country Age    Sal OnWork
0     USA  52  12345     No
1      UK  23   1142    Yes
2     MAL  25   4456     No
3     MAL  25   4456     No
4     MAL   ?   2345    Yes
5     MAL  25   3342    Yes
6     MAL  25   3452     No
7     MAL   ?   3562     No

# Replace '?' values with NaN
df.Age=df.Age.where(df.Age!='?')

# Convert string values to numeric
df["Age"] = pd.to_numeric(df["Age"])

# Get the mean Age of each OnWork group (truncated to int)
mean_list = df.groupby('OnWork')['Age'].mean().astype(int)

# print mean values 
print(mean_list)

No     31
Yes    24

# Replace each missing Age with the mean of the row's OnWork group
df['Age'] = df.apply(
    lambda row: mean_list[row['OnWork']] if pd.isna(row['Age']) else row['Age'],
    axis=1
)

# print final df
print(df)

  Country   Age    Sal OnWork
0     USA  52.0  12345     No
1      UK  23.0   1142    Yes
2     MAL  25.0   4456     No
3     MAL  25.0   4456     No
4     MAL  24.0   2345    Yes
5     MAL  25.0   3342    Yes
6     MAL  25.0   3452     No
7     MAL  31.0   3562     No

Answered By: Dinesh Sandaruwan

I would start by adjusting the dataframe:

  1. Replace the ? with numpy.nan:

    df.replace('?', np.nan, inplace=True)
    
  2. Convert the Age column to numeric with pandas.to_numeric:

    df['Age'] = pd.to_numeric(df['Age'])
    

Then, with those changes, one can use pandas.DataFrame.groupby and the groupby transform method with a custom lambda function, as follows:

df['Age'] = df.groupby('OnWork')['Age'].transform(lambda x: x.fillna(x.mean())).astype('int')

[Out]:

  Country  Age    Sal OnWork
0     USA   52  12345     No
1      UK   23   1142    Yes
2     MAL   25   4456     No
3     MAL   25   4456     No
4     MAL   24   2345    Yes
5     MAL   25   3342    Yes
6     MAL   25   3452     No
7     MAL   31   3562     No

Notes:

  • .astype('int') is to make sure that the column Age is of integer type.
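
Note that the cast truncates rather than rounds, which is why the No group's mean of 31.75 shows up as 31. If rounding to the nearest integer is preferred, one option is to round before casting. A minimal sketch, using a toy Series that mirrors the ages of the No group:

```python
import numpy as np
import pandas as pd

# Ages of the No group from the question, with the missing value as NaN.
ages = pd.Series([52.0, 25.0, 25.0, 25.0, np.nan])

# The mean of the known values is 31.75.
filled = ages.fillna(ages.mean())

truncated = filled.astype("int")         # casting drops the fractional part
rounded = filled.round().astype("int")   # round to nearest first, then cast
```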
Answered By: Gonçalo Peres