Replacing missing values using pandas
Question:
{'Country': 'USA', 'Age': '52', 'Sal': '12345', 'OnWork': 'No'}
{'Country': 'UK', 'Age': '23', 'Sal': '1142', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '4456', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '25', 'Sal': '4456', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '?', 'Sal': '2345', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '3342', 'OnWork': 'Yes'}
{'Country': 'MAL', 'Age': '25', 'Sal': '3452', 'OnWork': 'No'}
{'Country': 'MAL', 'Age': '?', 'Sal': '3562', 'OnWork': 'No'}
I have to replace the missing Age values with a mean that depends on the “OnWork” value. The mean of the Yes group should go into the Age of row 5, and the mean of the No group should go into the last row.
df = pd.read_csv("Mycal.csv", na_values = missing_values, nrows=50)
Finding the overall mean and filling with it (this is working):
df["F8"].fillna(df['F8'].mean(), inplace=True)
Here I am able to find the mean value for one group; however, I am not able to fill the missing values with it:
df[df["Class"]=="Yes"]["F8"].mean()
I expect the missing values in the Yes group to be filled with the mean of the Yes group, and likewise for the No group. Kindly help me with this.
Answers:
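import numpy as np
import pandas as pd
# Turn the '?' placeholder into NaN and make the column numeric
df['Age'] = df['Age'].mask(df['Age'].eq('?'), np.nan).astype(float)
# Fill each NaN with the mean of its OnWork group ('mean' skips NaN), then truncate to int
df['Age'] = df['Age'].fillna(df.groupby('OnWork')['Age'].transform('mean')).astype(int)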
print(df)
  Country  Age    Sal OnWork
0     USA   52  12345     No
1      UK   23   1142    Yes
2     MAL   25   4456     No
3     MAL   25   4456     No
4     MAL   24   2345    Yes
5     MAL   25   3342    Yes
6     MAL   25   3452     No
7     MAL   31   3562     No
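If you want to fill several columns at once, restrict the operation to the numeric columns (recent pandas versions raise an error when asked for the mean of string columns):
num_cols = df.select_dtypes('number').columns.tolist()
df[num_cols] = df[num_cols].fillna(df.groupby('OnWork')[num_cols].transform('mean'))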
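As a rough, self-contained sketch of the multi-column fill: the frame below reuses the question's data, but the extra NaN in Sal and the explicit `num_cols` list are assumptions added here so that more than one column needs filling.

```python
import numpy as np
import pandas as pd

# Sample frame based on the question's data. The NaN in 'Sal' (row 5) is an
# assumption added for illustration; the original data has no missing salary.
df = pd.DataFrame({
    'Country': ['USA', 'UK', 'MAL', 'MAL', 'MAL', 'MAL', 'MAL', 'MAL'],
    'Age': [52, 23, 25, 25, np.nan, 25, 25, np.nan],
    'Sal': [12345, 1142, 4456, 4456, 2345, np.nan, 3452, 3562],
    'OnWork': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# Columns to fill (hypothetical choice for this example)
num_cols = ['Age', 'Sal']

# Per-row group means, aligned with df's index; NaN values are skipped
group_means = df.groupby('OnWork')[num_cols].transform('mean')

# Only the missing cells are replaced; existing values stay untouched
df[num_cols] = df[num_cols].fillna(group_means)
print(df)
```

Because `transform` returns a frame with the same index as `df`, `fillna` can line the group means up with the missing cells directly.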
If you mean to replace missing values with the average of each group, here is one solution:
df_mean = df.groupby('Class')['F8'].mean().reset_index()
df_mean.columns = ['Class','F8_mean']
df = pd.merge(df, df_mean, on='Class', how='left')
df.loc[df['F8'].isnull(), 'F8'] = df['F8_mean']
df.drop('F8_mean', axis=1, inplace=True)
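The merge above can probably be shortened by mapping each row's group label to its group mean. This is only a sketch: the column names Class and F8 come from the question, while the values are an assumption (the Age and OnWork sample data reused under those names).

```python
import numpy as np
import pandas as pd

# Assumed sample values; only the column names come from the question
df = pd.DataFrame({
    'Class': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
    'F8': [52, 23, 25, 25, np.nan, 25, 25, np.nan],
})

# One mean per group, indexed by the group label
group_means = df.groupby('Class')['F8'].mean()

# Look up each row's group mean and use it only where F8 is missing
df['F8'] = df['F8'].fillna(df['Class'].map(group_means))
print(df)
```

This avoids the temporary F8_mean column, at the cost of handling a single column per call.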
# Import libraries
import pandas as pd
import numpy as np
# Data dictionary
data_dict = {
    'Country': ['USA', 'UK', 'MAL', 'MAL', 'MAL', 'MAL', 'MAL', 'MAL'],
    'Age': ['52', '23', '25', '25', '?', '25', '25', '?'],
    'Sal': ['12345', '1142', '4456', '4456', '2345', '3342', '3452', '3562'],
    'OnWork': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
}
# Convert dictionary to dataframe
df = pd.DataFrame(data_dict)
# print input df
print(df)
  Country Age    Sal OnWork
0     USA  52  12345     No
1      UK  23   1142    Yes
2     MAL  25   4456     No
3     MAL  25   4456     No
4     MAL   ?   2345    Yes
5     MAL  25   3342    Yes
6     MAL  25   3452     No
7     MAL   ?   3562     No
# Replace '?' values with NaN
df['Age'] = df['Age'].where(df['Age'] != '?')
# Convert string values to numeric
df["Age"] = pd.to_numeric(df["Age"])
# Get the mean age of each OnWork group, truncated to an integer
mean_list = df.groupby('OnWork')['Age'].mean().astype(int)
# Print the mean values
print(mean_list)
No     31
Yes    24
# Replace each missing age with the mean of the row's OnWork group
df['Age'] = df.apply(
    lambda row: mean_list['Yes'] if np.isnan(row['Age']) and row['OnWork'] == 'Yes'
    else mean_list['No'] if np.isnan(row['Age']) and row['OnWork'] == 'No'
    else row['Age'],
    axis=1
)
# print final df
print(df)
  Country   Age    Sal OnWork
0     USA  52.0  12345     No
1      UK  23.0   1142    Yes
2     MAL  25.0   4456     No
3     MAL  25.0   4456     No
4     MAL  24.0   2345    Yes
5     MAL  25.0   3342    Yes
6     MAL  25.0   3452     No
7     MAL  31.0   3562     No
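If the data comes from a CSV file, as in the question, the '?' placeholder can likely be turned into NaN while reading, which makes the separate replacement step unnecessary. The sketch below assumes the file looks like the sample data; `csv_text` and `io.StringIO` stand in for the real "Mycal.csv", and `['?']` is an assumed value for the question's undefined `missing_values`.

```python
import io
import pandas as pd

# Stand-in for the CSV file; the layout is assumed from the question's rows
csv_text = """Country,Age,Sal,OnWork
USA,52,12345,No
UK,23,1142,Yes
MAL,25,4456,No
MAL,25,4456,No
MAL,?,2345,Yes
MAL,25,3342,Yes
MAL,25,3452,No
MAL,?,3562,No
"""

# na_values tells read_csv to treat '?' as missing, so Age is parsed as float
df = pd.read_csv(io.StringIO(csv_text), na_values=['?'])
n_missing = df['Age'].isna().sum()

# Fill each missing age with the mean of its OnWork group
df['Age'] = df['Age'].fillna(df.groupby('OnWork')['Age'].transform('mean'))
print(df)
```

With the placeholder handled at load time, Age is already numeric and the group-wise fill is a single line.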
I would start by adjusting the dataframe:
- Replace the ? with numpy.nan:
df.replace('?', np.nan, inplace=True)
- Convert the Age column to numeric with pandas.to_numeric:
df['Age'] = pd.to_numeric(df['Age'])
Then, with those changes, one can use pandas.DataFrame.groupby and the groupby transform method with a custom lambda function as follows:
df['Age'] = df.groupby('OnWork')['Age'].transform(lambda x: x.fillna(x.mean())).astype('int')
[Out]:
  Country  Age    Sal OnWork
0     USA   52  12345     No
1      UK   23   1142    Yes
2     MAL   25   4456     No
3     MAL   25   4456     No
4     MAL   24   2345    Yes
5     MAL   25   3342    Yes
6     MAL   25   3452     No
7     MAL   31   3562     No
Notes:
- .astype('int') makes sure that the column Age is of integer type. It truncates rather than rounds, so the No-group mean of 31.75 becomes 31.
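To illustrate the note on integer conversion, here is a small sketch comparing truncation with rounding. The series holds the four known ages of the No group plus one missing value; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Ages of the 'No' group from the question, with one missing value
ages = pd.Series([52, 25, 25, 25, np.nan])

# The mean of the known values is 31.75
filled = ages.fillna(ages.mean())

# astype(int) drops the fractional part
truncated = filled.astype(int)

# Rounding first gives the nearest integer instead
rounded = filled.round().astype(int)

print(truncated.iloc[-1], rounded.iloc[-1])
```

Whether 31 or 32 is the better fill depends on the use case; rounding before the cast is one option if the nearest whole age is wanted.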