DataFrame: impute column with the median value of each category
Question:
My dataset looks like:
df = pd.DataFrame(
{'sensor1': [1.94,0.93,0.98,1.75,1.75,3.25,0.5,0.5,5.59,6.02,9.21,4.54,3.71,1.05],
'sensor2': [-0.91,0.42,-0.11,0.0,0.0,-0.12,0.0,0.0,0.48,0.26,-1.5,-0.75,-1.45,0.06],
'sensor3': [18,19,20,-2094,-2094,17,17,17,-985,-985,1163,1163,1163,-1265],
'type': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]}
)
df.head()
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 -2094 1
4 1.75 0.00 -2094 1
I negative value from output of sensor3
means invalid. I would like to impute all invalid values in that column, using the median
value of each type
category.
Required
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 19 1
4 1.75 0.00 19 1
5 3.25 -0.12 17 2
6 0.50 0.00 17 2
7 0.50 0.00 17 2
8 5.59 0.48 17 2
9 6.02 0.26 17 2
10 9.21 -1.50 1163 3
11 4.54 -0.75 1163 3
12 3.71 -1.45 1163 3
13 1.05 0.06 1163 3
Answers:
Create mask for identify invalid values, here less like 0
by Series.lt
, then filter rows with inverted mask and aggregate median
, last filter rows in DataFrame.loc
and mapping by medians by Series.map
:
m = df['sensor3'].lt(0)
df.loc[m, 'sensor3'] = df.loc[m, 'type'].map(df[~m].groupby('type')['sensor3'].median())
print (df)
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 19 1
4 1.75 0.00 19 1
5 3.25 -0.12 17 2
6 0.50 0.00 17 2
7 0.50 0.00 17 2
8 5.59 0.48 17 2
9 6.02 0.26 17 2
10 9.21 -1.50 1163 3
11 4.54 -0.75 1163 3
12 3.71 -1.45 1163 3
13 1.05 0.06 1163 3
My dataset looks like:
df = pd.DataFrame(
{'sensor1': [1.94,0.93,0.98,1.75,1.75,3.25,0.5,0.5,5.59,6.02,9.21,4.54,3.71,1.05],
'sensor2': [-0.91,0.42,-0.11,0.0,0.0,-0.12,0.0,0.0,0.48,0.26,-1.5,-0.75,-1.45,0.06],
'sensor3': [18,19,20,-2094,-2094,17,17,17,-985,-985,1163,1163,1163,-1265],
'type': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]}
)
df.head()
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 -2094 1
4 1.75 0.00 -2094 1
I negative value from output of sensor3
means invalid. I would like to impute all invalid values in that column, using the median
value of each type
category.
Required
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 19 1
4 1.75 0.00 19 1
5 3.25 -0.12 17 2
6 0.50 0.00 17 2
7 0.50 0.00 17 2
8 5.59 0.48 17 2
9 6.02 0.26 17 2
10 9.21 -1.50 1163 3
11 4.54 -0.75 1163 3
12 3.71 -1.45 1163 3
13 1.05 0.06 1163 3
Create mask for identify invalid values, here less like 0
by Series.lt
, then filter rows with inverted mask and aggregate median
, last filter rows in DataFrame.loc
and mapping by medians by Series.map
:
m = df['sensor3'].lt(0)
df.loc[m, 'sensor3'] = df.loc[m, 'type'].map(df[~m].groupby('type')['sensor3'].median())
print (df)
sensor1 sensor2 sensor3 type
0 1.94 -0.91 18 1
1 0.93 0.42 19 1
2 0.98 -0.11 20 1
3 1.75 0.00 19 1
4 1.75 0.00 19 1
5 3.25 -0.12 17 2
6 0.50 0.00 17 2
7 0.50 0.00 17 2
8 5.59 0.48 17 2
9 6.02 0.26 17 2
10 9.21 -1.50 1163 3
11 4.54 -0.75 1163 3
12 3.71 -1.45 1163 3
13 1.05 0.06 1163 3